How Microsoft moved from System Center Operations Manager to Azure monitoring

(music) >> Hello everyone, welcome to today’s webinar on how Microsoft moved from System Center Operations Manager to Azure monitoring. My name is Pete Apple and I will be your host for today’s session. I’m an IT service engineer at Microsoft. I’ve been working in IT for the last 10 years, and for the last five I’ve been working on our Azure programs, helping our applications move from our data centers into Azure and the cloud. Joining me today are a set of experts who are going to be talking about our journey, and I wanted to allow them to introduce themselves, so Dana, let’s start with you >> My name is Dana Baxter, I’m the manageability service manager for the infrastructure monitoring team in CS & EO >> And I’m Lionel Godolphin, I’m on the shared services team in sales and marketing engineering, formerly known as the monitoring guy >> And I’m Joe Pirelli, service engineer, also in sales and marketing, working on telemetry and reporting >> Okay, great, glad you all could make it. Before we get started, I’d like to tell everybody out in the audience to submit questions through the Q&A window any time during the conversation. I’ll be on the lookout for the questions and I’ll throw them over to our experts so you can find out the information you want to understand. In case we do run out of time and can’t get to all your questions, we’ll stay behind and post them with the on-demand webinar. If there’s time at the end, the experts are gonna be able to share some key takeaways. All right, let’s get started. So let’s talk about what we’re gonna cover today. We’re gonna talk about core monitoring and decentralizing from the central IT team that had been doing most of the monitoring day to day, then talk a little bit about moving from a central to a federated culture, so understanding the impacts on the various teams as we went through the journey, and then talk a bit about data and reporting and getting the right information into the right hands so people know what’s happening with their environments. All right, I think we’re gonna start off with Dana >> That’s right, so on this first slide I just wanted to talk about the scenario that we dealt with as the infrastructure core monitoring team. A few years ago, our leadership team came to us and gave us direction to do two things. One was to move from SCOM infrastructure to Azure native services, and the other was to decentralize our team and move to a DevOps model. When you go through this process of figuring out how you’re gonna move from your infrastructure to Azure services, you may have seen this image before: it’s a funnel, basically an upside down funnel, for the decision making process of doing your migration. The first thing you do is look at your application and decide if you need to delete it. Do you really need that application anymore?
So your first choice should be to get rid of that application. The second choice when doing migration is to see if you can move it to an Azure service, a SaaS service, so you’re looking to get high up in that Azure stack. The third choice would be to move to a PaaS service. Your fourth choice would be to move it directly into IaaS infrastructure, and then lastly, what you don’t necessarily want to do but sometimes have to, is leave your application on prem. So that’s the model that we follow when we’re thinking about how we’re going to migrate our services from on prem infrastructure into Azure features. The second thing was decentralizing our core services into a DevOps model, so basically that is changing our thinking from a large centralized operations team doing all the work for all of our engineering teams to moving that functionality out into the engineering teams so that they can do the work themselves >> So these are all the teams that are running and writing the applications and doing the line of business work >> Right, and that really takes a lot of support from our leadership teams, because it’s a big ask to move from a centralized service to going to our engineering teams and asking them to take on this work themselves. But that’s part of the reason why we were looking at how far we could get up in that stack, so rather than transitioning SCOM infrastructure out to our engineering teams, we’re asking them to instead adopt native Azure features, which are easier to manage and use than it would be to deploy their own SCOM infrastructure >> That also allows them to deploy it with their code a little bit easier as well, right? And lets the SMEs decide what the monitors are >> True, true. So on our next slide, if we could go back to the slide, these are some of the considerations that we looked at when we started thinking about doing the migration. The first thing is that with SCOM we have management packs, which make it a little easier for people

to choose what to monitor, as opposed to when you move to Azure alerts, where we don’t have that same kind of built-in management pack feature, so that’s something that we needed to consider. We had a lot of legacy alerting based on outdated criteria; previously in SCOM we had been monitoring for hardware failures and that kind of thing, whereas if you move to Azure alerts, you’re not necessarily looking for hardware failures, although you may be doing some monitoring of hybrid environments. And then another piece of that is you could have years and years of alerts built up in your SCOM system. Say you started monitoring something a few years ago based on an outage; you’ve since fixed that problem but you never went back in and removed that alert from your system, so there was a lot of legacy alerting that we had to look at when we were thinking about transitioning. For our team, we’re just infrastructure, so we do SQL and infrastructure core monitoring, so for us application monitoring was out of scope. Next, our focus was just on migration, so we wanted to look at those legacy alerts that we had configured. We’re not necessarily looking to do a lot of improvements or change, but just to remove those outdated alerts and see how we can move those from SCOM to Azure alerts. And then we wanted to reduce the impact on the application teams, so as we moved from that centralized model where we were doing a lot of that core monitoring ourselves and transitioned it to the DevOps model and the engineering teams, we didn’t want to put a big burden on them, although there would be impact. We wanted to see ways that we could help them. And then finally, we use an internal ticketing system we call ICM, and we wanted to reference it here because you’re going to hear ICM throughout our presentation, and we just wanted you to be aware that that’s the name of our internal ticketing system. So, the steps for migration: basically it’s a pretty simple process and it’s rinse and repeat. We’ve done this now with several of our services; we did this with SQL, we did this with core monitoring, we did this with backups. So let’s start at the top, assessing the current environment. With those ticketing systems, what you really want to do is look at all the alerts that you have received already and start thinking about the scenarios that were around those alerts. Did you have one alert that caused an outage, but it just happened once six months ago, or did you get an alert 100 times last week that you closed every time as no problem found?
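To make that assessment step concrete, here is a minimal sketch of the kind of triage you could run over an alert history export before deciding what to carry into Azure. The CSV file and its column names (AlertName, Severity, Resolution) are illustrative assumptions, not the team’s actual ICM tooling.

```python
import csv
from collections import defaultdict

# Hypothetical export of recent alerts from the ticketing system.
# Assumed columns for illustration: AlertName, Severity, Resolution
ALERT_EXPORT = "alert_history.csv"

stats = defaultdict(lambda: {"count": 0, "noise": 0, "worst_sev": 4})

with open(ALERT_EXPORT, newline="") as f:
    for row in csv.DictReader(f):
        s = stats[row["AlertName"]]
        s["count"] += 1
        # Tickets repeatedly closed as "no problem found" are candidates to drop.
        if row["Resolution"].strip().lower() == "no problem found":
            s["noise"] += 1
        # Lower number = more severe (Sev 0 is worst); track the worst seen.
        s["worst_sev"] = min(s["worst_sev"], int(row["Severity"]))

# Rank by how often each alert fired and flag the pure-noise ones for removal.
for name, s in sorted(stats.items(), key=lambda kv: kv[1]["count"], reverse=True):
    noise_ratio = s["noise"] / s["count"]
    verdict = "drop (noise)" if noise_ratio > 0.9 else "review / carry forward"
    print(f"{name}: fired {s['count']}x, worst Sev{s['worst_sev']}, "
          f"{noise_ratio:.0%} no-problem-found -> {verdict}")
```

The output is just a ranked list, so the team owning the alerts can argue about what to keep instead of digging through raw tickets.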
You could have different severities, frequencies, and types, and that’s really important when you’re evaluating what alerts from SCOM you want to move, because if you have these legacy alerts in there, or if you have something that’s repeatedly coming up in your ticketing system that doesn’t add any value, you want to scrape those out. So it’s about looking for the highest value alerts, basically, that you’re already capturing, and making sure that you can duplicate those in the Azure alert system. Once you kinda get that figured out and your alert list cleaned up, next you need to go in and design and create your alerts in Azure. With that, there is a little bit of a learning curve, because you’re writing queries in Log Analytics using KQL, but there’s a company out there called Pluralsight which we worked with to offer a free class on how to write KQL, so that can help you get started; I’d really recommend going to the Pluralsight page and taking that class. If you have any existing SQL skills or any other scripting skills, they’re gonna translate really well, so that shouldn’t be too difficult, but designing and creating your alerts in Azure does require that ability, and again, you’re gonna base that off of the alerts that you prioritized in the first part of the exercise. So, once we did that, we actually reviewed about 100 alerts from core infrastructure and SQL monitoring and worked them down to about 15 each, so about 15 SQL alerts and 15 infrastructure alerts, and that’s what we created in Azure. We did a period of testing on those, and then we created automation and documentation: we created a script that would create those alerts for people and then gave them documentation on how to run those scripts, because again, we wanted to reduce the impact on the application teams, and giving them a script that creates the alerts for them can help people get started a little bit easier >> So, we kind of created a toolkit for them. Taught them to fish >> Yes, we taught them to fish. We give them the toolkit and then they can modify it and add other alerts to it if they want to, or just take it for what it is and then grow it from there
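As a taste of the KQL involved in that design step, here is a minimal sketch that runs the sort of query a log alert rule could be built on: machines whose heartbeat hasn’t been seen recently, pulled from a Log Analytics workspace with the azure-monitor-query and azure-identity Python packages. The workspace ID and the 15-minute threshold are illustrative assumptions, not taken from the team’s toolkit.

```python
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

WORKSPACE_ID = "<log-analytics-workspace-guid>"  # placeholder

# KQL similar to what a scheduled log alert might evaluate: agents that have
# not reported a heartbeat in the last 15 minutes.
QUERY = """
Heartbeat
| summarize LastHeartbeat = max(TimeGenerated) by Computer
| where LastHeartbeat < ago(15m)
| project Computer, LastHeartbeat
"""

client = LogsQueryClient(DefaultAzureCredential())
response = client.query_workspace(WORKSPACE_ID, QUERY, timespan=timedelta(days=1))

for table in response.tables:
    for computer, last_seen in table.rows:
        print(f"No recent heartbeat from {computer} (last seen {last_seen})")
```

In the toolkit Dana demos next, queries like this are wrapped in scripted alert rule creation rather than run ad hoc, which is what lets a team stand up a baseline set of alerts in minutes.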

And then once we had that toolkit and we had this planned to go out and tell people, hey, our centralized team, we’re no longer going to be doing all this monitoring on your behalf and now this responsibility is being transitioned to the engineering teams, we needed to do a lot of communication and training to help people with that, so there was a lot of one on ones, brown bags, emails, we have a lot of newsletters and this kind of thing that goes out Did a lot of working with our teams to make sure that they understood that one, they were going to be responsible for doing the monitoring going forward after the date was set, and two that we were giving them some technology that was gonna make it easier for them to actually adopt the services And then finally we disabled the services and then decommissioned our infrastructure, so like I said, we’ve done this a few times on a couple of different services, we’ve done it for backups, we did it for SQL and we did it for core infrastructure monitoring So I’m just gonna show a demo really quick This is basically, this is just a demo of our toolkit that we have, so what you’re gonna see here is an empty workspace, this toolkit assumes that a workspace is already created Here we’re executing the script, we’ve passed in parameters, logging into Azure and you’re gonna see that in this particular toolkit, nine alerts are gonna get created and here, the reason why I wanted to show this script running is because if you would have thought a couple of years ago, if you would have said, okay, we’re gonna decentralize our core monitoring service and now the engineering teams are going to have to start monitoring themselves, that would have entailed deploying infrastructure, installing SCOM, having SCOM expertise, knowing about management packs, being able to set up all the monitoring themselves, when today, if I just give someone a toolkit, they can set up monitoring in a couple of minutes and our team, it’s really important to us to reduce that impact but give people guardrails and a place to start So with this toolkit, it just makes it a lot easier, it just lowers the entry point and it really shows how much it’s different from using SCOM infrastructure whereas you can set up nine alerts in two minutes rather than a whole entire process of deploying SCOM infrastructure and having a SCOM expert and all this kind of stuff You have an application engineer or a developer who can understand the kind of alerts that you need to monitor for your application and they can very easily put those alerts into the toolkit and deploy this over and over and then the other thing is that we’ve put this out on GitHub, so we’re looking to share this kind of functionality where it’s really easy to, as an operations team, as a central operations team, rather than doing this monitoring for people, we can do things like create these toolkits with the alerts already built in and make it very fast for other people to adopt So here you see back on the workspace screen that those nine alerts are just created and there you go, it was as easy as running that script to get some monitoring in place for the servers that are on this subscription >> And I noticed as it was creating it that it is in a very OS related, these things you would expect out of core monitoring, in fact, even some HD related things for a particular piece of hardware, so it looks like you can monitor both virtual machines but then also hardware systems as well >> Yeah, so with our environment right now, where I think 
we’re about 60/40 IaaS and then on prem VMs and so it’s still important to us to have monitoring for on prem servers for actual hardware, so this particular toolkit allows for monitoring a hybrid environment >> Nice I know that’s important for a lot of enterprises >> Yeah, yeah, we’re not all Azure yet either, so we still have to look at that kind of stuff And so as far as the toolkits go, the next steps, I think I mentioned earlier that it was basically just a parity activity where we were looking to prioritize our alerts and then make sure that those alerts were ready in the toolkits and in Azure, but going forward, we’re going to be doing improvements to those toolkits and alerts, doing some fine tuning and then looking into doing some automated remediation, and then we’re working with partners on getting feedback on what kind of alerts are important to them so that we can incorporate the toolkit and hopefully get it to a point where it does provide some best practice and guidance for people who want to start out using Azure alerts for the first time >> So I have actually a couple questions Wondered how many hours do you think it took to do the alerts or creating an alert? Somebody wanted to create an alert, how long do you think it would take, is it looking at the event log,

understanding what particular event you want to trigger on and then creating the code? >> Yeah, because a lot of it was just translating what was happening in the management packs that we had selected before into how do we find those out of the event logs and Log Analytics >> Okay, and then another one was if there’s some sort of bigger outage or network outage, does that still go to central, or do the individual teams work on those as well? >> Well, let’s take the network team for instance; they do the monitoring of their own devices, so you could sort of think of the network team as an individual engineering team, so in the DevOps model the network team is continuing to monitor the network and the network devices >> Gotcha. So each of the functional areas basically becomes their own DevOps team >> Yeah, so an application team will monitor their own application and they might also monitor network related events that are related to their own stuff, but then yes, the network team would continue to monitor their own, because they obviously have a lot of on prem devices as well, so they’re doing the same kind of hybrid monitoring >> Okay, all righty >> Thanks, and we’ll go to Lionel >> Thank you >> All right >> So actually, I think you did a great job talking about the nuts and bolts of getting from SCOM to Azure monitoring, so if I may, I’m gonna talk a little bit about the cultural aspects of going from a central system to a federated system and what some of the cultural changes are that you have to drive, because it turns out there’s quite a bit. At least, there was for our team. We’re in an application space, so we’re doing a lot more PaaS resources and not so many VMs; we do still have VMs, but they’re on the decline for us, and we don’t have any on premises at this point. So I’m gonna talk a little bit about why we chose the completely federated experience over a central one, how we did our alignment and our mapping, because you have to do something like that in order to manage in a federated environment, how we communicated quality to our team and to those we were passing the torch to for the quality of the monitoring, and finally, I’ll show you how we got the service engineers and developers to handle their own alerts. So why did we go to a completely federated environment? I kind of hinted at it, I mean, it’s the way the product is built, right, Azure is built as a federated system. Our engineers were already putting in Application Insights monitors for the most part, and even though we considered other options, and this wasn’t a straight line, by the way, we kind of realized that this was the path we were on, I’m gonna say about halfway through the path. But if you’re going from SCOM to an Azure monitoring system, you’re on this path in some way: either you’re trying to go to a central Log Analytics and do some of it yourself and you’re hybrid, but there are gonna be people that are handling their own monitoring, there are gonna be people that are innovating in that space in a very closed innovation cycle. So we actually decided we were gonna go completely federated because, again, people were already doing that. It matched the product road map as far as we were concerned, and we wanted to do something that was very light touch as well, so what does that look like?
When we were on SCOM and when we were central, our advantages were that we were in the middle and we knew where things were; we were both setting the stage for the processes and doing the implementation, so we knew where things were and we knew what people had. Our biggest challenge at that time was managing churn: getting churn from the outside and bringing it in, or providing self service so that we could push the configuration out to where the churn occurred. In the new environment, though, it’s a complete mirror image. If you think about it, self service is everywhere in Azure, people can do anything they want to and they’re able to innovate on a very rapid schedule, this is just one of the benefits of the cloud, and we become much more of a consulting arm. We’re no longer the central monitoring team, we’re much more about helping them do that, reducing toil. We took a larger scope in this case, so now we’re doing the processes and they’re doing the implementation, and our biggest challenges are: what are people doing? Are they managing to a pattern or to standards, are they leveraging each other, are they sharing some of their work back and forth? How do we make sure everybody’s covering completely and in a regular way? So that was a different paradigm for us. When you’re working in a federated environment, the first thing you’ve got to do is decide where your alignment is going to be. You’ve got to have some sort of mapping. We have over 50 services, just a ton of working parts,

we needed to have some way of getting a bird’s eye view of what was happening across our organization. So we decided that we were going to do a custom connector between our Azure monitors and ticketing; notice, we were gonna give them all the autonomy they needed in the monitoring space, which they were already doing, which was terrific, and then we were going to create an alignment layer on the way to ticketing. You could do this with standard connectors, especially if you’re in a smaller environment, but for us this gave us a number of benefits. It allowed us to enhance the alerts as they were coming through our system with different tagging and standard tagging, and it allowed us to avoid installing a CMDB system, or a catalog, that we would be driving alerts from. In other words, we wanted to make sure that we had alignment, and we wanted to make sure we were using out of the box as much as possible; the more you’re central on a federated system, the more customization you tend to have to do in order to make that work, with all the mapping that you have to do in order to get to ticketing. With that kind of a bubble up and then drill down troubleshooting experience, you need that mapping in order to figure out where things are when they come into your ticketing system >> So a lot of customers, and ourselves as well, have CMDB as a big part of the processes that go on, so how did you remove the need for the CMDB with incident management and the various things that people can easily think of? >> Yeah, I mean, that was a big debate. We started out, as I say, with the idea of a central Log Analytics, and a lot of people have made that choice. We just didn’t want to go back into a ticketing environment. We didn’t want to be the roadblock. We wanted to allow the innovation. We knew that monitoring was churning and evolving in Azure at a rapid pace and we didn’t want to lock them into one way of doing things. What we did need, though, is this portfolio level presence, so we tried a number of different things, and what we ended up with is, essentially, the catalog is still there, we just did it in an endpoint system. With Azure Functions, we created an endpoint for each service that would map to a tenant in our ticketing system, and then we would enhance each of those endpoints with the tagging that we knew mapped to each service, and then we allowed them options later on in the webhook to say, okay, this is not just my service but this is my component, this is my service option or a business process, this is the kind of resource I’m monitoring, just a number of different things, or I want to manage correlation better. We even got a little more sophisticated as we went along. That allowed us to handle our path to ticketing. It allowed us to handle standard logging, which gave us that bird’s eye view, so when all the logging went into one place with the proper tagging, we were able to get a bird’s eye view and help them manage some quality and do some kind of governance, and it allowed people like Joe to do some really great end to end experiences that were so much more possible now that we had everything in one place, right?
>> Right, and so this bridges to, we use service tree as a way to manage our taxonomy and our engineering processes, right, so our code, our alerts, everything now is based on service tree and this is one of the key pieces is it creates that data trail and that consistency from a priority and severity standpoint as you see in ticketing, but also mapping to service tree and the taxonomy that our organization uses to get, deliver features >> And service tree is an internal system for tracking services, I think? >> It is, it’s an internal system and it tracks your service line, like your high level business processes in various layers and drills down and correlates back to your Azure subscriptions and costs and a couple of other tools, so it is a CMDB of sorts, it’s just we didn’t create a cache of one of our own We’re using the standard, the federated version that >> It’s our catalog, right, so imagine if we had created a whole other version of the catalog in the middle of that, then it’s a big synchronization process and it’s just more glue than we really need to do in that circumstance >> So IT is leveraging the catalog that’s available for the full company >> Absolutely, yes, absolutely And this allowed us to, it is a sense of a cache but it’s only on the service level and that doesn’t change very much and it allows them a lot of flexibility in their monitors to then go ahead and add different tags in and as we go forward and as we get more tagging into the actual alerting, I imagine that this will eventually become unnecessary, but in the meantime, it worked out for us as a larger shop to help us really provide some fidelity
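To make the alignment layer a bit more tangible, here is a rough sketch of what one of those per-service endpoints might look like as an HTTP-triggered Azure Function in Python: it accepts an Azure Monitor alert, stamps it with the service’s tags, and shapes a ticket payload. The field names, the SERVICE_TAGS values, and the logging-only hand-off are illustrative assumptions; the team’s real function also parses Application Insights and Log Analytics payload variants and forwards into ICM.

```python
import json
import logging

import azure.functions as func

# Illustrative per-service context baked into this endpoint; in practice each
# service (or service option) gets its own endpoint so alerts arrive
# pre-tagged for the ticketing system.
SERVICE_TAGS = {
    "serviceTreeId": "00000000-0000-0000-0000-000000000000",  # placeholder
    "serviceName": "ExampleService",
    "ticketingTenant": "ExampleTenant",
}


def main(req: func.HttpRequest) -> func.HttpResponse:
    alert = req.get_json()

    # Azure Monitor's common alert schema keeps the summary under data.essentials;
    # fall back to defaults if a different payload shape arrives.
    essentials = alert.get("data", {}).get("essentials", {})

    ticket = {
        "title": essentials.get("alertRule", "Unnamed alert"),
        "severity": essentials.get("severity", "Sev3"),
        "state": essentials.get("monitorCondition", "Fired"),
        "firedTime": essentials.get("firedDateTime"),
        **SERVICE_TAGS,
    }

    # A real implementation would post `ticket` to the ticketing bridge here;
    # this sketch just logs the enriched alert and echoes it back.
    logging.info("Enriched alert: %s", ticket)
    return func.HttpResponse(json.dumps(ticket), mimetype="application/json")
```

Wiring it up is then the "check the box and paste your endpoint" step mentioned below: each team points its alert rules (or an action group webhook) at its own endpoint URL.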

>> So VSO, ICM, telemetry, everything hinges off of that catalog >> That’s right That’s a big win, by the way, for us, from the way things used to be >> Having everything directly tied to the service makes a lot of sense >> Yes, absolutely, so I highly recommend having one source for your catalog and trying to just stick to that one So the other thing that it did for us is we had to change ticketing systems halfway through so having that alignment layer means that we just had to turn the rudder, we didn’t have to get everybody to redo their alerting, we just had to turn the rudder on it It was great >> Okay, so, and did either one of the previous or the new have some ITSM connectors you could have leveraged versus writing? >> Yeah, and that was also a debate, it was the sort of thing where we decided okay, this is going to be our alignment layer and it was going to be alerts and health and also some live site things We started out with alerts And again, we said that, we’re gonna give autonomy in all this space where you’re doing the monitoring, but alerts is where we’re gonna start that alignment so as you express your health out, as you express your alerting out to the rest of the world, that needs to be aligned >> Okay >> So what this does is lower the barrier of entry for the service engineers on the team so they don’t have to be experts in that domain, for example, by moving to a decentralized system, you might get into a position where everybody has to learn a lot more about these technologies that aren’t really relevant to their day to day job so this was a case where they helped deliver that functionality so that we could just leverage that as SEs and very easily just consume that in a consistent way and then we can focus more on our application, our business options, the services we truly provide >> And it was light, as far as customizations go We used all Azure out of the box and the only thing we built was an Azure function that receives the alerts >> Oh, those get together really quick >> And they get together really quick and then we stepped out and we said, okay, if you’re using application insights, we have a parser for that, if you’re using log analytics, we have a parser for that, it’s nothing more than checking the box and pasting your endpoint, you’re done So that made things a lot easier, a lot faster, moved us along the process So the interesting thing is, though, once we did this, the realization was that we had given people a little bit too much white space People were innovating across the board in all kinds of different directions, which was fabulous, great stuff was coming out of it but it wasn’t aligned It wasn’t, there weren’t standards, there weren’t as many standards being followed and I can see, it was sort of like herding the cats They were all going in different directions and if we didn’t If we didn’t give them some guide rails, then it would be hard for us again to manage as a portfolio and to help them and to actually start reducing toil and leveraging the systems that we could put into place to help them, so we started pushing in on that autonomous space, we put in some standards, some documentation, we had all of this domain knowledge from when we were handling the implementation and we had to dump that down into a documentation tome with processes and all of that, that took some time and it also, and then once we did it, we had to do it again I mean, it was just, it takes a while to get that to happen, we also started out with some taxonomy, this was 
key; we talked with them about what component alerts were and what system alerts were, these are the typical ones we all think about when we think about monitoring, compared to data driven monitors, volume, velocity, or the kinds of things that let you know if your queue is building up, which tend to be much better alerts, especially if you want to arrange them as a site-down kind of representation. If you really want to tell how your system is doing, you don’t generally look at the redundant widget that’s recycling, you know? And then we started following coverage. We introduced systems like FMEA, failure mode and effects analysis, which looks at the business process first and then breaks down that process in terms of what could go wrong with the components that support it, and that turns out to be a great way of looking at your business processes, which are after all what supports your customers, which is what we’re trying to do in the first place >> So it sounds like part of this, though, was having to teach some of the folks a little bit more about monitoring concepts, so what was difficult about that, or what was the toughest? >> Absolutely, well, you know, it was really passing the torch. We had been handling the water quality, if you will, of this whole issue for so long, and a lot of that information was tribal, so I think the hardest part was getting that all down, starting to talk about it in processes and systems

as opposed to just a nuance that you know in your head >> Sure >> And furthering that point is like, the traditional way or the natural way for a lot of engineers, myself included, is to say, hey, I’m gonna cover this and make sure my code is doing what I expect it to do But that isn’t necessarily success kind of for the business or the capabilities or processes that you’re enabling so one of the benefits to that is it really forces more critical thinking around the output delivery as opposed to the input Measuring the inputs can be a good leading indicator, but measuring the outputs is a fairly authoritative indicator of success or failure >> And that’s a good example of how we grew as a team because if you think about it, most people just think about the monitors as being just the widgets and are they up or down or those kinds of things And we’ve grown a lot as a team by talking about this taxonomy, what does a complete monitoring picture look like, so I’m feeling comfortable now that I could talk to most anyone on my team and ask them what their T4, in this case, data driven measures, alert story is and they’ll actually tell me They’ll actually talk about it in terms that are much more sophisticated now >> I think people are doing a lot better job of monitoring now too because when we had the centralized infrastructure teams, I think people sort of expected that there was this uber team, doing things and taking care >> Covering everything >> Yeah >> Exactly, everything >> And a lot of, we had a lot of vendor teams, and a lot of resources monitoring those alerts and making sure that, trying to contact people and were they valid and following up on them and a lot of times, you had these engineering teams that are just like yeah, yeah, that’s infrastructure stuff, I don’t care as much about that But now that that has been transitioned into this DevOps model and you guys are doing a lot of your monitoring end to end yourself, I think the quality of the services are improved and then the quality of the monitoring is improved so people aren’t getting bothered about every little thing The stuff that you get bothered about is stuff that you actually want to be bothered about >> And it really brings it home As you say, people were throwing it over the wall It’s their problem, I don’t have to deal with it That whole paradigm is breaking down It’s now coming much more together and this is one of the ways that we force that cultural change in our team in order to actually have people think about manageability, so we have a lot more devs in the conversation talking about manageability and it took this kind of communication, this kind of conversations with them and it took a long time It took office hours, it took having sort of mini seminars >> Counseling, I provided therapy to people >> Emotional support >> Emotional, yes >> There was a little of that as well And it took laying out really what is quality to start with and having them feed back as to what they thought quality was and putting that back into the picture >> Well, in the DevOps mindset, I know that even for myself, when we first started talking about DevOps, it didn’t necessarily make a lot of sense and then trying to finally internalize that concept and now you’re out trying to communicate that to other people, like what does that mean, to not have a centralized team anymore, what do you mean, I am doing that? 
>> It means accountability is what it means >> Yeah, and it does, it means ownership >> Reduction of toil, for example >> Well, that, yes, yes, absolutely and we try to help, again, our nature has changed a lot We’re now the shared services team One of our big jobs is helping them do exactly what we used to do >> Right, I tell people that they kinda have to change their verb in their job description, so instead of saying I do this work, you know, I do monitoring for people, is that I enable others to do monitoring >> Absolutely >> Well stated >> Very well put, and on that note Once we got this underway, once we started having those kinds of conversations, we were able to take the next step which was the real kicker and that is we started having all engineers, service engineers and software engineers participating in the maintainability of the sites themselves, that was the real kicker, that started the virtuous cycle of responding to tickets, either improving the tickets themselves or coding our way out of problems that actually existed in production which then led to the improvement of quality and production all up and it does so on a very rapid basis Once people are really invested in the quality of it from a manageability perspective, then we got all kinds of great benefits We got increased auto detect versus the number of times the users, heck, we got a reduced, overall reduction in our major incidents all up and it was all due to the fact that the people who were writing the solutions were really paying attention to this and had a tight circle of innovation I will never, as someone on a monitoring team, understand your solution the way that you understand your solution and as soon as you get them invested on it and it’s worth it, takes time, but as soon as they

start investing in it, then they’ll see things that you never saw, you’ll never have in a monitoring interview, it’ll never come up in an abstract way >> Well, even our own team, even our own monitoring team, we had a tier one, two, three, four vendor support team and there was a lot of stuff that we just– >> Whole lot of separation >> That we just ignored it, you know, because we knew that tier one, two, three or four was going to take care of it, rarely did we actually get that escalation Now I get an incident now that’s severity three, it’s no big deal >> But it bypasses >> I am on call every three weeks >> I’ve gotten woken up already, it’s not fun >> The key there is you’re bypassing tier one and two entirely, you’re going straight to an actual engineer who can solve the problem I would say that those other tiers are good for end customer, external customer interactions, but for engineering type of fixes, they’re actually a waste of time and bringing this front of mind really puts it on the engineer’s doorstep as opposed to a deflection tool >> Well, we used to have so many that were just, we’d get the same alerts over and over again in our lower tier teams, we just close them as I checked it, the problem is gone >> Tier one gets paid by the ticket sometimes >> They do, yes >> Yeah, I mean >> I don’t >> Let’s just put it this way, it’s worth breaking down those walls and it’s worth actually putting people who have an opportunity to code themselves out of a problem, that is to say we’re gonna put in a fix for this It’s not just gonna happen and happen We’ll close it, but we’re gonna fix this, not only the monitor for it, not only do a good job with our aggregations and all of that, but actually fix the code that solved the problem in the first place, now we start talking about service reliability engineering, and its relationship, its in depth relationship with the feature teams, all of our SREs are embedded into the feature teams and they have an actual backlog of things that are manageability oriented and the devs are participating within that That has been a huge win for us and it’s also been a huge win for people like Joe who’s been able to now take it and run with it, I mean, not just the, really looking at what the customer sees and after all, that’s the reason we build all these solutions in the first place, right, is for customer processes, he’s been able to take that and run with it >> It is >> So I’ll let Joe show That part of it as a matter of fact >> Thanks, Lionel Well, and here’s one example of that, thanks a lot, Lionel, and it’s interesting how far the conversation has shifted from, I mean, the topic here is moving from SCOM to Azure, but in reality, making this move and using these tools is a key to enablement so that we can stop having that talk, we don’t have to spend as much time talking about monitoring, we can spend a lot more time looking at health One of the ways that we do that, this is a particular view of the availability results, actually, this is the dependency chart, the dependency chart across our entire portfolio so as we move into the Azure monitors and the native Azure space, what we’re able to see is this extremely rich data Is now available to us for multiple purposes That we find very valuable because we can take that data and we can now correlate it with business processes, time of day issues, we can look across the portfolio for things we don’t understand because we know what web tests or what dependencies look like natively We can set up alerts 
on the exact same data that we can set up reports on, and by having that reusability we never get into the conversation where, well, it looks green to me, there’s no alert, and somebody’s complaining elsewhere because it’s broken >> Where did you get, this is obviously a great portfolio level view, I imagine that you had to get your telemetry, though, from the actual microcosm before you were able to bubble this up, right? >> We do, so in this particular case we’re collecting data from Azure Data Explorer; there’s a centralized cluster, we call it the Kusto cluster, but the product name is Azure Data Explorer. All the data for all our subscriptions goes into this one location, so we’re able to stop thinking about different subscriptions in different areas and really focus on the taxonomy that aligns to our business and the things that are important to us. On the bottom of the chart, you’ll see the service name and the team group name; that’s how we correlate and that’s how we speak in terms that our stakeholders understand. We say, look, marketing automation, which is an area that I focus on, is at 99.91% consumption of its dependencies, and I can drill into that and see which dependencies are failing and which dependencies are working. And very quickly, this report that I built in an afternoon in Power BI, there’s a native export within App Insights or Kusto, Azure Data Explorer, where you can export these queries, paste them into a blank query in Power BI, and you can have this rich data readily available in Power BI at any time >> That’s actually timely. We just got a question in about that
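For a sense of the query behind a portfolio rollup like that, here is a minimal sketch against an Azure Data Explorer cluster using the azure-kusto-data Python package. The cluster URL, database, table, and column names are illustrative assumptions about how dependency telemetry might land in Kusto, not the team’s actual schema.

```python
from azure.kusto.data import KustoClient, KustoConnectionStringBuilder

CLUSTER = "https://mycluster.kusto.windows.net"  # placeholder cluster
DATABASE = "telemetry"                           # placeholder database

# Per-service dependency success rate over the last day, the kind of number
# behind a "99.91% of dependency calls succeeded" portfolio view.
QUERY = """
dependencies
| where timestamp > ago(1d)
| summarize total = count(), ok = countif(success == true) by serviceName
| extend successRate = round(100.0 * ok / total, 2)
| order by successRate asc
"""

# Reuses an existing `az login` session for authentication.
kcsb = KustoConnectionStringBuilder.with_az_cli_authentication(CLUSTER)
client = KustoClient(kcsb)
response = client.execute(DATABASE, QUERY)

for row in response.primary_results[0]:
    print(f"{row['serviceName']}: {row['successRate']}% over {row['total']} calls")
```

The same query text can be exported into a blank Power BI query, which is how a report like the one Joe describes stays on exactly the same data as the alerts.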

They wanted to know if the dashboard was created using Azure with analytics, but it sounds like it was Power BI? >> No, I chose Power BI So there’s a lot of tools, the reason I chose Power BI is because I have obfuscated most of the data that’s important and I don’t want to provision permissions to all my stakeholders So Power BI creates a clean way for me to publish out a dashboard without giving data access >> Gotcha >> In other cases, I may use Power BI or other tools where I need more detailed access into the data, but for this audience, we kinda all use the same views and so this is really useful >> Makes sense >> Looking at this one, we can see there’s a particular dip on the chart, in the upper right hand corner in the bar chart, you can see that just before six AM, there’s an increase of red and a dip in our global percentage, so when we drill into that, we highlight the red, so on the bottom center, I click, just filters the whole chart on red and then I can sort by the number of failures and see who’s the offender, who has the highest number of calls in the red and in this case, it’s the MPAASync And so I know right away even though that’s not my space, they’re my friend, they’re my partner, we work together so I’m gonna let them know that I see a problem and they can set up alerts on that if they want And this really applies to our whole portfolio Conversely, I would look at my other business process measurements and I would go hey, I see that the sync, it’s taking a long time to do these operations and it’s affecting our stakeholders and they can see that oh MPAA had a problem just before six AM >> Sounds like a virtuous cycle of improvement to me >> It is, it is and so we can, it’s very easy to consume this data and really drive and prioritize impacts and volume very quickly in our ecosystem We have a similar view for web tests, web tests are a very native part of App Insights You create a web test, it goes and checks a webpage And these use all standard taxonomies so they’re really easy to get for free and similar to our dependencies, you can see very quickly that we can correlate and see, look for dips and look for problems in the ecosystem so in this particular example, we did have a problem and we can see that it was the mktoLandingPages that were having a problem and we can see exactly which ones were failing, sorry for the block, we’re covering our URLs there, but you can see exactly which ones were failing and from which data centers the failures were coming from Again, giving us a very quick way to start to debug and triage issues and get them to the right DRI team Programmatically or even rolling up for reporting and other consumption scenarios to our stakeholders >> So no sifting through logs on different servers anymore >> No, we want to avoid sifting through logs Now the beauty of this is all the service engineers are now enabled to use this one view and they can go focus on things that are more custom and more specific to their domain What that does is it really, again, lowering the barrier of entry by having this rich data feed >> So it lowers the amount of work that they have to do >> Correct, and really, these are table stakes There’s no reason why somebody should have to go reinvent this wheel, so we’re able to do this for our whole team, by having a taxonomy that we understand, by using the endpoint, we’re able to create tickets, create reports and create alerts, all on these what we consider facts because they’re truths in our world, these are all truths for 
us, and we all agree on that, and then we can go talk about thresholds and other topics, which are a lot more productive >> So did you find anything unexpected or any interesting benefits? >> You know what, absolutely. I can tell you that 32 web pages across 16 regions, tested every five minutes, is a lot of calls, and our number one consumer was actually this one. (laughs) The mktoLandingPages, and so we actually stepped back the number of times we check as a result of that. We’re able to surface a stale web test very easily because there’ll be 0% success. Dependencies are the same story, where we can look at all the dependencies in our ecosystem and say which ones are at a really low percent. Either there’s a problem or they’re noise, but either way they dilute what we’re trying to do, which is find anomalies, find pattern changes that cause bad things for us, right? >> Sure >> That’s the role of a service engineer, really, it’s more about auditing change, deviation from expected behavior, and understanding if that’s good or bad >> I’d love to point out that this is a massive portfolio level view that’s all based upon instrumentation that starts out at the component, and it’s all because we’re able to bring it all together in Kusto, Azure Data Explorer, and visualize it across the board, and so now we have a lot more opportunities for our teams

and we’re starting to talk to our teams about standardization along the lines of what does reliability look like in a system, what is that calculation, you know, how many you tried versus how many succeeded, what was the performance, what are our core metrics that we want to see on every service? And we’re starting to put our quality of service together That it can do more of what Joe’s talking about and really get a bird’s eye view of the entire ecosystem and yet be able to drill down quickly without having to go sorting through logs after logs and figure out where the problems are >> So it sounds like the individual teams can do really well with Azure and the Azure portal app insights, the various things that are available and then for the wider view, you can take an aggregation of the data with Power BI or some other visualization tooling and be able to build this great looking view of this is how the entire environment looks and here is where I might have some problem spots >> Absolutely >> Yeah, and what we found in our space is a lot of times the engineering team and the business stakeholders really want the same things but we don’t know how to speak the same language about it, so when you hear a service engineer or any kind of engineer saying oh no, this is bad, we have to do it, then you hear the business team will say, well, we need this feature so that’s not important That conversation, we can mitigate very quickly with a data driven approach and so one of the things we do now is we’re able to leverage a very heavy data driven approach because now we’re talking about a single source of truth and we’re actually to the point where we’re trying to get away from thinking about this as telemetry or alerting, this is data, this is data about what’s important to us and really telemetry and alerting is just one consumption scenario, we like it when the business uses the same exact data, because then we agree on the same exact truths >> I know that it’s always, one of the things that I have noticed especially since we have moved to the cloud and using the portal and using Power BI is we used to have the concept of monthly business reviews and quarterly business reviews and it was primarily a PowerPoint and so it was never the fresh info This is what patching looked like yesterday when we made the slides Now with all these reports, the sort of things you’re showing, you can have the business owners and the engineers looking directly at the live information >> And we actually have reports in our ecosystem for another talk day, but they’re part of the release readiness for the sprint teams all the way up to the business external stakeholders that are using the exact same set of data That’s immensely powerful because we can all agree on this universal truth, and then we just see, well, we don’t like the way this looks, who can do something about it, who can fix it? And it helps us get to the right teams >> Health is no longer an abstract, it actually has a definition, it actually has a set of measures that everybody’s gonna be using and now we’re actually gonna be driving that conversation from the same, from the same perspective that’s pinned on the same truth >> Now everything is sounding awesome and hunky dory but if you were gonna do it again >> Okay >> If you had to start over, what would you do a little differently? 
>> Do it earlier >> (laughs) Okay >> Yeah, I mean, you stole my thunder there. I would have trusted the north star earlier and better. You know, when you first embark on a journey like this you’re gonna start out and you’re gonna waffle a little bit and you’re gonna do some things in parallel. I would have jumped in with both feet a lot earlier >> Yeah, and that actually is a really good point, because you can’t just have a couple of people jumping in. They’re just gonna get wet. It has to be an all in scenario, so you have to describe to people that yes, this is the journey we are on. For example, if we had understood earlier that we were going to a fully federated system, and that it was going to take this, this, this and this, we could have planned a little bit better, known what our people were doing out there a lot easier, gotten that mapping into us a lot easier, and still allowed them the autonomy and provided better guard rails a lot earlier. It takes a while to impart the knowledge. That is just gonna take some time, but again, the more you have buy in, right, from the top down. We had actually great support, I will say that. We had people, I mean, management was driving us toward this even, I think, as we were wondering whether it was gonna land, I mean >> And part of that might have been down to spending a little more time early on understanding the real value proposition and how this can really help you, and again, in some ways making some work just automatic and getting it off your plate, in some ways reducing time to mitigate or time to detect or time to resolve, right? And then aligning that to where do you get your budget, where does the money for why you exist

come from, because that’s the person you really have to make happy, that’s what drives your product and your engineering cycles >> Sure, and to that point, you have to actually be able, if you know what the journey is and you say look, when we are federated, we’ll get this and this and this, we’ll actually get people that have ownership, we’re gonna see, if you’re talking about less major incidents and happier customers and all, what’s that worth to you? That actually can help drive the conversation to get everybody on board and that’s what you really want You want the support from everybody, you have to do this Was there resistance, and for example, in developers, the idea that they would be actually waking up to an alert at three AM, you bet there was resistance to that That’s why it was so important to have all of the other steps done leading up to that They would be very upset if we woke them up at three AM for a redundant system that was recycling They’d be a lot less upset if it was a data driven curated monitor that said, you know what, your system’s on fire, go take a look They don’t have any problem with that at that point They’re gonna call on all their buddies and say look, let’s get this thing on the rails because they all want the same thing >> Absolutely >> It just makes that conversation a lot easier When you’ve gone through that kind of investment and the learning and so forth >> And the technology is probably the easy and the fun part >> It is >> Which is why we all work at Microsoft, it’s what we’re interested in, but the people conversation and getting teams to transition, I think it was the most difficult >> So at this point we’ve completely removed SCOM, right? >> Yes >> And so how long has this journey been, I mean, from the beginnings all the way to where you folks are all today? >> Now I’m the manageability service manager, which we’re doing this for core monitoring but I used to be the SQL service manager and we transitioned from using SCOM to monitor SQL about a year and a half ago, so that’s how long we’ve been working on it in the centralized IT team >> Okay, and then how long did it take for start to finish, to get here? 
>> About a year, about a year, I would say >> About a year >> Again, it wasn’t a straight line >> Right, there’s some zigs and zags >> We tried the central hybrid, we tried the tag lookup, we tried all of that sort of stuff, we just said no, that’s not it and we finally settled on the federated experience I think that, again, anybody’s going to Azure monitor is gonna be on that road, it’s just how far do they want to go, we decided it was worth it to go all the way and that drove a lot of our decisions once we made that decision, so once we made that decision– >> And a lot of that is due to the breadth of our portfolio We have things ranging from data systems to web interfaces to transactional processing engines to, there’s a lot of things in our ecosystem so that was a big benefit for us >> Heterogeneous, yes >> And decentralization might not really be a hot topic for a smaller company You know, for us it is, but they might just be looking at the technology piece where they’re transitioning from SCOM to Azure alerts or using some other Azure features and they’re like, well, we can’t really decentralize >> Well, I mean, in that case, they’re changing the technology, the alerts are going from the same team back to the same team because they’re handling it soup to nuts which is just fine It will be faster, easier to manage and more cost effective than having to have that SCOM infrastructure of those folks >> Oh, absolutely, we have a lot less cost in doing what we do and I get to have a lot more breadth in my role now as do you, right? As far as helping people, I spend a lot more time trying to help people reduce toil, to provide good standards, we are looking more at the customer experience rather than we are at the monitors and making the monitors >> Yeah, we talk a lot about providing guard rails and safety nets and making the point of entry easier for people so we’re doing a lot of consulting and we’re trying to do some of the harder exploratory technical work and then also we do a lot of work with the product group, so we’re adopting things, as soon as they become available for piloting, we’re trying to take some of that pain so we can test some of the features before they actually get released out to the real paying customers So and that’s fun as well because we’re focusing now on more brand spanking new technology rather than spending all of our time on legacy software >> Well, and then we get to all come together and tell everybody how we figured it out and that it’s easy to do so that’s one of the reasons why >> Our job’s definitely worth it >> As you move to Azure, don’t forget to look for the things that you get for free, the things that’ll automatically start trickling into your App insights in various areas, your dependencies, your requests, all the native calls that’ll show up, there’s a lot of value there, right away that you’ll get for free so leverage that data, use data to tell your story and to help

with those prioritization discussions >> Absolutely, there’s so much going on in that space right now. There’s a lot more machine learning that’s going into the actual monitoring space, like, hey, your throughput just fell, is that something you want to know about? There’s application mapping now that people can do as a great input for what they’re going to cover, you know, and again, I think 95%, 96% of the stuff that we’ve been talking about here wasn’t custom, so that’s, I think, one of the big things, well, if you think about it, I’ve got one Azure function, you’ve got a toolkit that has scripts >> And personally, we use one function because we need to connect to our internal ticketing system >> So there’s still an opportunity for some customization for those that can, but it’s really a big out of the box story with Azure, which is now forging ahead in this new world of not only using data but also providing a lot of the tooling, I think, that is gonna help our people do out of the box what they had to work so hard to do before, which I think is super exciting >> So where can folks get some more info? >> Well, they can access all the IT Showcase resources, start there >> So it looks like we have the manageability toolkits published out there >> Yeah, we did some presentations at Ignite and we got a lot of feedback about making those available. I’ve probably been asked 100 times, how do I get the toolkit, and so we finally put those out on GitHub. We had to clean them up a little bit and make sure nothing proprietary was in there, but there they are, so you can use that link to get to our toolkits out on GitHub >> Why reinvent the wheel if somebody has one you can use? >> Absolutely >> Sure, and it’s super simple. This one just uses the API; we’re gonna explore deploying the toolkits using ARM templates and then we’ll be doing some more improvements, so we’ll continue to update that particular toolkit and maybe add some of our other toolkits as we transition. You know, traditionally operations, we’ve always been in the background, and now because of all these changes we’ve moved more into the forefront, more customer engagements, so these kinds of opportunities are arising where we have space to actually publish stuff that other people will be excited to use >> Now that we’re not making monitors all day, it’s a nice thing, yeah, so I appreciate that >> Well, we’re almost done, so why don’t we each just throw out one interesting thing or one key thing you learned out of this that you want to have everybody remember as they exit out, so let’s start with Joe >> I said it before, data is king, use the data, use what you have available, agree on a single source of truth, good things will happen >> I would say, go all in, learn to let go of the monitoring, let people actually take ownership. I think that made all the difference in the world. Start that virtuous cycle of improvement. It reaps massive benefits, and get your team in on it and do it sooner rather than later. Get going >> Okay >> And I would say, whatever you decide to do, whatever you decide to plan, get your leadership team to support you, getting their buy in is really important, and then find someone in your organization who really likes to communicate and get out there and talk and spread the word around about what your leadership team did in fact agree to, and then don’t be afraid to try >> And that’s why we have you, Dana. Get out there and spread the word >> I do
that, don’t I? >> Yes, you do great All right, so, great points, thanks, I really appreciate that you could all come in This was a very interesting discussion I learned some things today so I’m very happy about it The on demand version of the webinar will be posted soon to Microsoft.com/itshowcase, where you can also find a whole bunch of other related content, case studies, blogs, I have a blog out there and look for every opportunity to see any of our upcoming webinars, we’ve got all kinds of topics coming up and be sure to join us and spread the word that this was interesting, tell your colleagues that we’d love to have you learn IT from us here at Microsoft, thanks much, take care (music)