A Practical Approach to Monitoring Your Cloud Workloads

By William Darnell, Cloud Solution Architect at Microsoft
Tony Barker, Cloud Solution Architect at Microsoft
Claudia Lopez, Cloud Solution Architect at Microsoft
Mark Graham, Cloud Solution Architect at Microsoft
Lily Satterthwaite, Program Manager at Microsoft

03/05/2023

Being a Cloud Solution Architect is great. We become trusted advisors to many customers across lots of different industries, helping them to be successful and get the best out of Microsoft Azure. A customer will take advantage of the many great Azure resources available to them, assembling these resources in the cloud to implement their particular workload, thoroughly test it, see it all working beautifully and finally prepare to move it into production to start delivering those business value objectives. All is well with the world!

As ‘go-live’ day approaches for your shiny new workload, the focus moves from ‘architectural excellence’ to ‘operational excellence’. Typically at this point lots of questions arise from the operational teams:

Do we have sufficient monitoring and alerting in place?
What should we be monitoring for our Azure workload?
What tools should we be using to monitor our Azure workload and what do we need to do to implement this?

The good news is that Microsoft has a lot of documentation and guidance to help, such as the Cloud Adoption Framework, the Well-Architected Framework and the Azure Architecture Center. These can help you get started with your cloud adoption goals, together with a wealth of information on the many Azure monitoring tools.

This blog is intended to build upon all of this by proposing a prescriptive and practical approach that anybody could implement with their teams. It aims to answer these questions and get you on the road to implementing a solid monitoring solution tailored to your particular Azure workloads.

We start today with an overview of the process. In the near future we will release some specific example scenarios to help bring this to life, addressing key areas such as monitoring for networks, applications and SAP.

Finally, remember that this is a continuous journey. Whilst this blog provides an approach which enables the implementation of a monitoring Minimum Viable Product (MVP), it should be followed by continuous review and refinement of your solution as it evolves and new requirements are identified.

Where do I start?

Any workload that is deployed in the cloud is going to have a lot of component parts that combine to form the overall solution. It’s not any different to, say, a car that has wheels, a gearbox, an engine, a transmission and doors. All of those parts combine together for the overall solution of being a mode of transport that gets you to work and back home. A cloud solution will have networking, maybe some virtual machines, some storage and probably some platform services and applications that all need to be monitored so that you can understand how it’s performing and just as importantly, when a failure occurs, understand exactly where the fault lies.

It is vitally important to ensure quick identification and resolution of anomalies and ensure performance and availability of deployed solutions are maintained within your Service Level Agreements. Azure provides the cloud-based tools to allow you to monitor across all levels of your software stack plus the underlying compute, storage and networking components provided by Azure itself.

With such a wide range of monitoring points, the inevitable issue that all businesses face is understanding which monitoring services need to be combined to deliver that end-to-end visibility.

A six step approach

You should understand that your monitoring strategy will evolve over time, so be careful not to delay by trying to cover every base up front. Your first objective is to ensure “Observability.” You need to capture some key information about your resources which will allow you both to monitor your environment and to learn for future evolution.

Below are six steps that should be covered to build that baseline of observability:

Step 1: Evaluate Workload: Document the architecture for your workload and list all Azure services that make up the solution.

This is an important first step to baseline your workload and, importantly, identify all services involved in the solution, from the underlying platform (networking, peerings, ingress/egress appliances etc.) through resources (virtual machines, storage, databases, integration services, PaaS services etc.), up to the applications themselves. This is where we clearly define what we should be taking into consideration for an end-to-end monitoring solution. So the output here will likely be an architecture drawing and a spreadsheet listing all of the identified services.
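As a minimal sketch of what the step 1 spreadsheet could look like in code, the snippet below records each service against its layer and exports the list as CSV. The workload, the resource names and the three layer labels are hypothetical and only mirror the platform/resources/applications split described above.

```python
import csv
import io
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class ServiceEntry:
    layer: str         # "platform", "resource" or "application"
    service_type: str  # kind of Azure service
    name: str          # hypothetical resource name


# Hypothetical workload inventory; replace with your own architecture.
INVENTORY = [
    ServiceEntry("platform", "Virtual network", "vnet-hub"),
    ServiceEntry("platform", "VPN gateway", "vpngw-onprem"),
    ServiceEntry("resource", "Virtual machine", "vm-app-01"),
    ServiceEntry("resource", "Storage account", "stappdata"),
    ServiceEntry("application", "Web application", "orders-web"),
]


def inventory_to_csv(entries):
    """Render the inventory as CSV text that opens in any spreadsheet tool."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=["layer", "service_type", "name"],
        lineterminator="\n",
    )
    writer.writeheader()
    for entry in entries:
        writer.writerow(asdict(entry))
    return buffer.getvalue()


def count_by_layer(entries):
    """Count services per layer to spot a layer that has been overlooked."""
    counts = {}
    for entry in entries:
        counts[entry.layer] = counts.get(entry.layer, 0) + 1
    return counts


print(inventory_to_csv(INVENTORY))
print(count_by_layer(INVENTORY))
```

A layer with a count of zero is usually a sign that part of the architecture has not been documented yet.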

Step 2: Review Available Metrics, Logs and Services: For each Azure service identify and document all available metrics, logs and other monitoring services.

Azure services already have a wealth of metrics, logs and insights available to use. So the proposal here is that, for each service identified in the previous step, the monitoring options already available should be identified and listed. This gives a great starting point for answering the question “what should we be monitoring for our Azure workload?”. The output here will be a list of metrics, logs and monitoring services against each resource.
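One possible way to hold the step 2 output is a small catalogue keyed by service, as sketched below. The metric, log and service names are illustrative placeholders rather than a verified list from the Azure documentation, and the "not yet reviewed" check is our own assumption about how you might track progress.

```python
# Catalogue of available monitoring sources per service type.
# All names are illustrative placeholders, not an authoritative Azure list.
CATALOGUE = {
    "Virtual machine": {
        "metrics": ["CPU percentage", "disk read bytes"],
        "logs": ["activity log"],
        "services": ["VM insights"],
    },
    "VPN gateway": {
        "metrics": ["tunnel bandwidth"],
        "logs": ["tunnel diagnostic log"],
        "services": [],
    },
    # Present in the inventory, but nobody has reviewed its sources yet.
    "Storage account": {"metrics": [], "logs": [], "services": []},
}


def flatten(catalogue):
    """Turn the nested catalogue into (service, kind, source) rows."""
    rows = []
    for service, kinds in catalogue.items():
        for kind, names in kinds.items():
            for name in names:
                rows.append((service, kind, name))
    return rows


def unreviewed(catalogue):
    """Return services that have no metric, log or service listed at all."""
    return [s for s, kinds in catalogue.items() if not any(kinds.values())]


for row in flatten(CATALOGUE):
    print(row)
print("Still to review:", unreviewed(CATALOGUE))
```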

Step 3: Assemble Requirements: Create clear unambiguous monitoring requirements from existing sources and/or newly identified requirements.

The previous step should provide food for thought when it comes to deciding what you may want to monitor, including some of the things Microsoft would recommend you look at. However, it is likely you have your own ideas and requirements for what you want to monitor. Some of these may be covered by the monitoring sources identified in step 2 and others may not. This makes it a very important stage: assembling your monitoring requirements in a clear, unambiguous fashion.

You should be able to categorise monitoring requirements. For example, wanting to receive an alert email for a metric threshold breach is not the same as wanting a dashboard showing the variation in that metric over the last 90 days. So you could classify the former as an ‘alert’ category whilst the latter is a ‘performance’ category etc.

As a starting point, you should consider writing these requirements as a list of ‘User Stories’. A User Story describes a desired end state, told from the perspective of the person who wants the functionality. It is widely used in software development as a small unit of work. This approach ensures that you capture the “who” as well as the “what” and “why” for the monitoring requirement. You can then categorise your stories into different sections, each with success criteria referred to as the ‘Definition of Done’ (DoD). This approach works very well for monitoring requirements. Here are some category examples:

‘Alert’
- Definition: Notification when monitored thresholds are breached
- Format: email, text, alarm console bulb etc.
‘Performance’
- Definition: Variation of a measured value over time
- Format: dashboards (graphs, time series), emailed reports etc.
‘Troubleshooting’
- Definition: Pro-active investigations into specific issues
- Format: logs

With this approach you can write a monitoring requirement like this example:

Title	Action	Comments
VPN Connectivity Alerts	Story	As a ‘Cloud Operations Engineer’, I want to receive an alert notification by email when connectivity from Azure to on-prem over the VPN connection fails, so that I can immediately investigate and remediate the issue.
	DoD	• The alert is triggered when packets sent from the Azure NIC to the on-prem NIC over the VPN link fail to arrive.
		• An alert notification email is received by the ‘cloud support engineering’ email alias within 15 minutes of the occurrence.
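If you prefer to keep requirements in a structured form rather than a table, the sketch below shows one way to model a story with its category and Definition of Done. The class and field names are our own invention for illustration; the categories and the VPN example are taken from the text above.

```python
from dataclasses import dataclass, field
from enum import Enum


class Category(Enum):
    ALERT = "Alert"
    PERFORMANCE = "Performance"
    TROUBLESHOOTING = "Troubleshooting"


@dataclass
class MonitoringStory:
    title: str
    category: Category
    who: str   # the person desiring the functionality
    what: str  # what they want
    why: str   # the reason it matters
    definition_of_done: list[str] = field(default_factory=list)

    def story_text(self) -> str:
        """Render the story in the usual 'As a ..., I want ..., so that ...' form."""
        return f"As a '{self.who}', I want {self.what}, so that {self.why}."

    def is_complete(self) -> bool:
        """A story needs who, what, why and at least one DoD item to be usable."""
        has_parts = all([self.who.strip(), self.what.strip(), self.why.strip()])
        return has_parts and bool(self.definition_of_done)


vpn_story = MonitoringStory(
    title="VPN Connectivity Alerts",
    category=Category.ALERT,
    who="Cloud Operations Engineer",
    what=(
        "to receive an alert notification by email when connectivity "
        "from Azure to on-prem over the VPN connection fails"
    ),
    why="I can immediately investigate and remediate the issue",
    definition_of_done=[
        "Alert is triggered when packets over the VPN link fail to arrive",
        "Email reaches the 'cloud support engineering' alias within 15 minutes",
    ],
)

print(vpn_story.story_text())
print("Complete:", vpn_story.is_complete())
```

Checking completeness up front helps catch stories that name a wish but no measurable success criteria.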

Step 4: Map your Requirements to Metrics, Logs and Services: Map each requirement to a metric, log or service that satisfies the requirement.

Now that you have assembled your requirements and have a list of all the Azure monitoring sources from the previous steps, you can map your requirements to the monitoring sources. This is an iterative process of evaluating available metrics and logs for each of your Azure resources and then mapping which of these meet your requirements. This may result in you spotting new requirements to add to the requirements list as well as identifying where an ‘out of the box’ metric can meet that requirement. The output is each requirement marked with the monitoring sources (metrics, logs, services etc.) that meet it, with any requirement that has no suitable option flagged as a gap.
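The mapping exercise can be sketched as a simple set intersection, assuming you have the step 2 sources and the step 3 requirements in hand. Everything in the example data is hypothetical, including the source names and the idea that each requirement lists the candidate sources the team believes would satisfy it.

```python
# Step 2 output: sources available per service (illustrative names only).
AVAILABLE_SOURCES = {
    "VPN gateway": {"tunnel bandwidth metric", "tunnel diagnostic log"},
    "Virtual machine": {"CPU percentage metric", "guest OS log"},
}

# Step 3 output: requirement -> (service, candidate sources that would satisfy it).
REQUIREMENTS = {
    "VPN Connectivity Alerts": ("VPN gateway", {"tunnel diagnostic log"}),
    "VM CPU trend dashboard": ("Virtual machine", {"CPU percentage metric"}),
    "Order latency dashboard": ("Web application", {"request duration metric"}),
}


def map_requirements(requirements, available):
    """Return requirement -> sorted list of candidate sources that actually exist."""
    mapping = {}
    for title, (service, wanted) in requirements.items():
        mapping[title] = sorted(wanted & available.get(service, set()))
    return mapping


def find_gaps(mapping):
    """Requirements with no matching source need flagging for follow-up."""
    return sorted(title for title, matched in mapping.items() if not matched)


mapping = map_requirements(REQUIREMENTS, AVAILABLE_SOURCES)
for title, matched in mapping.items():
    print(title, "->", matched or "NO SOURCE FOUND")
print("Gaps:", find_gaps(mapping))
```

Because the process is iterative, you would rerun this whenever the catalogue or the requirements list changes.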

Step 5: Populate Backlog Stories: Create clear unambiguous backlog stories for implementation of each requirement.

The next stage is to convert the outputs from the previous stages into a list of actual tasks for implementation of the monitoring requirements. The deliverable here will be a list of tasks for each of your requirements for implementation in your Azure environment/landing zone.

These tasks will need to map to the specifics of your Azure landing zone. For example, if the environment is managed through CI/CD pipelines and uses ARM templates, then the tasks could involve the creation of ARM templates to implement your monitoring solution. If you are using Terraform or another tool, the tasks would target that instead. Here is where you will select the tools that meet your requirements and that fit your preferences or environment constraints.
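To illustrate how mapped requirements could be expanded into backlog tasks, here is a rough sketch. The task wording, the rule that an ‘Alert’ story needs an alert rule and a ‘Performance’ story needs a dashboard, and the example data are all assumptions made for illustration; the tooling name is simply free text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BacklogTask:
    requirement: str
    description: str


def build_backlog(mapping, categories, tooling):
    """Expand each mapped requirement into implementation tasks.

    mapping:    requirement title -> list of monitoring sources
    categories: requirement title -> 'Alert', 'Performance' or 'Troubleshooting'
    tooling:    name of the deployment tool your landing zone uses
    """
    tasks = []
    for requirement, sources in mapping.items():
        if not sources:
            # Nothing was mapped, so the first task is to close the gap.
            tasks.append(BacklogTask(requirement, "Identify a monitoring source"))
            continue
        for source in sources:
            tasks.append(BacklogTask(requirement, f"Enable '{source}' via {tooling}"))
        category = categories.get(requirement)
        if category == "Alert":
            tasks.append(
                BacklogTask(requirement, f"Define alert rule and notification via {tooling}")
            )
        elif category == "Performance":
            tasks.append(BacklogTask(requirement, f"Build dashboard via {tooling}"))
    return tasks


# Hypothetical inputs standing in for the outputs of the earlier steps.
EXAMPLE_MAPPING = {
    "VPN Connectivity Alerts": ["tunnel diagnostic log"],
    "Order latency dashboard": [],
}
EXAMPLE_CATEGORIES = {
    "VPN Connectivity Alerts": "Alert",
    "Order latency dashboard": "Performance",
}

backlog = build_backlog(EXAMPLE_MAPPING, EXAMPLE_CATEGORIES, tooling="ARM templates")
for task in backlog:
    print(f"[{task.requirement}] {task.description}")
```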

Step 6: Manage Data: Define data storage and retention policies.

By the end of step 5 you know what you are going to build, but the process doesn’t stop there. It is important to understand how much data your particular monitoring solution will generate, where it will be stored, how frequently you will access it and how long you plan to retain it. This will have a direct impact on cost, so it is important to clearly define and optimise your policy for managing this data. The output of this stage will be a list of data stores and alerts with details on how that data will be accessed, retained, archived and deleted.
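A back-of-the-envelope calculation is often enough to see how ingestion volume and retention interact. The sketch below assumes a deliberately simple model (a flat price per GB ingested plus a flat price per GB retained per month) and uses placeholder prices that are not Azure pricing; check the current price list and any included retention before relying on numbers like these.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataStore:
    name: str
    daily_ingest_gb: float
    retention_days: int


# Placeholder prices for illustration only, NOT real Azure pricing.
HYPOTHETICAL_INGEST_PRICE = 2.0     # currency units per GB ingested
HYPOTHETICAL_RETENTION_PRICE = 0.5  # currency units per GB retained per month


def steady_state_gb(store):
    """Volume held once the store has been running for a full retention period."""
    return store.daily_ingest_gb * store.retention_days


def monthly_cost(store, ingest_price, retention_price, days_per_month=30):
    """Estimate monthly cost as ingestion cost plus cost of data held."""
    ingest_cost = store.daily_ingest_gb * days_per_month * ingest_price
    retention_cost = steady_state_gb(store) * retention_price
    return ingest_cost + retention_cost


# Hypothetical data stores with different retention policies.
stores = [
    DataStore("platform-logs", daily_ingest_gb=2.0, retention_days=90),
    DataStore("app-traces", daily_ingest_gb=5.0, retention_days=30),
]

for store in stores:
    cost = monthly_cost(store, HYPOTHETICAL_INGEST_PRICE, HYPOTHETICAL_RETENTION_PRICE)
    print(f"{store.name}: {steady_state_gb(store):.0f} GB held, ~{cost:.2f} per month")
```

Even with made-up prices, the model shows that retention length multiplies the stored volume, which is why the retention policy deserves an explicit decision per data store.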

Conclusions

By following this six-step approach, you’ll be able to effectively monitor your Azure workloads and ensure that they are performing optimally. With the right monitoring in place, you’ll be able to identify and address issues before they become major problems, and you’ll be able to provide a better user experience for your customers.

For a real-world example of how you would monitor networking in Azure, be sure to check out our follow-up article here.