Systems Innovation

Towards Highly Reliable Services with AIOps

Rujia Wang, Principal Research PM; Chetan Bansal, Principal Research Manager; Saravan Rajmohan, Partner Director AI & Applied Research; and Jim Kleewein, Technical Fellow

For well over a decade, Microsoft has provided one of the world’s most popular hyper-scale productivity suites, Office 365, which is now part of Microsoft 365. Microsoft 365 includes hundreds of different services running billions of transactions a second on hundreds of thousands of servers in many dozens of data centers worldwide. It delivers day-to-day cloud services to hundreds of millions of enterprise, education, and consumer users.

Those services can never be down. Our services are used by hospitals and trauma centers, power grid providers, national, state, and local governments, major banks and financial services providers, airlines, shipping and logistics providers, and businesses from the largest to the smallest. To meet their needs, we must be continuously available, which means 100% availability over long periods of time. Our services should operate seamlessly through disasters, because disasters are often when our services are most essential, for example to coordinate emergency response.

Therein lies a great challenge. Our extreme scale means that in our services “one in a billion” events are not rare; they are commonplace. At the same time, we cannot allow those “one in a billion” events to compromise the availability of our service. This combination of almost unbelievably massive scale and extreme criticality requires us to continuously rethink and improve every aspect of service architecture, design, development, and operations. One important aspect of achieving continuous availability and highly reliable services is to understand incidents holistically and mitigate their impact on customers.

Beyond using Artificial Intelligence (AI) and Machine Learning (ML) to develop new productivity features and capabilities that delight our users, we are also leveraging the power of AI and ML to improve service availability and reliability, which is essential for our hyper-scale services. This article shows one example of applying AI to managing the production incident life cycle. We plan to share more examples in future articles.

— Jim Kleewein, Technical Fellow, Microsoft 365
Acknowledgement

This post includes contributions from Supriyo Ghosh, Toufique Ahmed, Manish Shetty, Suman Nath, Tom Zimmermann, Xuchao Zhang, Yu Kang, Qingwei Lin, and Dongmei Zhang.

Introduction

Microsoft 365 (“M365”) is the world’s largest productivity cloud. Hundreds of thousands of organizations of all sizes use it. Whether you’re having a Teams meeting, composing emails in Outlook, or collaborating on a Word document with your colleagues, you’re relying on M365 to power these productivity tools and applications. M365 is powered by web-scale, massively distributed cloud services, with exabytes of data handled by O(100K) servers in O(100) datacenters around the globe. To ensure best-in-class productivity experiences, it’s critical that our engineering infrastructure is highly reliable while also being efficient.

Here at the M365 Systems Innovation research group, we leverage the power of AI and integrate Cloud Intelligence and AIOps into our services and products. We use innovative AI/ML technologies and algorithms to help design, build, and operate complex cloud infrastructures and services, and to provide a step-function improvement in operational efficiency and reliability, enabling us to deliver best-in-class productivity experiences. We are applying AIOps to several domains:

  • AI for Systems to make intelligence a built-in capability to achieve high quality, high efficiency, self-control, and self-adaptation with less human intervention. 
  • AI for Customers to leverage AI/ML to create unparalleled user experiences and achieve exceptional user satisfaction using cloud services. 
  • AI for DevOps to infuse AI/ML into the entire software development lifecycle to achieve high developer productivity. 

Helping build highly reliable cloud services has been one of our key focus areas. One of the challenges is to quickly identify, analyze, and mitigate incidents. Our research starts from the fundamentals of production incidents: we analyze the life cycle of incidents and study their common root causes, mitigations, and the engineering effort required for resolution.

Understanding Production Incidents

Figure 1: The overview of service reliability problems in large-scale cloud services

Our award-winning paper provides a comprehensive, multi-dimensional empirical study of production incidents in the large-scale M365 cloud used by Microsoft Teams. Since Microsoft Teams powers real-time communication, reliability is paramount. Understanding production incidents from the detection, root-causing, and mitigation perspectives is the first step toward building better monitoring and automation tools. Figure 1 shows an overview of service reliability problems in large-scale cloud services, as summarized in our research paper.

Common root causes and mitigations behind incidents

Figure 2: Breakdown of root cause analysis (RCA) and mitigation categories

While code bugs are the most frequent single cause of incidents, the majority of incidents (~60%) were caused by issues unrelated to code or configuration, in infrastructure, deployment, and service dependencies. We also observed that among the ~40% of incidents that were caused by code or configuration bugs, nearly 80% were mitigated without a code or configuration fix.
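To make the two percentages concrete, the short calculation below combines them into shares of all incidents. It is only a back-of-the-envelope sketch using the rounded figures quoted above (about 40% and nearly 80%); the function and variable names are illustrative, not from the paper.

```python
# Rounded shares quoted in the text; these are approximations, not exact data.
code_or_config_share = 0.40   # incidents whose root cause was a code/config bug
mitigated_without_fix = 0.80  # of those, mitigated without a code/config fix


def share_breakdown(root_share, no_fix_rate):
    """Split a root-cause share into (mitigated without a fix, mitigated with a fix)."""
    without_fix = root_share * no_fix_rate
    with_fix = root_share - without_fix
    return without_fix, with_fix


without_fix, with_fix = share_breakdown(code_or_config_share, mitigated_without_fix)
print(f"code/config root cause, mitigated without a fix: {without_fix:.0%}")
print(f"code/config root cause, mitigated with a fix:    {with_fix:.0%}")
```

Under these rounded inputs, roughly a third of all incidents had a code or configuration root cause yet were mitigated by some other means, and only about one in twelve was mitigated with a code or configuration fix.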

TTD and TTM for root causes and mitigations

Figure 3: Average TTD and TTM for different root causes categories

Figure 4: Average TTD and TTM for different mitigation steps

The time to detect (TTD) and time to mitigate (TTM) of incidents caused by code bugs and dependency failures are significantly higher than those of other incidents. In addition, 30% of the mitigation delay is caused by manual mitigation steps.
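As an illustration of how per-category averages like those in Figures 3 and 4 could be computed, here is a minimal sketch. The incident records, field names, and timestamps are hypothetical, and the definitions used (TTD from impact start to detection, TTM from detection to mitigation) are assumptions; the article does not spell out the exact definitions used in the study.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical incident records; all fields and values are made up for illustration.
INCIDENTS = [
    {"root_cause": "code bug", "started": "2023-01-01T00:00",
     "detected": "2023-01-01T00:40", "mitigated": "2023-01-01T03:00"},
    {"root_cause": "code bug", "started": "2023-01-02T10:00",
     "detected": "2023-01-02T10:20", "mitigated": "2023-01-02T11:20"},
    {"root_cause": "config", "started": "2023-01-03T05:00",
     "detected": "2023-01-03T05:05", "mitigated": "2023-01-03T05:35"},
]


def minutes_between(start, end):
    """Elapsed minutes between two ISO-8601 timestamps."""
    delta = datetime.fromisoformat(end) - datetime.fromisoformat(start)
    return delta.total_seconds() / 60


def average_ttd_ttm(incidents):
    """Return {category: (mean TTD, mean TTM)} in minutes.

    Assumed definitions: TTD = detected - started, TTM = mitigated - detected.
    """
    totals = defaultdict(lambda: [0.0, 0.0, 0])  # [ttd_sum, ttm_sum, count]
    for inc in incidents:
        bucket = totals[inc["root_cause"]]
        bucket[0] += minutes_between(inc["started"], inc["detected"])
        bucket[1] += minutes_between(inc["detected"], inc["mitigated"])
        bucket[2] += 1
    return {cat: (ttd / n, ttm / n) for cat, (ttd, ttm, n) in totals.items()}


for category, (ttd, ttm) in average_ttd_ttm(INCIDENTS).items():
    print(f"{category}: mean TTD {ttd:.0f} min, mean TTM {ttm:.0f} min")
```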

Takeaways

(1) Incidents caused by software bugs and external dependencies take longer to detect due to poor monitoring. This highlights the need for practical tools for fine-grained, in-situ system observability.

(2) Incidents in some root-cause categories are quick to mitigate once the root-cause category has been determined. This suggests that the overall mitigation time for these incidents can potentially be reduced with tools that quickly identify their root-cause category.

(3) Incidents caused by some root causes are inherently hard to monitor automatically (e.g., those that require monitoring global state). This suggests that developers should invest more in testing to uncover those root-cause categories before production, thereby avoiding such incidents.

We also envision that automation is the future of incident diagnosis: automatically identifying the root cause and mitigation steps helps resolve incidents quickly and minimize customer impact. In addition, we should leverage lessons learned from past incidents to build resilience against future ones. We posit that adopting AIOps and using state-of-the-art ML models, such as large language models (LLMs), can help achieve both goals.

Using Large Language Models for Automatic Incident Management

Recent breakthroughs in AI have enabled large language models (LLMs) to develop a rich understanding of natural language. They have become good at understanding and reasoning over large volumes of data. They can also generalize across a diverse set of tasks and domains, such as code completion, translation, and Q&A. Given the complexities of incident management, we were motivated to evaluate the effectiveness of these LLMs in helping to root-cause and mitigate production incidents.

Figure 5: Leveraging GPT-3.x for root cause analysis and mitigation

In our recent work, which we will present at the ICSE 2023 conference, we demonstrate for the first time the usefulness of LLMs for production incident diagnosis. When an incident is created, the author specifies a title for the incident and describes any relevant details, such as error messages, anomalous behavior, and other information that could help with resolution. We use the title and the summary of a given incident as the input to the LLMs and generate root cause and mitigation steps.
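The input construction described above can be sketched as follows. This is only an illustration of combining an incident's title and summary into a single text prompt; the prompt wording, the function name `build_prompt`, and the sample incident are assumptions, and the call to an actual model is deliberately left out because the article does not describe it.

```python
def build_prompt(title, summary, task="root cause"):
    """Format an incident's title and summary into one text prompt.

    The layout below is a hypothetical template, not the one used in the study.
    """
    if task not in ("root cause", "mitigation"):
        raise ValueError(f"unsupported task: {task!r}")
    return (
        f"Incident title: {title.strip()}\n"
        f"Incident summary: {summary.strip()}\n"
        f"Recommended {task}:"
    )


# Made-up incident used purely as an example.
prompt = build_prompt(
    title="Elevated 5xx errors on sign-in endpoint",
    summary="Error rate rose after the latest deployment; logs show connection timeouts.",
)
print(prompt)
```

Passing `task="mitigation"` produces the same prompt ending in a request for mitigation steps, mirroring the two generation tasks studied.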

We conducted a rigorous study on more than 40,000 incidents and compared several LLMs in zero-shot, fine-tuned, and multi-task settings. We find that fine-tuning the GPT-3 and GPT-3.5 models significantly improves the effectiveness of LLMs on incident data.

Effectiveness of GPT-3.x models at finding root causes

Table 1: Lexical and semantic performance of different LLMs

In our offline evaluation, we compared the performance of GPT-3.5 against three GPT-3 models by computing three lexical similarity metrics between the generated recommendations and the ground-truth root cause or mitigation steps recorded in the incident management (IcM) portal. The average gains of GPT-3.5 on these metrics for the different tasks are as follows:

  1. For the root cause and mitigation recommendation tasks, Davinci-002 (GPT-3.5) provides gains of at least 15.38% and 11.9%, respectively, over all the GPT-3 models, as shown in Table 1.
  2. When we generate mitigation plans by adding the root cause as input to the model, the GPT-3.5 model provides a gain of at least 11.16% over the three GPT-3 models.
  3. We observe that LLMs perform better on machine-reported incidents (MRIs) than on customer-reported incidents (CRIs), due to the repetitive nature of MRIs.
  4. Fine-tuning LLMs with incident data improves performance significantly. The fine-tuned GPT-3.5 model improves the average lexical similarity score by 45.5% for root cause generation and 131.3% for mitigation generation over the zero-shot setting (i.e., inference directly on the pretrained GPT-3 or GPT-3.5 model).
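To give a feel for what a lexical similarity score and a relative gain look like, the sketch below implements one simple lexical metric: an F-measure based on the longest common subsequence (LCS) of words between a generated recommendation and the ground truth. This is a stand-in chosen for illustration; the article does not name the three lexical metrics that were actually used, and the example strings and scores are made up.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_f1(generated, reference):
    """Word-level LCS F-measure between generated text and ground truth."""
    gen, ref = generated.lower().split(), reference.lower().split()
    if not gen or not ref:
        return 0.0
    lcs = lcs_length(gen, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(gen), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def relative_gain(new_score, baseline_score):
    """Percentage improvement of new_score over baseline_score."""
    return (new_score - baseline_score) / baseline_score * 100


# Hypothetical strings and scores, for illustration only.
score = lcs_f1("restart the frontend service", "restart the service")
print(f"LCS F-measure: {score:.3f}")
print(f"relative gain: {relative_gain(0.30, 0.20):.1f}%")
```

A gain such as the 45.5% reported above is a relative improvement of this kind, computed on averaged metric scores rather than on a single pair of strings.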

Looking Through the Incident Owners’ Eyes

In addition to the quantitative analysis with semantic and lexical metrics, we also interviewed the incident owners to evaluate the effectiveness of the generated recommendations. Overall, GPT-3.5 outperforms GPT-3 on the majority of the metrics. More than 70% of on-call engineers (OCEs) gave a rating of 3 or above (out of 5) for the usefulness of the recommendations in a real-time production setting.

Looking Forward

While we are at the initial stages of using LLMs to help automate incident resolution, we see many open research questions in this field whose answers could significantly increase the efficacy and accuracy of LLMs. For instance, how can we incorporate additional context about the incident, such as discussion entries, logs, service metrics, and even dependency graphs of the impacted services, to improve the diagnosis? Another challenge is staleness, since the models would need to be frequently retrained with the latest incident data. To address these challenges, we are working on leveraging the latest ChatGPT model combined with retrieval-augmented approaches to improve incident diagnosis via a conversational interface. For instance, ChatGPT can help engineers efficiently determine an incident’s root cause by raising hypotheses and answering critical questions within a feedback loop.

Figure 6: Workflow of Retrieval-augmented RCA
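The retrieval step of a retrieval-augmented workflow can be sketched roughly as follows: find the historical incidents most similar to the new one and place them in the prompt as context, so the model sees recent examples without being retrained. Everything here is an assumption made for illustration. Word-overlap (Jaccard) similarity stands in for whatever retriever is actually used, the incident history is invented, and the prompt template is hypothetical.

```python
# Hypothetical incident history; summaries and root causes are made up.
HISTORY = [
    {"summary": "mailbox login failures after certificate expired",
     "root_cause": "expired TLS certificate"},
    {"summary": "high latency in meeting join due to dependency timeout",
     "root_cause": "downstream service timeout"},
    {"summary": "disk full on storage nodes",
     "root_cause": "log rotation disabled"},
]


def tokenize(text):
    """Lowercased set of whitespace-separated words."""
    return set(text.lower().split())


def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def retrieve_similar(query, history, k=2):
    """Return the k historical incidents whose summaries best match the query."""
    q = tokenize(query)
    ranked = sorted(history, key=lambda inc: jaccard(q, tokenize(inc["summary"])),
                    reverse=True)
    return ranked[:k]


def build_rca_prompt(query, neighbors):
    """Assemble a prompt that shows retrieved incidents before the new one."""
    lines = ["Similar past incidents:"]
    for inc in neighbors:
        lines.append(f"- Summary: {inc['summary']} | Root cause: {inc['root_cause']}")
    lines.append(f"New incident: {query}")
    lines.append("Likely root cause:")
    return "\n".join(lines)


query = "users report login failures certificate error"
print(build_rca_prompt(query, retrieve_similar(query, HISTORY, k=1)))
```

Because the history is consulted at query time, adding a newly resolved incident to it takes effect immediately, which is one way a retrieval-based design can ease the staleness concern raised above.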

Moreover, ChatGPT can be actively integrated into the “discussion” of the incident diagnosis. By collecting evidence from available documents and logs, the model can generate coherent, contextual, natural-sounding responses to inquiries and offer corresponding suggestions, thereby facilitating the discussion and accelerating the incident resolution process. We believe this has the potential to deliver a step-function improvement in the overall incident management process through contextual and meaningful root cause analysis and mitigation, thereby significantly reducing the human toil involved and bolstering our reliability and customer satisfaction.