Applying the power of Azure Machine Learning to improve SAP incident management

May 31, 2018

One important aspect of digital transformation is embracing modern technologies and processes that can improve the customer experience. Microsoft found a perfect opportunity to do this—we used Azure Machine Learning and AI to automate the triage component of our SAP incident management process. Our solution reduced the mean time to resolve SAP user issues, increased incident routing accuracy to 99 percent, and freed staff to focus on more strategic aspects of their roles.

As enterprises continue to move into the digital world, an ever-increasing portion of their operations can benefit from leveraging the cloud, thus helping to improve scalability, enable a mobile workforce, and reduce data storage costs. But what about the human factor? As a technical decision maker, have you considered how going digital can also drive better customer service?

At Microsoft, we’re continuing our digital transformation journey, where our IT and product teams regularly collaborate to identify and solve challenges that exist within the enterprise. One example is the recent joint initiative of Microsoft Core Services Engineering and Operations (CSEO) and the Azure product team: we incorporated AI and machine learning (ML) technologies into our SAP incident management process to improve support ticket routing accuracy and significantly reduce incident resolution time.

Drive to improve SAP incident management

Our Operations organization at Microsoft continuously searches for ways to make processes more efficient and to improve the user experience by providing self-service solutions or implementing self-correcting routines that prevent emerging issues before they occur. We’ve learned that the more complex the process, the bigger the challenge—and, when we build a successful solution, the greater the reward.

SAP incident management is one such example of a process we identified for improvement. Supporting our SAP users requires a wide variety of domain-specific knowledge. Our SAP Support personnel are divided among several teams, each specializing in a particular functional or technical area. The sheer scale of our company—Microsoft has more than 125,000 employees, plus customers, vendors, and partners who all touch an SAP system at some point in the course of doing business—means that our SAP Support teams handle thousands of incidents each month.

Traditionally, a user’s SAP incident would be first triaged by a support staff member to determine to which of five different SAP support groups the incident should be routed: SAP Technical, SAP Human Capital Management, SAP Supply Chain Management, SAP Business Intelligence, or SAP Finance & Master Data Governance. The incident would then be placed in that support team’s queue to be resolved. As part of our ongoing efforts to improve SAP processes, when we reviewed incident management processes and operations with the SAP Support staff, we discovered an inefficiency in this incident-routing process. As illustrated in Figure 1, for email requests sent to SAP Support, our analysis revealed an average 30-minute delay between the time a new incident first landed in the assignment group queue and when a staff member assigned it to the appropriate SAP Support team’s queue.

Figure 1. In our previous system, an incoming request for SAP Support would sit in a holding queue for an average of 30 minutes before a person reviewed its details and routed the incident to the appropriate SAP Support group.

This delay increased our mean time to resolve (MTTR) and degraded our internal customers’ user experience. One potential approach to address this issue could have been to provide additional personnel training that emphasized the importance of becoming more efficient at triage. However, we saw this issue as an excellent opportunity to automate the triage component by using AI and Azure Machine Learning.

The opportunity: automate via Azure Machine Learning

Machine learning is particularly good at prediction, classification, and anomaly detection, and incident routing is all about classification. Therefore, incorporating Azure Machine Learning into our SAP incident routing was a natural fit—especially when considering these criteria:

  • Leverage an available rich dataset. A good dataset is critical for training AI—and the bigger the dataset is, the better. Without robust, readily available data, any AI project is considerably more challenging and costly to develop. Because we had access to the historical data of all the SAP incidents, we knew that we had sufficient information to create an accurate computer model.
  • Alleviate or eliminate human-based errors. Automating the triage component with Azure ML gave us the potential to significantly reduce—or possibly eliminate completely—the time delay, as well as improve the system accuracy. Any other method that focused on support staff retraining could never approach the speed or accuracy of an automated, AI-based solution.
  • Maximize benefit while minimizing cost or operational impact. AI and ML are services that are readily available in Azure. We could use Azure Machine Learning Studio to easily insert the AI features into our existing SAP support process and gain the benefits of automation without having to stand up additional servers or train personnel on new systems.

Applying data science to two different methodologies

After we decided to solve the triage time delay by using AI, we engaged our Microsoft Data Science team to design the AI model. In this case, however, our data scientists decided to approach the solution using two different methodologies, as detailed below.

The knowledge base approach

The key assumption in this approach was that we could repurpose the data dictionaries—lists of keywords used for human triage and for routing incidents to each of the five SAP Support groups. Building a model using this type of approach required an existing dataset that included a data dictionary containing all information we needed to extract—and in this situation, we had exactly that. We provided the data scientists the existing data dictionaries for each of the five routing queues.

The following steps summarize the process that our data scientists used to develop the knowledge base model.

  1. Data collection. The data scientists collected a dataset that comprised 13,801 SAP incident tickets, spanning approximately 1.5 years. In addition to the incident data, the data scientists used the existing data dictionaries previously described that contained the keywords specific to the different SAP Support queues. These data dictionaries were derived from a dataset that technical support personnel created to help the SAP service desk personnel review incident descriptions, identify keywords, and route an incident to the matching SAP Support queue.
  2. Preprocessing. The data scientists performed text preprocessing to “clean” this historical incident data. To do this, they used the R programming language to automate conversion of all words to lowercase, and to remove extraneous material including punctuation, numbers, extra spaces, and non-keywords such as “and,” “or,” “the,” and others.
  3. Processing. At this point, the data scientists allocated a random sampling of 80 percent of the historical dataset as the in-sample data, which would be used to build the knowledge-base model. They used R to automate extraction of the salient words from the cleaned data and matched these words and phrases with the keywords in the data dictionaries. The resulting mapping would then determine each incident’s routing, based on the incident’s identified keywords.
  4. Results. When the knowledge base model was integrated into the incident routing system and back-tested with the out-of-sample data (the remaining 20 percent of the historical dataset that wasn’t used previously to build the model), it achieved an accuracy of 59.3 percent.
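The preprocessing and keyword-matching steps above can be illustrated with a minimal sketch. The data scientists used R for this approach; Python is used here only for illustration. The stop-word list, the keyword dictionaries, the `Triage` fallback queue, and the function names are all hypothetical and are not the actual dictionaries or code from the project.

```python
import re

# Hypothetical stop words and per-queue keyword dictionaries (illustrative only).
STOP_WORDS = {"and", "or", "the", "a", "to", "is", "in", "my", "for"}
DATA_DICTIONARIES = {
    "SAP Technical": {"login", "password", "timeout"},
    "SAP Human Capital Management": {"payroll", "timesheet", "leave"},
    "SAP Finance & Master Data Governance": {"invoice", "ledger", "vendor"},
}


def preprocess(text):
    """Lowercase the text, drop punctuation and numbers, and remove stop words."""
    text = text.lower()
    # Replace everything except ASCII letters and whitespace with a space.
    text = re.sub(r"[^a-z\s]", " ", text)
    # split() also collapses runs of extra spaces.
    return [word for word in text.split() if word not in STOP_WORDS]


def route_by_keywords(text, default="Triage"):
    """Pick the queue whose dictionary shares the most keywords with the text."""
    tokens = set(preprocess(text))
    scores = {
        queue: len(tokens & keywords)
        for queue, keywords in DATA_DICTIONARIES.items()
    }
    best_queue = max(scores, key=scores.get)
    # Fall back to the default queue when no keyword matched at all.
    return best_queue if scores[best_queue] > 0 else default
```

In this sketch, an incident whose description matches no dictionary keyword falls back to a default queue, which is one plausible way to handle the cases that a pure keyword approach cannot classify.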

The Azure Machine Learning/AI approach

The key assumption in this alternate approach was that we had no data dictionary for any of the SAP incident queues. The data scientists worked from the same dataset as in the previous method, but it was their only source of information, so the model had to learn the routing patterns directly from the historical incidents.

The following steps summarize the process the data scientists used to develop this AI model.

  1. Data collection. The data scientists used the same dataset comprising 13,801 SAP incident tickets that the previous method used.
  2. Preprocessing. After obtaining the data, the data scientists performed text preprocessing to clean the data in a manner similar to the knowledge-base approach described previously. However, they used Python scripting in this case instead of R to automate the process.
  3. Processing. After preprocessing, the data scientists needed to do some feature engineering to answer the following question: How can we transform the unstructured data, such as the text that’s in our dataset, into structured data that can be fed through Azure Machine Learning? After testing several different modeling methods, the data scientists decided to use the feature hashing method, which uses a hash function to hash textual data into numerical values. Once they had structured data, the data scientists built a deep learning-based AI model using the in-sample data (the random sampling of 80 percent of the historical dataset) as was described in the previous section.
  4. Results. When the AI model was integrated into the incident routing system and back-tested with the out-of-sample data (the remaining 20 percent of the historical dataset that wasn’t used previously to build the model), it achieved an accuracy of 85 percent.
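To make the feature hashing and the 80/20 split described above more concrete, here is a minimal sketch using only the Python standard library. It is not the Azure Machine Learning implementation: the choice of SHA-256 as the hash function, the bucket count, the fixed random seed, and the function names are assumptions made for illustration, and the deep learning model itself is not shown.

```python
import hashlib
import random


def hash_features(tokens, n_buckets=1024):
    """Hashing trick: map each token to a bucket index and count occurrences."""
    vector = [0] * n_buckets
    for token in tokens:
        digest = hashlib.sha256(token.encode("utf-8")).hexdigest()
        # The same token always lands in the same bucket.
        vector[int(digest, 16) % n_buckets] += 1
    return vector


def split_in_out_sample(rows, in_fraction=0.8, seed=0):
    """Randomly allocate rows to in-sample and out-of-sample sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * in_fraction)
    return shuffled[:cut], shuffled[cut:]
```

The appeal of hashing is that the output vector has a fixed length regardless of vocabulary size, so previously unseen words in new incidents still map to a valid feature position. The cost is that two different words can collide in the same bucket.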

Hybridizing two different approaches into an integrated model

We now had two viable models that we could apply to automate the triage component of the SAP incident management process. Although the AI model demonstrated a significantly higher accuracy rate than the knowledge base model (85 percent versus 59.3 percent), we recognized the effort that had been put into developing each approach. Could there be additional value in taking the learnings from both initial models and creating a new integrated model? What accuracy rate would we achieve from the synergy of the two?

To explore this possibility, the data scientists generated features from the knowledge base model using similarity matching and generated features from the AI model using hash functions. These features were then combined and run through Azure Machine Learning to generate the integrated model which, when tested with the out-of-sample data, resulted in an even higher 93 percent accuracy rate—significantly better than either previous model had achieved on its own.
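One way to picture the combination step is as a simple concatenation of two feature vectors per incident. The sketch below is an assumption-laden illustration: the article does not say which similarity measure was used, so Jaccard similarity stands in for "similarity matching", and the hash function, bucket count, and function names are hypothetical.

```python
import hashlib


def hashed_counts(tokens, n_buckets=8):
    """Hashed bag-of-words counts (stand-in for the AI model's features)."""
    vector = [0] * n_buckets
    for token in tokens:
        digest = hashlib.sha256(token.encode("utf-8")).hexdigest()
        vector[int(digest, 16) % n_buckets] += 1
    return vector


def dictionary_similarity(tokens, dictionaries):
    """One Jaccard similarity score per queue dictionary, in sorted queue order."""
    token_set = set(tokens)
    scores = []
    for queue in sorted(dictionaries):
        keywords = dictionaries[queue]
        union = token_set | keywords
        scores.append(len(token_set & keywords) / len(union) if union else 0.0)
    return scores


def integrated_features(tokens, dictionaries, n_buckets=8):
    """Concatenate knowledge-base similarity features with hashed features."""
    return dictionary_similarity(tokens, dictionaries) + hashed_counts(tokens, n_buckets)
```

The resulting vector gives a downstream classifier both the expert knowledge encoded in the dictionaries and the patterns found in the raw text.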

Implementation

After we selected the integrated model to use in our automation, we began a three-phase implementation process to validate that the model would perform as expected in a production environment.

Phase 1: Vetting algorithm accuracy against new incidents

In the first phase, we spent two weeks running the integrated model against every new SAP incident that came in from an email request. This initial process only updated the incident’s notes to indicate the AI’s routing decision—the automated system wasn’t controlling the routing yet. After each incident was closed, we examined the incident’s notes to compare the SAP group that the automated system selected to where the SAP support desk personnel actually routed the incident. What we found during this phase was that the accuracy was even higher than our expectation, achieving a remarkable 98.8 percent.

This level of accuracy was much higher than the minimum 80 percent accuracy rate that we had established at the outset of the project as the signal that the solution was ready to move into production. Why set an 80 percent success threshold? At that level, we determined that the solution would clearly improve overall MTTR because the 30 minutes saved for each of the correctly routed incidents would more than make up for any additional time support staff would spend to reroute the 20 percent of incidents that were incorrectly routed.
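The reasoning behind the threshold can be expressed as simple arithmetic. The sketch below assumes a 30-minute saving per correctly routed incident (from the analysis described earlier) and a hypothetical extra cost, in minutes, for each misrouted incident. The article does not report the actual rerouting cost, so the default of 30 minutes is purely an assumption.

```python
def net_minutes_saved(accuracy, triage_delay=30.0, reroute_cost=30.0):
    """Expected minutes saved per incident at a given routing accuracy.

    reroute_cost is a hypothetical penalty for each misrouted incident.
    """
    return accuracy * triage_delay - (1 - accuracy) * reroute_cost


def break_even_accuracy(triage_delay=30.0, reroute_cost=30.0):
    """Accuracy at which savings and rerouting costs cancel out."""
    return reroute_cost / (triage_delay + reroute_cost)
```

Under these assumed values, 80 percent accuracy yields a net saving of about 18 minutes per incident, and the break-even point is 50 percent accuracy. A higher rerouting cost raises the break-even point, which is why the threshold should be derived from your own operational numbers.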

With the latest accuracy rating at 98.8 percent, we were confident that the integrated model could make significant improvements to our production incident management system.

Phase 2: Production rollout for email-initiated incidents

Putting our automated triage solution into production for email-initiated support requests was a simple process of changing how incoming email requests were routed. Instead of being directed to the triage queue where the incident would await human review, email-based incidents would now be sent through the new automated process. The integrated model would scan the content for keywords, add a tag that identified the incident as being routed by AI, and then assign the ticket to the appropriate SAP group for remediation.

As our integrated model went live on our production environment, we wanted to continue to monitor whether the system was routing each incident correctly. To do so, we communicated with the SAP Support staff, asking them to take the following steps when they saw any incorrectly routed incident that was tagged as coming from AI:

  1. Tag the incident as being incorrectly routed.
  2. Add a brief explanation of why they decided that the incident had been incorrectly routed.
  3. Reroute the incident to the appropriate group.

We used this data to help retrain our algorithm, improve its accuracy, and measure the solution’s efficacy. To date, we’re experiencing a greater than 99 percent accuracy rate. Moreover, the average 30-minute delay that occurred in the human-powered triage model has been reduced to approximately one second—a huge performance increase that is helping reduce our MTTR for email-based incidents.
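A minimal sketch of how routing accuracy could be measured from this feedback follows. The ticket structure and the tag names `routed-by-ai` and `misrouted` are hypothetical. The key point is that only incidents carrying the AI tag are counted, so that misrouted tickets from other sources don't distort the result.

```python
def routing_accuracy(tickets, ai_tag="routed-by-ai", misrouted_tag="misrouted"):
    """Share of AI-routed tickets that were not flagged as misrouted.

    Each ticket is a dict with a "tags" collection (hypothetical schema).
    Returns None when there are no AI-routed tickets to evaluate.
    """
    ai_tickets = [t for t in tickets if ai_tag in t["tags"]]
    if not ai_tickets:
        return None
    wrong = sum(1 for t in ai_tickets if misrouted_tag in t["tags"])
    return 1 - wrong / len(ai_tickets)
```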

Phase 3: Extending AI to other non-email modalities

Up until this point, the implementation of our AI-based integrated model had focused on a single input modality: email-based SAP Support requests. However, CSEO offers several different support modalities to empower people to connect with technical support by whatever means is most suitable for them, such as filling out an online support request or calling technical support on the phone or through Skype.

All these non-email-based inputs use a web-based form to initiate the support request. For example, when a person contacts support through chat or phone, the support agent enters the issue into a web-based form; a person filling out an online support request enters the details into the web form directly.

In the original incident management system, every web-based support form would set the routing field to the default support queue. This meant that unless the user changed the field, the incident would sit for an average of 30 minutes until a support person triaged it and directed it to the appropriate SAP group. As illustrated in Figure 2, we saw an opportunity to improve the user experience and improve routing of these incidents by replacing this default routing queue entry with input from the integrated model.

We achieved this by:

  1. Exposing the AI as a web service to integrate it with our web-based support forms.
  2. Having AI analyze the support request details in real time as the agent or user was inputting information into the support form.
  3. Having AI then use that information to set the incident routing to the recommended SAP group; the user could override and change to an alternate group if desired.
Figure 2. Our AI-based integrated model has been deployed to all our SAP Support input modalities, from email, to chat, to phone, and more. All these input sources feed into our integrated model, which then analyzes the incident’s content and either automatically routes the incident (for email sources) or prepopulates the web-based form that identifies the appropriate SAP queue (for the non-email input sources).
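The form integration described above can be sketched as a request handler that returns a recommended queue and lets a user selection override it. This is not the Azure Machine Learning web service API: the JSON field names, the function names, and the placeholder `recommend_queue` rule (which stands in for a call to the deployed model) are all hypothetical.

```python
import json


def recommend_queue(description):
    """Placeholder for the deployed model; a trivial keyword rule stands in here."""
    if "invoice" in description.lower():
        return "SAP Finance & Master Data Governance"
    return "SAP Technical"


def handle_routing_request(body):
    """Take a JSON request body and return a JSON response with the routing field."""
    request = json.loads(body)
    recommended = recommend_queue(request.get("description", ""))
    # A queue explicitly selected by the user overrides the recommendation.
    selected = request.get("user_selected_queue") or recommended
    return json.dumps({"recommended_queue": recommended, "routing_queue": selected})
```

Keeping the recommendation and the final routing value as separate fields makes it possible to log how often users override the model, which is useful feedback for retraining.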

We have now fully implemented this AI solution across our entire SAP incident management system, which handles an average of 5,000 incidents per month. The inclusion of AI into our system has both improved performance in terms of MTTR and increased the routing accuracy across all our input sources.

Lessons learned and best practices

  • Be smart about choosing which process you address first with AI. Not every process is a suitable candidate for an AI-based solution. You need to consider the potential disruption to business-critical systems, the cost of new licensing or additional infrastructure, and the time and cost of training your resources. For CSEO, applying AI to the incoming SAP incident queue was an attractive proposition because it would automate a laborious manual task and could easily be inserted into our existing SAP support process as a bolt-on extension to Azure. For us and for other enterprises that have adopted Azure, there’s no need for new servers, new systems, or additional training. Everything is within Azure, ready for use.
  • Set clear goals that define success. What do you want AI to achieve? What metrics will you use to measure your project’s efficacy? The number itself isn’t important, but the reasoning behind your threshold for success should align with your operational and business objectives. For CSEO, we defined 80 percent accuracy as the minimum requirement to move our AI solution into production because the time savings accrued by having 80 percent of tickets correctly routed would significantly reduce overall MTTR even when accounting for the potential 20 percent of misrouted tickets.
  • Clean your data. As you collect and clean your historical data, invest some time up front to define repeatable processes that you will follow each time you need to prepare your data. Doing so will streamline your efforts to perform this step in any future use and will help with training purposes. For example, if you develop a SQL script to extract data from a database, ensure that you can re-use it for future retraining. You also need to have a good understanding of the nature of your data: what to configure, its type (structured versus unstructured), how to best filter out noise, and so on.
  • Build the best data dictionary possible. Building a good data dictionary begins with having the right subject matter experts (SMEs) who are familiar with the terms that should be used to uniquely define keywords that distinguish one group from another. AI can also assist here, acting as a complement to your SMEs to help refresh the data dictionary on a regular basis. Finally, be sure to vet the keywords lists and strive to use unique words or phrases instead of common ones. For example, if a word such as “SAP” exists in all five data dictionaries, it’s not likely to add value to the data dictionary because it doesn’t distinguish between the groups.
  • Work with familiar automation technologies. Use the programming tools that are familiar to you to automate your processes. Our data scientists had different programming backgrounds, so some chose to program with R, whereas others preferred Python scripting. There isn’t a right or wrong selection; just work with the path of least resistance that delivers the results you require.
  • Find the right balance between accuracy and time cost. Data is the most critical ingredient in any data science-based project, because larger datasets help build more accurate models. However, accuracy isn’t everything. When designing your AI solution, you need to find a balance between accuracy and time cost. There will always be a trade-off between these two aspects. With our dataset of more than 13,000 tickets, we were able to make a highly accurate model in a very short timeframe. If we had a much smaller dataset to work with, either the accuracy would have been much lower, or we would have had to spend much more time to achieve an accurate model. Find the balance between what you require in terms of accuracy and time to build the model that aligns with your definition of success.
  • Validate your results. When we first rolled out our solution into the production environment, the initial results showed 91 percent accuracy, which was lower than the accuracy we had measured in the previous implementation phases. As we reviewed the incoming incident tickets that had been flagged as misrouted, we discovered that Support personnel were tagging all incorrectly routed tickets, not just the tickets with the AI tag. After modifying the analysis to filter out non-AI-tagged tickets, the actual accuracy rate was identified as 98.8 percent.
  • Extend the value of your solution by identifying additional applications or usage scenarios. Design your AI solution so that it can be extended easily or repurposed to benefit other systems. We built our solution so that it could be exposed as a web service, which enabled us to start with email-based incident routing, and then later extend this same solution to our other support modalities.

Conclusions

Adopting new technologies and platforms can be a difficult sell within an organization—especially if the technology hasn’t previously been integrated into the company’s systems. AI is one example of this; the notion of solving business problems with cognitive computing might be considered too expensive and exotic for some stakeholders. However, this is at the core of what digital transformation can deliver: changing old, siloed ways of thinking; automating operations and incorporating agile methodologies; moving systems to the cloud; and enhancing the user experience. The question for technical decision makers shouldn’t be whether to embrace the new; instead, they should ask: where do I start?

At Microsoft, we’re very excited about the power of AI and the value that it can bring to our organization. And we didn’t have to look far to find a readily available machine learning engine: Azure offers a wide range of capabilities that include AI. Working with Azure Machine Learning Studio enabled us to build our solution without middleware or other third-party licensing, hardware, training, or support. Because we’re already running all our SAP processes in Azure, we’re extracting even more value out of our cloud investment by incorporating Azure’s built-in AI capabilities into the existing system.

By deploying Azure Machine Learning within our SAP incident management system, we’ve significantly reduced the mean time to resolve (MTTR) SAP user-reported issues. Our automated AI solution also allows us to reallocate the human support resources who used to perform this task to more strategic tasks for the business.

Our next steps

From the beginning, our intent has been to build an end-to-end solution that can be applied to many business scenarios surrounding our SAP implementation. This first Azure Machine Learning project has proved a great success, and as such, is laying the groundwork for us to leverage the solution in other applications. We plan to continue our work with Azure Machine Learning and explore incorporating bot technology to streamline our entire suite of incident support systems, which has the potential to improve the performance and user experience in more than 1,500 apps.

Ultimately, we want to expand our use of Azure Machine Learning and AI beyond support queue automation and explore how we might use it and other Azure products and services to solve new challenges. With our initial implementation completed, we now have a workflow in place to create additional machine learning algorithms and then deploy them to production where they can benefit other business-critical processes.

For more information

If your organization hasn’t done so already, create your free Azure account. Next, extract more value from your technology investments by improving your processes with AI and Azure Machine Learning. Use the following links to learn more about these technologies.