Service availability, arguably the single most important KPI for cloud computing, can be degraded by various incidents. The state of the art in incident troubleshooting, however, still relies on exhausting manual effort by human experts.
Our ongoing project, CloudBrain, aims to invent new algorithms and build systems for automatic, real-time troubleshooting of large-scale cloud systems. At the algorithm level, CloudBrain constructs global views by connecting the subcomponents of a system, and then localizes the failed components using methods from machine learning. At the system level, CloudBrain leverages the characteristics of troubleshooting data streams to build troubleshooting operators driven by the troubleshooting scenarios.
Troubleshooting algorithms
DeepView for Azure VM virtual hard disk (VHD) failure pattern detection
DeepView is a system we built for automatic Event17 pattern detection and diagnosis. In Azure IaaS (Infrastructure as a Service), VMs run on Azure Compute and their data is stored in Azure Storage. The VMs access their data via Azure Network. When an IaaS VM cannot access its data (virtual hard disk, or VHD) on Azure Storage, it gets an Event17 error code and is shut down. Event17 is one of the top factors hurting the availability of Azure services. However, an Event17 may be caused by Azure Compute, Azure Storage, or Azure Network. Because there has been no efficient algorithm for automatic Event17 diagnosis, these Event17s are labeled as “ambient”.
DeepView aims to address this challenge. It constructs a global bipartite graph with Azure Compute clusters on one side and Azure Storage clusters on the other. This gives us a global picture of Azure that was not available before. By applying a statistical machine learning algorithm to this global view, we have detected and revealed many previously unknown patterns, such as storage performance issues and top-of-rack switch reload issues.
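The bipartite view described above can be illustrated with a minimal sketch. It is not DeepView's actual algorithm, which is not detailed here: the cluster names, the event format, and the median-based threshold are all assumptions made for illustration, standing in for the statistical machine learning step.

```python
import statistics
from collections import defaultdict

def build_bipartite_counts(events):
    """Count failure events per (compute cluster, storage cluster) edge.

    `events` is an iterable of (compute_cluster, storage_cluster) pairs,
    one pair per failure event (a hypothetical input format).
    """
    edge_counts = defaultdict(int)
    for compute, storage in events:
        edge_counts[(compute, storage)] += 1
    return dict(edge_counts)

def side_totals(edge_counts):
    """Sum edge counts onto each side of the bipartite graph."""
    compute_totals = defaultdict(int)
    storage_totals = defaultdict(int)
    for (compute, storage), count in edge_counts.items():
        compute_totals[compute] += count
        storage_totals[storage] += count
    return dict(compute_totals), dict(storage_totals)

def suspects(totals, factor=3.0):
    """Flag clusters whose total exceeds `factor` times the side's median.

    This simple heuristic is an assumption of the sketch; it ignores
    cluster size and clusters that reported no events at all.
    """
    median = statistics.median(totals.values())
    threshold = factor * max(median, 1)
    return sorted(name for name, total in totals.items() if total > threshold)

# Hypothetical data: storage cluster S2 misbehaves, so every compute
# cluster that talks to it reports failures.
events = (
    [("C1", "S2")] * 10 + [("C2", "S2")] * 12 + [("C3", "S2")] * 9
    + [("C1", "S1")] + [("C3", "S3")]
)
compute_totals, storage_totals = side_totals(build_bipartite_counts(events))
suspect_storage = suspects(storage_totals)
suspect_compute = suspects(compute_totals)
```

In this toy data the failures are spread evenly across the compute side but concentrated on one storage cluster, which is the kind of pattern that only becomes visible once both sides are viewed together.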
The global view and bipartite graph ideas of DeepView can be extended to troubleshooting other services.
NetBouncer for network link and device failure localization
The availability of data center services is jeopardized by various network incidents. One of the biggest challenges in network incident handling is to accurately detect and localize faulty network devices and links in real time, among hundreds of thousands of networking devices and millions of cables and fibers.
NetBouncer is designed to address this challenge. NetBouncer has servers send IP-in-IP probing packets to measure the packet success probabilities of network paths without involving switch CPUs. It further introduces an algorithm that maps the packet success probabilities of the paths to the packet success rates of the links and devices.
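The path-to-link mapping can be sketched as follows. This is a minimal illustration under stated assumptions, not NetBouncer's actual algorithm: it assumes that a path's success probability is the product of the success probabilities of its links (independent losses), that each link appears at most once per path, and it uses a plain non-negative least-squares fit by coordinate descent in the log domain. The function name and link labels are hypothetical.

```python
import math

def estimate_link_success(paths, iterations=200, eps=1e-9):
    """Estimate per-link success probabilities from per-path measurements.

    `paths` is a list of (links, success_probability) pairs, where `links`
    is a tuple of link identifiers traversed by that path.
    """
    # Under the independence assumption, -log(path success) equals the
    # sum of -log(link success), so the problem becomes linear.
    targets = [(links, -math.log(max(prob, eps))) for links, prob in paths]
    cost = {link: 0.0 for links, _ in targets for link in links}

    for _ in range(iterations):
        for link in cost:
            # Residual of every path through this link, with all other
            # links held fixed; the best value is their mean, clamped
            # at zero because a success probability cannot exceed 1.
            residuals = [
                target - sum(cost[other] for other in links if other != link)
                for links, target in targets
                if link in links
            ]
            cost[link] = max(0.0, sum(residuals) / len(residuals))

    return {link: math.exp(-value) for link, value in cost.items()}

# Hypothetical measurements: link B drops 10% of packets, A and C are healthy.
paths = [
    (("A", "B"), 0.9),
    (("A", "C"), 1.0),
    (("B", "C"), 0.9),
]
rates = estimate_link_success(paths)
```

In this example every path through link B shows the same loss while the path that avoids B is clean, so the fit attributes the loss to B alone. With fewer or less diverse paths the system can be under-determined, and several link assignments would explain the measurements equally well.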
Both our analysis and our experimental results show that NetBouncer achieves (close to) zero false positives and false negatives, even in the presence of various measurement data inconsistencies. We have implemented and deployed NetBouncer in Microsoft data centers, and it is now an indispensable service for network troubleshooting and automatic incident mitigation.
The CloudBrain system
The CloudBrain system is a real-time streaming system motivated by and built for automatic troubleshooting of cloud systems. Beyond the common characteristics of a real-time streaming system, CloudBrain leverages the characteristics of troubleshooting data (e.g., the data can be compressed in both lossless and lossy ways) for faster and more efficient data processing. It further introduces troubleshooting operators to simplify the writing of troubleshooting algorithms.
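To make the distinction between lossless and lossy compression of troubleshooting data concrete, here is a small sketch. The schemes CloudBrain actually uses are not described here, so both techniques below are assumptions chosen for illustration: run-length encoding for a stream of repeated status values (lossless), and per-window summaries for numeric samples (lossy). All function names are hypothetical.

```python
def run_length_encode(values):
    """Lossless: collapse consecutive repeats into [value, count] pairs."""
    runs = []
    for value in values:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return runs

def run_length_decode(runs):
    """Rebuild the original sequence exactly from [value, count] pairs."""
    return [value for value, count in runs for _ in range(count)]

def summarize_windows(samples, window):
    """Lossy: replace each window of samples by (count, mean, max).

    The individual samples cannot be recovered, but the summary keeps
    what a troubleshooting query would typically need.
    """
    summaries = []
    for start in range(0, len(samples), window):
        chunk = samples[start:start + window]
        summaries.append((len(chunk), sum(chunk) / len(chunk), max(chunk)))
    return summaries

# Hypothetical probe results: mostly "ok", with a short failure burst.
status = ["ok"] * 4 + ["fail"] * 2 + ["ok"]
encoded = run_length_encode(status)

# Hypothetical latency samples in milliseconds.
latencies = [1.0, 3.0, 2.0, 10.0, 4.0]
summary = summarize_windows(latencies, window=2)
```

Health signals tend to repeat the same value for long stretches, which is why run-length encoding is a natural fit for the lossless case, while window summaries trade per-sample detail for a much smaller stream.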
People
Lidong Zhou
Corporate Vice President, Chief Scientist of Microsoft Asia Pacific R&D Group, Managing Director of Microsoft Research Asia
Dan Ports
Principal Researcher