Root Cause Analysis for Microservice Systems via Hierarchical Reinforcement Learning from Human Feedback

  • Lu Wang ,
  • Chaoyun Zhang ,
  • Ruomeng Ding ,
  • Yong Xu ,
  • Qihang Chen ,
  • Wentao Zou ,
  • Qingjun Chen ,
  • Meng Zhang ,
  • Xue‐Chao Gao ,
  • Hao Fan ,
  • S. Rajmohan ,
  • ,
  • Dongmei Zhang

KDD'23 ADS |

Publication

In microservice systems, the identification of root causes of anomalies is imperative for service reliability and business impact. This process is typically divided into two phases: (i)constructing a service dependency graph that outlines the sequence and structure of system components that are invoked, and (ii) localizing the root cause components using the graph, traces, logs, and Key Performance Indicators (KPIs) such as latency. However, both phases are not straightforward due to the highly dynamic and complex nature of the system, particularly in large-scale commercial architectures like Microsoft Exchange. In this paper, we propose a new framework that employs Hierarchical Reinforcement Learning from Human Feedback (HRLHF) to address these challenges. Our framework leverages the static topology of the microservice system and efficiently employs the feedback of engineers to reduce uncertainty in the discovery of the service dependency graph. The framework utilizes reinforcement learning to reduce the number of queries required from O(N2) to O(1), enabling the construction of the dependency graph with high accuracy and minimal human effort. Additionally, we extend the discovered dependency graphs to window causal graphs that capture the characteristics of time series over a specified time period, resulting in improved root cause analysis accuracy and robustness. Evaluations on both real datasets from Microsoft Exchange and synthetic datasets with injected anomalies demonstrate superior performance on various metrics compared to state-of-the-art methods. It is worth mentioning that, our framework has been integrated as a crucial component in Microsoft M365 Exchange service.