Fighting the Fog of War: Automated Incident Detection for Cloud Systems
- Liqun Li
- Xu Zhang
- Xin Zhao
- Hongyu Zhang
- Yu Kang
- Pu Zhao
- Bo Qiao
- Shilin He
- Pochian Lee
- Jeffrey Sun
- Feng Gao
- Li Yang
- Qingwei Lin 林庆维
- Saravanakumar Rajmohan
- Zhangwei Xu
- Dongmei Zhang
2021 USENIX Annual Technical Conference (USENIX ATC'21)
Incidents and outages dramatically degrade the availability of large-scale cloud computing systems such as AWS, Azure, and GCP. In current incident response practice, each team has only a partial view of the entire system, which makes detecting incidents like fighting in the "fog of war." As a result, mitigation takes longer and financial losses grow. In this work, we propose an automatic incident detection system, namely Warden, as a part of the Incident Management (IcM) platform. Warden collects alerts from different services and detects the occurrence of incidents from a global perspective. For each detected potential incident, Warden notifies the relevant on-call engineers so that they can properly prioritize their tasks and initiate cross-team collaboration. We implemented and deployed Warden in the IcM platform of Azure. Our evaluation, based on data collected over an 18-month period from 26 major services, shows that Warden is effective and outperforms the baseline methods. For the majority of successfully detected incidents (68%), Warden is faster than humans, particularly for incidents that take a long time to detect manually.