How to Fight Production Incidents? An Empirical Study on a Large-scale Cloud Service

SoCC 2022 |

Organized by ACM

PDF

Production incidents in today’s large-scale cloud services can be extremely expensive in terms of customer impacts and engineering resources required to mitigate them. Despite continuous reliability efforts, cloud services still experience severe incidents due to various root-causes. Worse, many of these incidents last for a long period as existing techniques and practices fail to quickly detect and mitigate them. To better understand the problems, we carefully study hundreds of recent high severity incidents and their postmortems in a large cloud based service used by hundreds of millions of users. We answer: (a) why the incidents occurred and how they were resolved, (b) what the gaps were in current processes which caused delayed response, and (c) what automation could help make the services resilient. Finally, we uncover interesting insights by a novel multi-dimensional analysis that correlates different troubleshooting stages (detection, root-causing and mitigation), and provide guidance on how to tackle complex incidents through automation or testing at different granularity.