An Intelligent, End-To-End Analytics Service for Safe Deployment in Large-Scale Cloud Infrastructure

  • Ze Li
  • Qian Cheng
  • Ken Hsieh
  • Yingnong Dang
  • Peng Huang
  • Pankaj Singh
  • Xinsheng Yang
  • Youjiang Wu
  • Sebastien Levy
  • Murali Chintalapati

NSDI'20

Modern cloud systems have a vast number of components that continuously undergo changes. Deploying these frequent updates quickly without breaking the system is challenging. In this paper, we present Gandalf, an end-to-end analytics service for safe deployment in a large-scale system infrastructure. Gandalf enables rapid and robust impact assessment of software rollouts to catch bad rollouts before they cause widespread outages. Gandalf monitors and analyzes various fault signals and correlates each signal against all the ongoing rollouts using a spatial and temporal correlation algorithm. Its core decision logic includes an ensemble ranking algorithm that determines which rollout caused the fault signals and a binary classifier that assesses the impact of the fault signals. The analysis result determines whether a rollout is safe to proceed or should be stopped.
By using a lambda architecture, Gandalf provides both real-time and long-term deployment monitoring with automated decisions and notifications. Gandalf has been running in production in Microsoft Azure for more than 18 months, serving both data-plane and control-plane components. It achieves 92.4% precision and 100% recall (no high-impact service outages in Azure Compute were caused by bad rollouts) for data-plane rollouts. For control-plane rollouts, Gandalf achieves 94.87% precision and 99.84% recall.
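
To make the decision flow in the abstract concrete, the following is a minimal sketch of how fault signals might be correlated with ongoing rollouts and turned into a go/no-go decision. It is not Gandalf's actual algorithm: the names `Rollout`, `Fault`, `blame_scores`, and `decide`, the exponential time decay, the one-hour window, and the fixed `stop_threshold` (a stand-in for the paper's binary classifier) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from math import exp


@dataclass
class Rollout:
    name: str
    deployed_at: dict  # hypothetical: node id -> deployment time in seconds


@dataclass
class Fault:
    node: str
    time: float  # time the fault signal was observed, in seconds


def blame_scores(rollouts, faults, window=3600.0, decay=1800.0):
    """Score each rollout by the faults it plausibly explains.

    Spatial check: the fault occurred on a node the rollout touched.
    Temporal check: the fault occurred within `window` seconds after
    that node was updated; closer faults get more weight (assumed decay).
    """
    scores = {r.name: 0.0 for r in rollouts}
    for fault in faults:
        for rollout in rollouts:
            deployed = rollout.deployed_at.get(fault.node)
            if deployed is None:
                continue  # rollout never reached this node
            gap = fault.time - deployed
            if 0 <= gap <= window:
                scores[rollout.name] += exp(-gap / decay)
    return scores


def decide(rollouts, faults, stop_threshold=2.0):
    """Rank rollouts by blame score and flag those above the threshold.

    The threshold is a placeholder for a learned impact classifier.
    """
    scores = blame_scores(rollouts, faults)
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [
        (name, score, "stop" if score >= stop_threshold else "proceed")
        for name, score in ranked
    ]


# Hypothetical data: rollout A reached three nodes, rollout B only one.
rollouts = [
    Rollout("A", {"n1": 0.0, "n2": 0.0, "n3": 0.0}),
    Rollout("B", {"n3": 0.0}),
]
faults = [Fault("n1", 60.0), Fault("n2", 120.0), Fault("n3", 300.0), Fault("n4", 100.0)]
decisions = decide(rollouts, faults)
```

In this made-up example, rollout A accumulates blame from three nodes and is flagged, while rollout B shares only one faulty node and is allowed to proceed; the fault on `n4` matches no rollout and is ignored. A production system would replace the fixed threshold with a trained model and combine several scoring signals.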