Robust Multimodal Failure Detection for Microservice Systems

  • Chenyu Zhao,
  • Minghua Ma,
  • Zhenyu Zhong,
  • Shenglin Zhang,
  • Zhiyuan Simon Tan,
  • Xiao Xiong,
  • Lulu Yu,
  • Jiayi Feng,
  • Yongqian Sun,
  • Yuzhi Zhang,
  • Dan Pei,
  • Dongmei Zhang

KDD'23 ADS


Proactive failure detection of instances is essential to microservice systems, because a single instance failure can propagate through the whole system and degrade its performance. Over the years, many anomaly detection methods based on single-modal data (i.e., metrics, logs, or traces) have been proposed. However, they tend to miss a large number of failures and generate numerous false alarms because they ignore the correlations among multimodal data. In this work, we propose AnoFusion, an unsupervised failure detection approach that proactively detects instance failures from multimodal data in microservice systems. It applies a Graph Transformer Network (GTN) to learn the correlations among heterogeneous multimodal data and integrates a Graph Attention Network (GAT) with a Gated Recurrent Unit (GRU) to address the challenges introduced by dynamically changing multimodal data. We evaluate AnoFusion on two datasets, where it achieves F1-scores of 0.857 and 0.922, respectively, outperforming state-of-the-art failure detection approaches.
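To make the abstract's pipeline more concrete, the sketch below shows one plausible way the GAT-plus-GRU stage could be wired up in PyTorch with PyTorch Geometric. It is not the authors' implementation: the construction of the multimodal channel graph, the GTN-based correlation learning, and the training objective are omitted, and the class name `AnoFusionSketch` together with all dimensions and hyperparameters are illustrative assumptions.

```python
# Minimal sketch (not the authors' code): a GAT + GRU failure scorer over
# multimodal channel features, assuming PyTorch and PyTorch Geometric.
# The heterogeneous channel graph is assumed to be precomputed (e.g., by a
# GTN / correlation-learning step); everything named here is hypothetical.
import torch
import torch.nn as nn
from torch_geometric.nn import GATConv


class AnoFusionSketch(nn.Module):
    def __init__(self, in_dim: int, hid_dim: int = 64, heads: int = 4):
        super().__init__()
        # Graph attention over the channel-correlation graph.
        self.gat = GATConv(in_dim, hid_dim, heads=heads, concat=False)
        # GRU captures the temporal dynamics of the per-step embeddings.
        self.gru = nn.GRU(hid_dim, hid_dim, batch_first=True)
        # Map the final hidden state to a per-instance anomaly score.
        self.score = nn.Linear(hid_dim, 1)

    def forward(self, x_seq: torch.Tensor, edge_index: torch.Tensor) -> torch.Tensor:
        # x_seq: (T, num_channels, in_dim) -- per-step features of all
        #        multimodal channels (metrics, log templates, trace features).
        # edge_index: (2, num_edges) -- edges of the channel graph.
        embeds = []
        for t in range(x_seq.size(0)):
            h_t = self.gat(x_seq[t], edge_index)   # (num_channels, hid_dim)
            embeds.append(h_t.mean(dim=0))         # pool channels per step
        seq = torch.stack(embeds).unsqueeze(0)     # (1, T, hid_dim)
        _, h_n = self.gru(seq)                     # h_n: (1, 1, hid_dim)
        return torch.sigmoid(self.score(h_n[-1]))  # anomaly score in (0, 1)


if __name__ == "__main__":
    # Toy run: 30 time steps, 12 channels, 8-dim features, random edges.
    model = AnoFusionSketch(in_dim=8)
    x_seq = torch.randn(30, 12, 8)
    edge_index = torch.randint(0, 12, (2, 40))
    print(model(x_seq, edge_index).item())
```

In this sketch the per-step channel embeddings are mean-pooled into a single vector before the GRU; other pooling or per-channel recurrence choices would be equally consistent with the high-level description in the abstract.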