Onion: Identifying Incident-indicating Logs for Cloud Systems

ESEC/FSE'21 |

In cloud systems, incidents affect the availability of services and require quick mitigation actions. Once an incident occurs, operators and developers often examine logs to perform fault diagnosis. However, the large volume of diverse logs and the overwhelming details in log data make the manual diagnosis process time-consuming and error-prone. In this paper, we propose Onion, an automatic solution for precisely and efficiently locating incident-indicating logs, which can provide useful clues for diagnosing the incidents. We first point out three criteria for localizing incident-indicating logs, i.e., Consistency, Impact, and Bilateral-Difference. Then we propose a novel agglomeration of logs, called log clique, based on which these criteria are satisfied. To obtain log cliques, we develop an incident-aware log representation and a progressive log clustering technique. Contrast analysis is then performed on the cliques to identify the incident-indicating logs. We have evaluated Onion using well-labeled log datasets. Onion achieves an average F1-score of 0.95 and can process millions of logs in only a few minutes, demonstrating its effectiveness and efficiency. Onion has also been successfully applied to the cloud system of Microsoft. Its practicability has been confirmed through the quantitative and qualitative analysis of the real incident cases.