Root Causing Flaky Tests in a Large-Scale Industrial Setting

International Symposium on Software Testing and Analysis (ISSTA)

In today’s agile world, developers often rely on continuous integration pipelines to help build and validate their changes by executing tests in an efficient manner. One of the significant factors that hinder developers’ productivity is flaky tests—tests that may pass and fail with the same version of code. Since flaky test failures are not deterministically reproducible, developers often have to spend hours only to discover that the occasional failures have nothing to do with their changes. However, ignoring failures of flaky tests can be dangerous, since those failures may represent real faults in the production code. Furthermore, identifying the root cause of flakiness is tedious and cumbersome, since flaky failures are often a consequence of unexpected, non-deterministic behavior arising from various factors, such as concurrency and external dependencies.

As developers in a large-scale industrial setting, we first describe our experience with flaky tests by conducting a study on them. Our results show that although the number of distinct flaky tests may be low, the percentage of failing builds due to flaky tests can be substantial. To reduce the burden of flaky tests on developers, we describe our end-to-end framework that helps identify flaky tests and understand their root causes. Our framework instruments flaky tests and all relevant code to log various runtime properties, and then uses a preliminary tool, called RootFinder, to find differences in the logs of passing and failing runs. Using our framework, we collect and publicly share a dataset of real-world, anonymized execution logs of flaky tests. By sharing the findings from our study, our framework and tool, and a dataset of logs, we hope to encourage more research on this important problem.
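
To give a rough sense of the log-differencing step, the sketch below illustrates the general idea of comparing runtime properties recorded in passing and failing runs and flagging those whose values never overlap. It is not the RootFinder implementation; the log format, property names, and the disjoint-value heuristic are assumptions made for illustration.

```python
from collections import defaultdict

# Each run's log is modeled (hypothetically) as a dict mapping a logged
# runtime property, e.g. "thread_count@TestUpload", to the value observed
# in that run.
def suspicious_properties(passing_logs, failing_logs):
    """Return properties whose observed values never overlap between
    passing and failing runs -- candidate root causes of flakiness."""
    passing_values = defaultdict(set)
    failing_values = defaultdict(set)
    for log in passing_logs:
        for prop, value in log.items():
            passing_values[prop].add(value)
    for log in failing_logs:
        for prop, value in log.items():
            failing_values[prop].add(value)
    # A property is suspicious if it appears in both kinds of runs but its
    # values in failing runs are disjoint from those in passing runs.
    return sorted(
        prop
        for prop in passing_values.keys() & failing_values.keys()
        if passing_values[prop].isdisjoint(failing_values[prop])
    )

if __name__ == "__main__":
    passing = [{"thread_count@TestUpload": 1, "retries@HttpGet": 0},
               {"thread_count@TestUpload": 1, "retries@HttpGet": 1}]
    failing = [{"thread_count@TestUpload": 4, "retries@HttpGet": 1}]
    print(suspicious_properties(passing, failing))  # ['thread_count@TestUpload']
```

In this toy example, the thread count differs consistently between passing and failing runs while the retry count does not, so only the former is reported as a candidate explanation for the flakiness.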