Statically Detecting Data Leakages in Data Science Code

  • Pavle Subotić,
  • Uros Bojanić,
  • Milan Stojić

International Workshop on the State Of the Art in Program Analysis

Data leakage is a well-known problem in machine learning. Data leakage occurs when information from outside the training dataset is used to create a model. This phenomenon renders a model's evaluation excessively optimistic, or even renders the model useless in the real world, since the model relies heavily on the unfairly acquired information. To date, data leakage is detected post-mortem using runtime methods. In this paper, we develop a static data leakage analysis that detects several classes of data leakage at development time. Our analysis is designed to be lightweight so that it can be performed in seconds. We have integrated our analysis into the NBLyzer static analyzer. To the best of our knowledge, we propose the first static detection of data leakages.
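To make the problem concrete, the sketch below (not taken from the paper; a generic illustration using scikit-learn's StandardScaler and train_test_split) shows one of the most common leakage patterns in data science code: preprocessing statistics are computed over the entire dataset before the train/test split, so information about the test rows leaks into training. A static analysis of the kind described here could flag the leaky version during development, before any model is trained.

```python
# Minimal illustration (assumed example, not the paper's benchmark) of
# train/test leakage via preprocessing done before the split.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))      # toy feature matrix
y = (X[:, 0] > 0).astype(int)      # toy labels

# LEAKY: the scaler's mean/std are computed over every row,
# including rows that will later end up in the test set.
X_scaled = StandardScaler().fit_transform(X)
X_tr, X_te, y_tr, y_te = train_test_split(X_scaled, y, random_state=0)

# CORRECT: split first, then fit the scaler on the training rows only
# and reuse it to transform the held-out test rows.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
scaler = StandardScaler().fit(X_tr)
model = LogisticRegression().fit(scaler.transform(X_tr), y_tr)
print("test accuracy:", model.score(scaler.transform(X_te), y_te))
```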