Active label cleaning for improved dataset quality under resource constraints

Melanie Bernhardt; Daniel Coelho de Castro; Ryutaro Tanno; Anton Schwaighofer; Kerem C. Tezcan; Miguel Monteiro; Shruthi Bannur; Matthew P. Lungren; Aditya Nori; Ben Glocker; Javier Alvarez-Valle; Ozan Oktay

Nature Communications | March 2022, Vol 13(1161): pp. 1-11

Abstract: Imperfections in data annotation, known as label noise, are detrimental to the training of machine learning models and have an often-overlooked confounding effect on the assessment of model performance. Nevertheless, employing experts to remove label noise by fully re-annotating large datasets is infeasible in resource-constrained settings, such as healthcare. This work advocates for a data-driven approach to prioritising samples for re-annotation – which we term “active label cleaning”. We propose to rank instances according to estimated label correctness and labelling difficulty of each sample, and introduce a simulation framework to evaluate relabelling efficacy. Our experiments on natural images and on a new medical imaging benchmark show that cleaning noisy labels mitigates their negative impact on model training, evaluation, and selection. Crucially, the proposed active label cleaning enables correcting labels up to 4 times more effectively than typical random selection in realistic conditions, making better use of experts’ valuable time for improving dataset quality.
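The ranking idea in the abstract can be illustrated with a minimal sketch. The abstract does not give a formula, so the score used here is an assumption: cross-entropy between the current (possibly noisy) label distribution and a model's predicted class probabilities, minus the entropy of that prediction. The first term grows when the label looks wrong; the second lowers the priority of samples the model itself finds ambiguous. The names `cleaning_priority` and `rank_for_relabelling` are hypothetical and do not come from the paper or the InnerEye toolbox.

```python
import math


def cleaning_priority(probs, label_counts, eps=1e-12):
    """Assumed score: cross-entropy(labels, prediction) - entropy(prediction).

    probs        -- predicted class probabilities for one sample
    label_counts -- number of annotations received so far for each class
    """
    total = sum(label_counts)
    label_dist = [count / total for count in label_counts]
    # High when the current label disagrees with the model (likely mislabelled).
    cross_entropy = -sum(q * math.log(p + eps) for q, p in zip(label_dist, probs))
    # High when the model is uncertain (sample is likely hard to label).
    entropy = -sum(p * math.log(p + eps) for p in probs)
    return cross_entropy - entropy


def rank_for_relabelling(all_probs, all_label_counts):
    """Return sample indices ordered from highest to lowest priority."""
    scores = [cleaning_priority(p, c) for p, c in zip(all_probs, all_label_counts)]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Toy example with two classes and one annotation per sample.
predictions = [[0.9, 0.1], [0.9, 0.1], [0.5, 0.5]]
annotations = [[1, 0], [0, 1], [0, 1]]
order = rank_for_relabelling(predictions, annotations)
```

In this toy example, sample 1 (confident prediction that contradicts its label) is ranked first, the ambiguous sample 2 comes next, and sample 0 (label agrees with a confident prediction) comes last. Under a fixed relabelling budget, annotators would work through the list from the top.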

Publication Downloads

InnerEye – Deep Learning

September 22, 2020

InnerEye is a deep learning toolbox for training models on medical images (or, more generally, 3D images). It integrates with cloud computing in Azure.