
Project InnerEye: Open-Source Software for Medical Imaging AI

Should expert radiologists label individual images or entire examinations?


Winner of an RSNA 2022 Trainee Research Prize:

Read the full preprint on arXiv: Weakly Supervised Learning Significantly Reduces the Number of Labels Required for Intracranial Hemorrhage Detection on Head CT (arXiv:2211.15924).

Jacopo Teneggi, a PhD student at Johns Hopkins University (JHU), used InnerEye OSS and Azure Machine Learning to help answer the question: “Should expert radiologists label individual images or entire examinations?” The project was conducted in collaboration with Prof. Paul Yi, MD (Director of the University of Maryland Medical Intelligent Imaging (UM2ii) Center) and under the supervision of Prof. Jeremias Sulam (Assistant Professor in the Biomedical Engineering Department at JHU).

Challenge 

Deep Learning (DL) continues to drive exciting advances across medical imaging tasks, from image reconstruction and enhancement to automatic lesion detection and segmentation. However, the labor-intensive collection of image annotations by experts hinders the development of DL models in radiology. Classifiers are typically trained in a fully supervised fashion, which requires many labelled CT slices and, consequently, a large budget for radiologists to review each image slice. An alternative is weakly supervised learning, which uses labels derived from the radiology examination report instead of individual image labels. Jacopo’s work compares the performance and scalability of these two approaches as the number of images grows.

Solution 

A Multiple Instance Learning (MIL) weakly supervised machine learning model was developed and compared to a strongly supervised model that used individual image labels for training. Both models were trained on the RSNA 2019 Brain CT Haemorrhage Challenge dataset, which comprises 21,784 examinations with a total of 752,803 images.

The RSNA Intracranial Hemorrhage Detection challenge on Kaggle: https://www.kaggle.com/competitions/rsna-intracranial-hemorrhage-detection/overview

Every image in the RSNA dataset was labelled by expert neuroradiologists with the type(s) of haemorrhage present. In addition to the RSNA dataset, the models were compared on two external test sets: the CQ500 dataset (436 examinations) and the CT-ICH dataset (75 examinations). The CQ500 dataset only provides examination-level labels, while the CT-ICH dataset provides both image-level labels and manual annotations of the bleeds. Hence, Jacopo extended the CQ500 dataset with the segmentations provided in the BHX dataset.
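To make the difference in label granularity concrete, here is a minimal sketch (hypothetical variable names, not code from the project) of how an examination-level label relates to image-level labels, and why the reverse mapping is not available in weakly labelled datasets such as CQ500.

```python
# Hypothetical example of the two label granularities.
# RSNA provides one binary label per CT slice; CQ500 only provides
# one label per examination.
image_labels = [0, 0, 1, 1, 0]        # strong supervision: one label per slice
exam_label = int(any(image_labels))   # weak supervision: 1 if any slice shows haemorrhage
print(exam_label)                     # -> 1
# The reverse (recovering per-slice labels from an examination label)
# is not possible, which is what makes examination-level labels "weak".
```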

The MIL framework regards each examination as a bag of images that is labelled “with haemorrhage” if at least one image shows signs of haemorrhage, and “healthy” otherwise. Using an attention-based MIL architecture, two models were trained: a fully supervised model using image labels, and a weakly supervised model using examination-level labels. Both are built from the same components: a ResNet18 encoder pretrained on ImageNet, a two-layer attention module, and a binary linear classifier with sigmoid activation. The strong learner is the composition of the encoder with the classifier, whereas the weak learner is the composition of the encoder (applied to each image in the examination), the attention mechanism, and the final classifier.
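The sketch below illustrates this composition in PyTorch. It is not the authors’ code: the attention and layer dimensions are assumptions, and only the overall structure (shared ResNet18 encoder, two-layer attention pooling, sigmoid classifier) follows the description above.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18, ResNet18_Weights

class AttentionMIL(nn.Module):
    """Shared slice encoder, two-layer attention pooling, and a sigmoid classifier."""

    def __init__(self, feature_dim: int = 512, attention_dim: int = 128):
        super().__init__()
        backbone = resnet18(weights=ResNet18_Weights.IMAGENET1K_V1)    # ImageNet-pretrained encoder
        self.encoder = nn.Sequential(*list(backbone.children())[:-1])  # drop the final FC layer
        self.attention = nn.Sequential(                                # two-layer attention module
            nn.Linear(feature_dim, attention_dim),
            nn.Tanh(),
            nn.Linear(attention_dim, 1),
        )
        self.classifier = nn.Sequential(nn.Linear(feature_dim, 1), nn.Sigmoid())

    def encode(self, images: torch.Tensor) -> torch.Tensor:
        # images: (n_slices, 3, H, W) -> features: (n_slices, feature_dim)
        return self.encoder(images).flatten(start_dim=1)

    def forward_strong(self, images: torch.Tensor) -> torch.Tensor:
        """Strong learner: encoder + classifier, one prediction per image."""
        return self.classifier(self.encode(images)).squeeze(-1)         # (n_slices,)

    def forward_weak(self, images: torch.Tensor) -> torch.Tensor:
        """Weak learner: encoder on each image, attention pooling, classifier."""
        features = self.encode(images)                                   # (n_slices, feature_dim)
        weights = torch.softmax(self.attention(features), dim=0)         # (n_slices, 1)
        bag_feature = (weights * features).sum(dim=0)                     # (feature_dim,)
        return self.classifier(bag_feature).squeeze(-1)                   # one prediction per examination
```

In this framing, the weak learner needs only one binary label per examination to compute its training loss, while the strong learner needs a label for every slice.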

The ML models were originally developed in PyTorch and were easily ported to the InnerEye OSS Deep Learning Toolkit via its Bring Your Own PyTorch Lightning Model functionality. The integration of InnerEye OSS with Azure Machine Learning allowed Jacopo to store and manage the large volumes of data required for training, and to train models efficiently at scale on multiple GPUs. The flexibility of the OSS with Azure Machine Learning, coupled with the direct availability in the OSS of TorchIO (a popular open-source Python library for efficient processing and augmentation of 3D medical images), enabled Jacopo to prototype his models quickly. Finally, Jacopo extended the InnerEye code to support Weights & Biases logging to help track experiments, evaluate model performance, and fine-tune the hyperparameters of the training process.
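A rough sketch of what that training setup can look like on the PyTorch Lightning side is shown below, with a TorchIO augmentation pipeline and a Weights & Biases logger. The InnerEye-specific container code is omitted, and the hyperparameters, project name, and data handling are illustrative assumptions rather than the project’s actual configuration.

```python
import pytorch_lightning as pl
import torch
import torch.nn.functional as F
import torchio as tio
from pytorch_lightning.loggers import WandbLogger

# Example 3D augmentation pipeline with TorchIO (the specific transforms are
# illustrative); it would be applied inside the data module.
augmentation = tio.Compose([tio.RandomAffine(degrees=10), tio.RandomNoise(std=0.01)])

class WeakMILModule(pl.LightningModule):
    """Lightning wrapper around the weak (examination-level) learner sketched above."""

    def __init__(self, lr: float = 1e-4):
        super().__init__()
        self.model = AttentionMIL()
        self.lr = lr

    def training_step(self, batch, batch_idx):
        images, exam_label = batch                   # one bag of slices and one binary label
        images = images.squeeze(0)                   # assumes a batch size of one bag
        pred = self.model.forward_weak(images)
        loss = F.binary_cross_entropy(pred, exam_label.float().squeeze())
        self.log("train/loss", loss)                 # forwarded to Weights & Biases
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.lr)

# Experiment tracking with Weights & Biases and multi-GPU training.
logger = WandbLogger(project="weakly-supervised-ich")      # project name is hypothetical
trainer = pl.Trainer(max_epochs=10, accelerator="gpu", devices=4, logger=logger)
# trainer.fit(WeakMILModule(), datamodule=...)             # data module omitted for brevity
```

When submitted through InnerEye OSS, the same Lightning module runs as an Azure Machine Learning job, which is what provides the data management and multi-GPU scaling described above.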

The results in Figure 1 show that there is virtually no difference between strong (SL) and weak learners (WL) on the RSNA and CQ500 datasets. The weak learner, however, exhibits significantly higher generalization power on the CT-ICH dataset.   

Figure 1: Examination-level haemorrhage detection performance of strong and weak learners on the RSNA, CQ500, and CT-ICH datasets.

Figure 2 shows results for image-level haemorrhage detection. There appears to be no significant qualitative (saliency maps) or quantitative (F1 scores) difference between the strong and weak learners.

Figure 2: Image-level haemorrhage detection results (saliency maps and F1 scores) for strong and weak learners.

Figure 3 shows the mean examination-level haemorrhage detection performance on the RSNA dataset as a function of the number of labels available to each learner during training, denoted by m. The fully supervised strong learner performs better with fewer than around 10,000 image labels; beyond that point, however, the weak learners quickly outperform the strong learners. Importantly, the performance of weak and strong learners trained on the entire RSNA dataset is comparable, with weak learners using roughly 35 times fewer labels (21,784 examination labels versus 752,803 image labels).

Figure 3: Mean examination-level haemorrhage detection performance on the RSNA dataset as a function of the number of training labels m.

Outcome 

The results suggest that with MIL, radiologists may not need to provide labor-intensive, image-level annotations for 3D imaging volumes (e.g., CT/MRI) to train high-performing ML models. This approach could dramatically reduce the time-consuming data annotation process, overcoming a major hurdle in machine learning for medical imaging.

Read the full preprint on arXiv: Weakly Supervised Learning Significantly Reduces the Number of Labels Required for Intracranial Hemorrhage Detection on Head CT (arXiv:2211.15924).