An Empirical Study on Low GPU Utilization of Deep Learning Jobs

ICSE 2024 |

Published by IEEE/ACM

The 46th International Conference on Software Engineering

DOI

Deep learning plays a critical role in numerous intelligent software applications. Enterprise developers submit and run deep learning jobs on shared, multi-tenant platforms to efficiently train and test models. These platforms are typically equipped with a large number of graphics processing units (GPUs) to expedite deep learning computations. However, certain jobs exhibit rather low utilization of the allocated GPUs, resulting in substantial resource waste and reduced development productivity. This paper presents a comprehensive empirical study on low GPU utilization of deep learning jobs, based on 400 real jobs (with an average GPU utilization of 50% or less) collected from Microsoft’s internal deep learning platform. We discover 706 low-GPU-utilization issues through meticulous examination of job metadata, execution logs, runtime metrics, scripts, and programs. Furthermore, we identify the common root causes and propose corresponding fixes. Our main findings include: (1) Low GPU utilization of deep learning jobs stems from insufficient GPU computations and interruptions caused by non-GPU tasks; (2) Approximately half (46.03%) of the issues are attributed to data operations; (3) 45.18% of the issues are related to deep learning models and manifest during both model training and evaluation stages; (4) Most (84.99%) low-GPU-utilization issues could be fixed with a small number of code/script modifications. Based on the study results, we propose potential research directions that could help developers utilize GPUs better in cloud-based platforms.