Accelerating Deep Learning Training: A Storage Perspective

PhD Thesis: University of Texas at Austin

Deep Learning, specifically Deep Neural Networks (DNNs), is stressing storage systems in new ways, shifting the training bottleneck from computation at the GPUs to the data pipeline: fetching and pre-processing data, and writing checkpoints. This leaves the expensive accelerator devices stalled, waiting for data. While prior research has explored many ways of accelerating DNN training, the impact of storage systems, specifically the data pipeline, on ML training has remained relatively unexplored. In this dissertation, we study the role of the data pipeline in various training scenarios, and based on the insights from our study, we present the design and evaluation of systems that accelerate training.
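To make the notion of a data stall concrete, the sketch below instruments an ordinary PyTorch training loop to separate time spent waiting on the data pipeline from time spent computing on the GPU. This is a minimal illustration, not the tooling described in this dissertation; the model, dataset, and batch size are placeholders.

```python
# Illustrative sketch: split one epoch of a PyTorch training loop into time
# spent waiting on the data pipeline versus time spent computing.
import time
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def main():
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = nn.Linear(1024, 10).to(device)            # stand-in for a real DNN
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    loss_fn = nn.CrossEntropyLoss()

    # Synthetic tensors stand in for samples read and decoded from storage.
    dataset = TensorDataset(torch.randn(4096, 1024), torch.randint(0, 10, (4096,)))
    loader = DataLoader(dataset, batch_size=128, num_workers=2)

    fetch_time = compute_time = 0.0
    t_prev = time.perf_counter()
    for x, y in loader:                               # fetch + pre-process (data pipeline)
        t_ready = time.perf_counter()
        fetch_time += t_ready - t_prev                # time the accelerator waited for data

        x, y = x.to(device), y.to(device)
        optimizer.zero_grad()
        loss_fn(model(x), y).backward()
        optimizer.step()
        if device == "cuda":
            torch.cuda.synchronize()                  # count queued GPU work as compute

        t_prev = time.perf_counter()
        compute_time += t_prev - t_ready

    print(f"data stall time: {fetch_time:.2f}s  compute time: {compute_time:.2f}s")

if __name__ == "__main__":
    main()
```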

We first present a comprehensive analysis of how the storage subsystem affects the training of widely used DNN models, using a tool we build, DS-Analyzer. Our study reveals that in many cases, DNN training time is dominated by data stalls: time spent waiting for data to be fetched from (or written to) storage and pre-processed. We then describe CoorDL, a user-space data loading library that addresses data stalls on dedicated single-user servers with fixed resource capacities. Next, we design and evaluate Synergy, a workload-aware scheduler for shared GPU clusters that mitigates data stalls by allocating auxiliary resources such as CPU and memory in a manner cognizant of workload requirements. Finally, we present CheckFreq, a framework that frequently checkpoints model state to storage for fault tolerance, thereby reducing wasted GPU work on job interruptions while also minimizing stalls due to checkpointing.
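As an illustration of the idea of frequent, low-overhead checkpointing in the spirit of CheckFreq (though not its implementation), the sketch below splits a checkpoint into a quick in-memory snapshot, after which training resumes, and an asynchronous write to storage that overlaps with subsequent iterations. The names snapshot, persist_async, and the checkpoint interval are hypothetical.

```python
# Illustrative sketch of two-phase checkpointing: phase 1 takes a fast
# in-memory snapshot, phase 2 persists it on a background thread so the
# GPU keeps training. Names and the toy model are placeholders.
import threading
import torch
import torch.nn as nn

def snapshot(model):
    # Phase 1: copy model state to CPU memory; training resumes once this returns.
    # A full system would also capture optimizer state and the data-iterator position.
    return {k: v.detach().cpu().clone() for k, v in model.state_dict().items()}

def persist_async(state, path):
    # Phase 2: serialize and write the snapshot in the background, overlapping
    # the expensive I/O with subsequent training iterations.
    thread = threading.Thread(target=torch.save, args=(state, path), daemon=True)
    thread.start()
    return thread

# Toy training loop that checkpoints every `interval` iterations.
model = nn.Linear(8, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
interval, pending = 50, None
for step in range(200):
    loss = model(torch.randn(32, 8)).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % interval == 0:
        if pending is not None:
            pending.join()                  # keep at most one checkpoint in flight
        pending = persist_async(snapshot(model), f"ckpt_{step}.pt")
if pending is not None:
    pending.join()                          # wait for the last write to finish
```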

Our dissertation shows that data stalls can squander the performance benefits of faster GPUs. By building and evaluating systems that mitigate data stalls across several training scenarios, it further demonstrates that an efficient data pipeline is critical to speeding up end-to-end training.