Performance Simulation for Large-Scale Distributed Training

Deep learning (DL) has been increasingly adopted by a variety of software-intensive systems. Developers mainly use GPUs to accelerate the training, testing, and deployment of DL models. However, distributed training performance is often unknown to them before a DL job executes. As a result, an improper choice of cluster scale or neural network hyperparameters can cause such a job to exceed its budget and be terminated early. Our recent empirical study found that many DL jobs are killed due to the exhaustion of GPU memory, which leads to an enormous waste of computing resources, especially for large-scale training. In this paper, we propose PerfSim, an accurate predictor of distributed training performance. PerfSim employs an analytic simulation approach to systematically estimate the execution time of both computation and communication operations. We have evaluated PerfSim on 5 representative real-world models with different scales and hyperparameters. Our extensive experiments show that PerfSim is effective in predicting distributed training performance.
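
To illustrate the general idea of an analytic simulation of computation and communication time (not PerfSim's actual model, whose details are not given here), the following minimal Python sketch estimates per-iteration time from assumed hardware parameters. All names (`ClusterSpec`, `step_time`, etc.) and the ring all-reduce cost formula are illustrative assumptions, not taken from the paper.

```python
from dataclasses import dataclass

@dataclass
class ClusterSpec:
    num_gpus: int              # number of data-parallel workers
    peak_flops: float          # per-GPU peak throughput (FLOP/s)
    compute_efficiency: float  # assumed fraction of peak actually achieved
    bandwidth: float           # inter-GPU link bandwidth (bytes/s)
    latency: float             # per-message latency (s)

def compute_time(flops_per_step: float, spec: ClusterSpec) -> float:
    """Analytic estimate of per-step computation time on one GPU."""
    return flops_per_step / (spec.peak_flops * spec.compute_efficiency)

def allreduce_time(grad_bytes: float, spec: ClusterSpec) -> float:
    """Ring all-reduce cost model: each GPU transfers ~2(n-1)/n of the gradient volume."""
    n = spec.num_gpus
    if n == 1:
        return 0.0
    volume = 2.0 * (n - 1) / n * grad_bytes
    return spec.latency * 2 * (n - 1) + volume / spec.bandwidth

def step_time(flops_per_step: float, grad_bytes: float,
              spec: ClusterSpec, overlap: float = 0.0) -> float:
    """Per-iteration time: computation plus the non-overlapped share of communication."""
    t_comp = compute_time(flops_per_step, spec)
    t_comm = allreduce_time(grad_bytes, spec)
    return t_comp + (1.0 - overlap) * t_comm

# Hypothetical example: 8 GPUs, ~300 GFLOP per step, 400 MB of gradients,
# with half of the communication overlapped with backward computation.
spec = ClusterSpec(num_gpus=8, peak_flops=15e12, compute_efficiency=0.35,
                   bandwidth=12e9, latency=5e-6)
print(f"estimated step time: {step_time(300e9, 400e6, spec, overlap=0.5):.4f} s")
```

Summing the per-iteration estimate over the planned number of iterations gives a job-level runtime estimate, which is the kind of prediction a developer could use to check cluster scale and hyperparameter choices against a budget before launching the job.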