Capuchin: Tensor-based GPU Memory Management for Deep Learning

  • Quan Peng,
  • Xuanhua Shi,
  • Hulin Dai,
  • Hai Jin,
  • Weiliang Ma,
  • Qian Xiong,
  • Xuehai Qian

ASPLOS

In recent years, deep learning has achieved unprecedented success in various domains; a key enabler of this success is larger and deeper deep neural networks (DNNs) that reach very high accuracy. On the other hand, since GPU global memory is a scarce resource, large models also pose a significant challenge due to their memory requirements during training. This restriction limits the flexibility of DNN architecture exploration.

In this paper, we propose Capuchin, a tensor-based GPU memory management module that reduces the memory footprint via tensor eviction/prefetching and recomputation. The key feature of Capuchin is that it makes memory management decisions based on dynamic tensor access patterns tracked at runtime. This design is motivated by the observation that tensor accesses are regular across training iterations. Based on the identified patterns, one can exploit the full memory optimization space and exercise fine-grained, flexible control over when and how to apply memory optimization techniques.
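To make the idea concrete, the following Python sketch illustrates one possible way such a decision could be made. It is only a minimal illustration under assumed names and costs (`TensorRecord`, `MemoryPlanner`, a fixed PCIe bandwidth), not Capuchin's actual implementation: per-tensor access timestamps are recorded during a measured iteration, and each tensor is then assigned to eviction/prefetching (swap) or recomputation depending on which restore path is cheaper.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class TensorRecord:
    """Per-tensor access history observed during a measured iteration.
    All names and fields here are illustrative, not Capuchin's API."""
    size_bytes: int
    recompute_cost_ms: float                 # assumed cost to regenerate the tensor
    access_times_ms: List[float] = field(default_factory=list)

class MemoryPlanner:
    """Toy planner: after one profiled iteration, decide per tensor whether to
    swap it out (and prefetch it before its next use) or recompute it."""

    def __init__(self, pcie_bw_gb_per_s: float = 12.0):
        self.bytes_per_ms = pcie_bw_gb_per_s * 1e9 / 1e3   # assumed transfer rate
        self.records: Dict[str, TensorRecord] = {}

    def on_access(self, name: str, size_bytes: int,
                  recompute_cost_ms: float, t_ms: float) -> None:
        """Record one runtime access to a tensor at timestamp t_ms."""
        rec = self.records.setdefault(name, TensorRecord(size_bytes, recompute_cost_ms))
        rec.access_times_ms.append(t_ms)

    def plan(self) -> Dict[str, str]:
        """Return {tensor_name: 'swap' | 'recompute' | 'keep'} decisions."""
        decisions = {}
        for name, rec in self.records.items():
            if len(rec.access_times_ms) < 2:
                decisions[name] = 'keep'      # accessed once; nothing to free and restore
                continue
            idle_gap_ms = rec.access_times_ms[1] - rec.access_times_ms[0]
            swap_cost_ms = rec.size_bytes / self.bytes_per_ms
            # If the transfer hides inside the idle gap between two accesses,
            # swapping is effectively free; otherwise pick the cheaper restore path.
            if swap_cost_ms <= idle_gap_ms or swap_cost_ms <= rec.recompute_cost_ms:
                decisions[name] = 'swap'
            else:
                decisions[name] = 'recompute'
        return decisions
```

Because tensor accesses recur with the same pattern in every training iteration, decisions derived from accesses observed in one iteration can be applied to the following iterations; this regularity is what the sketch above relies on.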

We deploy Capuchin in a widely-used deep learning framework, TensorFlow, and show that Capuchin can reduce the memory footprint by up to 85% across 6 state-of-the-art DNNs compared to the original TensorFlow. In particular, for the NLP task BERT, the maximum batch size that Capuchin can support outperforms TensorFlow and gradient-checkpointing by 7x and 2.1x, respectively. We also show that Capuchin outperforms vDNN and gradient-checkpointing by up to 286% and 55% under the same memory oversubscription.