Gist: Efficient Data Encoding for Deep Neural Network Training

  • Animesh Jain,
  • Amar Phanishayee,
  • Jason Mars,
  • Lingjia Tang,
  • Gennady Pekhimenko

International Symposium on Computer Architecture (ISCA 2018)

Training modern deep neural networks (DNNs) typically relies on GPUs to handle complex, hundred-layer-deep networks. A significant problem facing both researchers and industry practitioners is that, as networks get deeper, the available GPU main memory becomes a primary bottleneck, limiting the size of the networks that can be trained.

In this paper, we investigate widely used DNNs and find that the major contributors to memory footprint are intermediate layer outputs (feature maps). We then introduce a framework for DNN-layer-specific optimizations (e.g., convolution, ReLU, pool) that significantly reduce this source of main memory pressure on GPUs. We find that a feature map typically has two uses that are spread far apart temporally. Our key approach is to store an encoded representation of feature maps for this temporal gap and decode this data for use in the backward pass; the full-fidelity feature maps are used in the forward pass and relinquished immediately.

Based on this approach, we present Gist, our system that employs two classes of layer-specific encoding schemes – lossless and lossy – to exploit existing value redundancy in DNN training and significantly reduce the memory consumption of targeted feature maps. For example, one insight is that, by taking advantage of the computational nature of backpropagation from the pool to the ReLU layer, we can store the intermediate feature map using just 1 bit instead of 32 bits per value. We deploy these mechanisms in a state-of-the-art DNN framework (CNTK) and observe that Gist reduces the memory footprint by up to 2× across 5 state-of-the-art image classification DNNs, with an average of 1.8×, only 4% performance overhead, and no effect on training accuracy. We also show that further software (e.g., cuDNN) and hardware (e.g., dynamic allocation) optimizations can result in even larger footprint reductions (up to 4.1×).
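The 1-bit idea can be sketched in a few lines of NumPy. This is a simplified illustration under the assumption that the ReLU backward step only needs the sign of its output, not the system's actual CUDA implementation; the function names are hypothetical.

```python
import numpy as np

def binarize_stash(relu_output):
    # Keep one bit per value: whether the ReLU output was positive.
    # np.packbits packs 8 boolean flags per byte, i.e. 1 bit instead of 32.
    mask = (relu_output > 0).ravel()
    return relu_output.shape, np.packbits(mask)

def relu_backward_from_bits(stash, grad_from_pool):
    # The ReLU gradient is grad * 1[x > 0]; the sign bit alone suffices,
    # so the 32-bit feature map never needs to be kept for this step.
    shape, packed = stash
    mask = np.unpackbits(packed, count=int(np.prod(shape))).reshape(shape)
    return grad_from_pool * mask

x = np.random.randn(2, 16).astype(np.float32)
y = np.maximum(x, 0.0)            # ReLU forward; output feeds the pool layer
stash = binarize_stash(y)         # 1-bit-per-value stash replaces the 32-bit map
grad_from_pool = np.ones_like(y)  # placeholder gradient arriving from the pool layer
grad_x = relu_backward_from_bits(stash, grad_from_pool)
```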