DistIR: An Intermediate Representation for Optimizing Distributed Neural Networks

  • Keshav Santhanam,
  • Siddharth Krishna,
  • Ryota Tomioka,
  • Andrew Fitzgibbon,
  • Tim Harris

EuroMLSys '21: Proceedings of the 1st Workshop on Machine Learning and Systems

The rapidly growing size of deep neural network (DNN) models and datasets has given rise to a variety of distribution strategies such as data, horizontal, and pipeline parallelism. However, selecting the best set of strategies for a given model and hardware configuration is challenging because debugging and testing on clusters is expensive. In this work we propose DistIR, an IR for explicitly representing distributed DNN computation that can capture many popular distribution strategies. We build an analysis framework for DistIR programs, including a simulator and reference executor that can be used to automatically search for an optimal distribution strategy. Our unified global representation also eases development of new distribution strategies, as one can reuse the lowering to per-rank backend programs. Preliminary results using a grid search over a hybrid data/horizontal/pipeline-parallel space suggest DistIR and its simulator can aid automatic DNN distribution.
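To make the search concrete, here is a self-contained toy sketch in Python of the kind of simulator-guided grid search the abstract describes: an analytical cost model stands in for the simulator, and a grid search ranks hybrid (data, horizontal, pipeline) configurations. The cost formulas, constants, and candidate degrees below are illustrative assumptions for this sketch, not DistIR's actual simulator or API.

```python
# Toy stand-in for simulator-guided strategy search. All formulas and
# constants are illustrative assumptions, not DistIR's cost model.
import itertools
from dataclasses import dataclass

@dataclass
class Estimate:
    runtime: float      # estimated seconds per training step
    peak_memory: float  # estimated bytes per rank

def simulate(flops_per_sample, params, batch, d, h, p,
             device_flops=100e12, bandwidth=50e9):
    """Estimate one step of a (d, h, p) hybrid strategy on d*h*p ranks."""
    compute = flops_per_sample * batch / (device_flops * d * h * p)
    # Charge an all-reduce of the gradients when data parallelism is used;
    # a real simulator would also charge horizontal-parallel collectives.
    comm = 2 * params * 4 / bandwidth if d > 1 else 0.0
    bubble = compute * (p - 1) / p          # pipeline "bubble" overhead
    # Weights are sharded across the h*p model-parallel ranks; activation
    # memory scales with the per-rank microbatch.
    memory = params * 4 / (h * p) + (batch / d) * 1e6
    return Estimate(compute + comm + bubble, memory)

def grid_search(world_size=8, mem_limit=16e9):
    best = None
    for d, h, p in itertools.product([1, 2, 4, 8], repeat=3):
        if d * h * p != world_size:
            continue
        est = simulate(1e12, params=1e9, batch=64, d=d, h=h, p=p)
        if est.peak_memory > mem_limit:
            continue  # prune configurations that would not fit in memory
        if best is None or est.runtime < best[1].runtime:
            best = ((d, h, p), est)
    return best

print(grid_search())
```

Because every candidate strategy is ranked by the simulator rather than by a cluster run, the loop can enumerate the whole hybrid space in seconds, which is the property that makes automatic strategy search practical.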

Publication Downloads

DistIR

October 20, 2021

DistIR is an intermediate representation (IR) and associated set of tools for optimizing distributed machine learning computations (both training and inference). An IR is a program representation used by compilers and software analysis tools. DistIR's key feature is a simulator that estimates the runtime and memory requirements of a model without running it, so users can quickly pick the best distribution strategy and model configuration before deploying to a cluster.
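As a rough illustration of how such a simulator can work, the sketch below walks a toy op-level program and accumulates per-op time and memory estimates instead of executing anything. The Op structure and cost formulas are assumptions made for this example, not DistIR's actual representation or cost model.

```python
# Minimal sketch of an IR-level simulator: walk a program's ops with
# per-op cost functions, accumulating estimated time and live memory.
# Op fields and formulas are illustrative, not DistIR's actual IR.
from dataclasses import dataclass

@dataclass
class Op:
    kind: str                 # e.g. "matmul" or "allreduce"
    flops: float = 0.0        # compute work performed by the op
    bytes_out: float = 0.0    # output buffer the op materializes

def simulate(ops, device_flops=100e12, bandwidth=50e9):
    """Return (estimated runtime in s, estimated peak memory in bytes)."""
    time = live = peak = 0.0
    for op in ops:
        if op.kind == "allreduce":
            time += op.bytes_out / bandwidth   # communication-bound op
        else:
            time += op.flops / device_flops    # compute-bound op
        live += op.bytes_out                   # outputs stay live (no frees)
        peak = max(peak, live)
    return time, peak

program = [
    Op("matmul", flops=2e9, bytes_out=4e6),
    Op("allreduce", bytes_out=4e6),   # e.g. a data-parallel gradient sync
    Op("matmul", flops=2e9, bytes_out=4e6),
]
print(simulate(program))
```

Even this crude model captures the trade-off a user cares about: communication-heavy ops are charged against bandwidth while compute ops are charged against device throughput, so different distribution strategies produce different runtime and memory estimates.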