Micro Co-design

The massive scale of cloud infrastructure services enables, and often necessitates, vertical co-design of the infrastructure stack by cloud providers. Micro co-design is a Minimally invasive, Cheap, and retro-fittable approach to co-design that extracts efficiency out of existing software infrastructure layers. It does this by making lightweight changes to generic software interfaces. We are exploring micro co-design in the context of four systems: Instalytics, Astra, Gandiva, and Quiver, aimed at improving the efficiency and functionality of key infrastructure for big data analytics and AI.

Instalytics achieves significant cost and performance improvements for big data analytics over large data sets by co-designing the distributed file system layer and the compute layer. Instalytics customizes the 3-way replication performed by the storage layer, keeping each copy partitioned by a different dimension while still ensuring efficient recovery. With an end-to-end implementation and evaluation of Instalytics on a cluster of 500 machines, we have demonstrated significant query performance improvements and cost savings on micro benchmarks and representative production workloads.
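
The idea of keeping each replica partitioned by a different dimension can be illustrated with a toy, in-memory sketch. This is not the Instalytics implementation: the function names (`build_replicas`, `query`) and the column names in the usage note are hypothetical, and recovery is not modeled at all.

```python
from collections import defaultdict


def build_replica(rows, key):
    """Store one full copy of the data, partitioned by a single column."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[key]].append(row)
    return dict(partitions)


def build_replicas(rows, keys):
    """One replica per partitioning column (three keys -> 3-way replication)."""
    return {key: build_replica(rows, key) for key in keys}


def query(replicas, column, value):
    """Return (matching rows, number of rows scanned) for column == value."""
    if column in replicas:
        # Route to the replica partitioned by the filter column:
        # only one partition has to be read.
        partition = replicas[column].get(value, [])
        return list(partition), len(partition)
    # No replica is partitioned by this column: fall back to a full scan.
    any_replica = next(iter(replicas.values()))
    all_rows = [row for part in any_replica.values() for row in part]
    return [row for row in all_rows if row[column] == value], len(all_rows)
```

In this sketch, a filter on any of the three partitioning columns touches a single partition of one replica, whereas a filter on any other column still scans a whole copy.
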
Astra is a compiler and execution engine for speeding up deep learning training jobs. Astra employs co-design by customizing the compilation model to fit the unique properties of DNN training, in particular the remarkable predictability across mini-batch iterations. Thus, instead of using a static performance model to reason about which optimizations to use, Astra defers the choice of optimizations to runtime and uses each mini-batch iteration as an opportunity to explore the optimization space. It uses techniques such as dynamic fine-grained profiling and barriers to prune the large state space of optimizations. We have implemented Astra in two popular deep learning frameworks, PyTorch and TensorFlow. Compared to native PyTorch and TensorFlow, Astra improves the performance of “long tail” models (models that use non-standard operators and hence don’t fit hand-optimized libraries such as cuDNN) by 2.5x to 3x. Compared to TensorFlow XLA, which is a static optimizer, Astra achieves up to 70% speedup.
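
As a rough illustration of deferring the choice of optimization to runtime, the sketch below spends one mini-batch on each candidate variant, records its measured time, and then commits to the fastest. It assumes iteration times are stable from one mini-batch to the next. The names `train` and `wall_clock_measure` are hypothetical, and Astra's fine-grained profiling and state-space pruning are not modeled.

```python
import time


def wall_clock_measure(variant, batch):
    """Run one mini-batch with the given variant and return elapsed seconds."""
    start = time.perf_counter()
    variant(batch)
    return time.perf_counter() - start


def train(variants, batches, measure=wall_clock_measure):
    """Explore each variant on one mini-batch, then reuse the fastest.

    variants: dict mapping a variant name to a callable that runs one batch.
    Returns the variant used at each step and the best time seen per variant.
    """
    names = list(variants)
    best_time = {}
    schedule = []
    for step, batch in enumerate(batches):
        if step < len(names):
            name = names[step]  # exploration: try the next untested variant
        else:
            name = min(best_time, key=best_time.get)  # exploit the fastest
        elapsed = measure(variants[name], batch)
        best_time[name] = min(elapsed, best_time.get(name, elapsed))
        schedule.append(name)
    return schedule, best_time
```

Because every exploration step is also a real training step, no mini-batch is wasted: the only cost of exploring is running some iterations with a slower variant.
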
Gandiva is a co-designed cluster scheduler for running deep learning training jobs. Like Astra, Gandiva leverages the predictability of DNN training jobs to make the scheduler “continuous” and “introspective”. For example, it performs efficient time-slicing by aligning the suspension of a job with the end of a mini-batch, thus reducing the amount of data to checkpoint by 50x. It uses application-aware profiling (mini-batch time) to decide when to migrate jobs and when to pack multiple jobs on the same GPU. Gandiva achieves better cluster utilization and job latency through tight integration between the job scheduler and the deep learning framework (TensorFlow and PyTorch). We have implemented Gandiva as a scheduler in Kubernetes, along with changes to the two frameworks. Gandiva improves cluster utilization by 26% and improves time to initial feedback (latency) by 77%. The time-slicing in Gandiva also speeds up hyper-parameter exploration/AutoML by nearly 6x.
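
Aligning suspension with mini-batch boundaries can be sketched with Python generators: each job yields only at the end of a mini-batch, so a round-robin scheduler can never interrupt it mid-iteration. This is a toy model, not Gandiva's scheduler; `training_job`, `time_slice`, and the `batches_per_slice` quantum are hypothetical names, and checkpointing, migration, and packing are left out.

```python
from collections import deque


def training_job(name, num_batches):
    """A job that exposes a suspension point only at mini-batch boundaries."""
    for batch_index in range(num_batches):
        # A real job would run forward pass, backward pass and update here.
        yield (name, batch_index)  # boundary: safe place to suspend


def time_slice(jobs, batches_per_slice):
    """Round-robin over jobs, switching only after whole mini-batches."""
    queue = deque(jobs)
    trace = []
    while queue:
        job = queue.popleft()
        finished = False
        for _ in range(batches_per_slice):
            try:
                trace.append(next(job))
            except StopIteration:
                finished = True
                break
        if not finished:
            queue.append(job)  # suspended at a boundary; resume later
    return trace
```

For example, with job A running three mini-batches, job B running two, and a slice of two mini-batches, A runs two batches, B runs two, and A then finishes its last one.
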
Quiver is an informed storage cache for deep learning training (DLT) jobs in a cluster of GPUs. Quiver employs domain-specific intelligence within the caching layer to achieve much higher efficiency than a generic storage cache. First, Quiver uses secure hash-based addressing to transparently reuse cached data across multiple jobs, and even multiple users, operating on the same dataset. Second, by co-designing with the deep learning framework (e.g., PyTorch), Quiver employs a technique of substitutable cache hits to get more value from the existing contents of the cache, thus avoiding cache thrashing when cache capacity is much smaller than the working set. Third, Quiver dynamically prioritizes cache allocation to the jobs that benefit the most from caching.
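
The first two ideas can be sketched in a few lines. The sketch below assumes that the cache key is a SHA-256 digest of an item's bytes, and that a job may consume the not-yet-seen items of an epoch in any order, so a requested item can be substituted by one that is already cached. `SharedCache` and `next_batch` are hypothetical names; Quiver's actual protocol, security checks, and cache allocation policy are not modeled.

```python
import hashlib


def digest(data: bytes) -> str:
    """Content address: identical bytes map to the same key for every user."""
    return hashlib.sha256(data).hexdigest()


class SharedCache:
    """A cache keyed by content digest rather than by path or job."""

    def __init__(self):
        self._store = {}

    def put(self, data: bytes) -> str:
        key = digest(data)
        self._store[key] = data
        return key

    def get(self, key: str) -> bytes:
        return self._store[key]

    def __contains__(self, key: str) -> bool:
        return key in self._store


def next_batch(pending, cache, batch_size):
    """Pick a batch from the items still pending in this epoch.

    Items already in the cache are preferred (substitutable hits); misses
    are used only to fill the remainder. `pending` is updated in place.
    """
    hits = [key for key in pending if key in cache]
    misses = [key for key in pending if key not in cache]
    batch = (hits + misses)[:batch_size]
    for key in batch:
        pending.remove(key)
    return batch
```

The selection here is deterministic to keep the example easy to follow; a training job would presumably sample at random among the cached candidates so that batches stay randomized.
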

People

Kaushik Rajan

Principal Researcher

Ramachandran Ramjee

Partner Research Manager

Nipun Kwatra

Principal Researcher

Lidong Zhou

Corporate Vice President, Chief Scientist of Microsoft Asia Pacific R&D Group, Managing Director of Microsoft Research Asia
