
Efficient AI

Making Azure’s big bet possible

Recent innovations in generative large language models (LLMs) have made their applications and use cases ubiquitous. This has led to large-scale deployments of these models on complex, expensive, and power-hungry AI accelerators, most commonly GPUs. These developments make LLM training and inference efficiency an important challenge.

In the Azure Research – Systems group, we are working on improving Azure infrastructure, including hardware, power, and serving.

Some of the work we have done so far:

  • Splitwise: Efficient generative LLM inference using phase splitting

    Generative large language model (LLM) applications are growing rapidly, leading to large-scale deployments of expensive and power-hungry GPUs. Our characterization of LLM inference shows that each inference request undergoes two phases: a compute-intensive prompt computation phase and a memory-intensive token generation phase, each with distinct latency, throughput, memory, and power characteristics. Despite state-of-the-art batching […] (A toy sketch of these two phases appears after this list.)

  • POLCA: Power Oversubscription in LLM Cloud Providers

    Recent innovations in large language models (LLMs) and their myriad use cases have rapidly driven up the compute capacity demand for datacenter GPUs. Several cloud providers and other enterprises have made substantial plans to grow their datacenters to support these new workloads. One of the key bottleneck resources in datacenters is power, and given the […]
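To make the Splitwise phase split concrete, here is a minimal, hypothetical Python sketch of the two phases each inference request goes through. `TinyLM`, `KVCache`, and the fake token arithmetic are stand-ins invented purely for illustration; this is not the Splitwise implementation or a real model. The prefill step processes the whole prompt in one parallel, compute-heavy pass, while every decode step emits a single token but reads the entire KV cache, which is what makes token generation memory-intensive.

```python
# Toy sketch (not the Splitwise implementation) of the two LLM inference
# phases it characterizes. TinyLM and KVCache are hypothetical stand-ins.

from dataclasses import dataclass, field

@dataclass
class KVCache:
    # Grows by one entry per processed token; its size is what makes the
    # token-generation phase memory-intensive.
    keys: list = field(default_factory=list)
    values: list = field(default_factory=list)

class TinyLM:
    def prefill(self, prompt_tokens, cache: KVCache):
        # Prompt computation: all prompt tokens are processed in one
        # large, highly parallel pass -> compute-intensive.
        for t in prompt_tokens:
            cache.keys.append(t)
            cache.values.append(t)
        return prompt_tokens[-1]          # pretend logits -> first new token

    def decode_step(self, last_token, cache: KVCache):
        # Token generation: one token per step, but each step reads the
        # whole KV cache -> memory-bandwidth-bound, little compute per byte.
        cache.keys.append(last_token)
        cache.values.append(last_token)
        return (last_token + 1) % 50257   # fake "next token"

def generate(model: TinyLM, prompt_tokens, max_new_tokens=8):
    cache = KVCache()
    next_tok = model.prefill(prompt_tokens, cache)      # phase 1: prompt computation
    out = [next_tok]
    for _ in range(max_new_tokens - 1):                 # phase 2: token generation
        next_tok = model.decode_step(out[-1], cache)
        out.append(next_tok)
    return out

if __name__ == "__main__":
    print(generate(TinyLM(), prompt_tokens=[10, 20, 30]))
```

Because the two phases stress the hardware so differently, as the abstract notes, they lend themselves to being scheduled and provisioned separately, which is the idea behind phase splitting.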