Characterizing Power Management Opportunities for LLMs in the Cloud

Cloud providers and datacenter operators are grappling with massive demand for graphics processing units (GPUs) driven by the surging use of large language models (LLMs). To keep up, enterprises are building new GPU clusters to run LLM workloads, but datacenters worldwide are running into an energy wall. Power oversubscription, that is, adding more servers to existing and upcoming datacenters than a worst-case power provision would allow, could help alleviate this challenge. However, GPU-heavy workloads like LLMs can create power surges that exceed fixed power contracts with utility companies. Careful power-usage analysis and management would let providers oversubscribe power and add more GPU servers to existing datacenters safely and efficiently.
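To make the oversubscription idea concrete, here is a minimal sketch of the underlying arithmetic. All numbers and names are hypothetical illustrations, not figures from the paper: if servers are provisioned at their rare peak draw, a fixed budget admits fewer machines than if they are provisioned closer to their typical draw, with power capping as the safety net for rare surges.

```python
# Toy illustration of power oversubscription under a fixed power budget.
# All values are hypothetical and chosen only to show the arithmetic.

def max_servers(power_budget_kw: float, per_server_kw: float) -> int:
    """Servers that fit if each one is provisioned at per_server_kw."""
    return int(power_budget_kw // per_server_kw)

BUDGET_KW = 1000.0   # fixed utility contract for the cluster (hypothetical)
PEAK_KW = 10.0       # nameplate/peak draw per GPU server (hypothetical)
TYPICAL_KW = 7.5     # typical observed draw under inference (hypothetical)

conservative = max_servers(BUDGET_KW, PEAK_KW)       # provision for worst case
oversubscribed = max_servers(BUDGET_KW, TYPICAL_KW)  # provision for typical draw

print(conservative, oversubscribed)  # 100 vs. 133 servers
# Oversubscription relies on power capping/throttling to handle the rare
# moments when aggregate draw approaches the contracted budget.
```

The gap between the two counts is the headroom an oversubscription framework can reclaim; the engineering challenge, which the paper's analysis targets, is keeping throttling events rare while doing so.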
In a recent paper, "Characterizing Power Management Opportunities for LLMs in the Cloud," researchers from Microsoft analyze power patterns of several popular open-source LLMs across commonly used configurations and identify opportunities to improve power management for LLMs in the cloud. They present POLCA, a framework that enables power oversubscription in LLM inference clouds and is robust, reliable, and readily deployable. Using open-source models to replicate the power patterns observed in production, simulations show that POLCA could support deploying 30% more servers in existing clusters while incurring minimal power-throttling events. POLCA improves power efficiency, reduces the need for additional energy sources and datacenters, and helps providers promptly meet demand for additional LLM workloads.