Predict; Don’t React for Enabling Efficient Fine-Grain DVFS in GPUs

Srikant Bharadwaj; Shomit Das; K. Mazumdar; Bradford M. Beckmann; Stephen Kosonocky

Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems | March 2024

Organized by ACM

With the continuous improvement of on-chip integrated voltage regulators (IVRs) and fast, adaptive frequency control, dynamic voltage-frequency scaling (DVFS) transition times have shrunk from the microsecond to the nanosecond regime, providing immense opportunity to improve energy efficiency. The key to unlocking the continued improvement in V/f circuit technology is the creation of new, smarter DVFS mechanisms that better adapt to rapid fluctuations in workload demand.

It is particularly important to optimize fine-grain DVFS mechanisms for graphics processing units (GPUs) as these chips become ever more important workhorses in the datacenter. However, a GPU's massive thread-level parallelism makes it uniquely difficult to determine the optimal V/f state at run-time. Existing solutions, mostly designed for single-threaded CPUs and longer time scales, fail to consider the seemingly chaotic, highly varying nature of GPU workloads at short time scales.

This paper proposes a novel prediction mechanism, PCSTALL, that is tailored for emerging DVFS capabilities in GPUs and achieves near-optimal energy efficiency. Using the insights from our fine-grained workload analysis, we propose a wavefront-level program counter (PC) based DVFS mechanism that improves program behavior prediction accuracy by 32% on average compared to the best-performing prior predictor for a wide set of GPU applications at 1 μs DVFS time epochs. Compared to the current state of the art, our PC-based technique achieves a 19% average improvement when optimized for Energy-Delay² Product (ED²P) at 50 μs time epochs, reaching 32% when operated with 1 μs DVFS technologies.
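
To make the ED²P objective and the idea of PC-indexed prediction concrete, here is a minimal Python sketch. It is not the PCSTALL mechanism itself, whose details the abstract does not give. The V/f table, the analytical delay and energy model, the `PCStallTable` class, and every name below are hypothetical and chosen only for illustration. The only fact taken as given is the definition of ED²P as energy multiplied by delay squared.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VFState:
    freq_ghz: float
    volts: float


# Hypothetical V/f operating points (assumption, not taken from the paper).
STATES = [VFState(1.0, 0.70), VFState(1.5, 0.85), VFState(2.0, 1.00)]
F_REF = max(s.freq_ghz for s in STATES)


def ed2p(energy: float, delay: float) -> float:
    """Energy-Delay^2 Product: energy times delay squared (lower is better)."""
    return energy * delay ** 2


def modeled_ed2p(state: VFState, stall_frac: float) -> float:
    """Toy model (assumption): stalled time does not scale with frequency,
    the remaining time scales as 1/f, and power scales as V^2 * f."""
    delay = stall_frac + (1.0 - stall_frac) * F_REF / state.freq_ghz
    energy = state.volts ** 2 * state.freq_ghz * delay
    return ed2p(energy, delay)


def best_state(stall_frac: float) -> VFState:
    """Pick the V/f state that minimizes modeled ED2P for the next epoch."""
    return min(STATES, key=lambda s: modeled_ed2p(s, stall_frac))


class PCStallTable:
    """Hypothetical last-value predictor indexed by program counter."""

    def __init__(self, default: float = 0.5):
        self.default = default
        self.table: dict[int, float] = {}

    def predict(self, pc: int) -> float:
        # Unseen PCs fall back to the default stall fraction.
        return self.table.get(pc, self.default)

    def update(self, pc: int, observed_stall_frac: float) -> None:
        # Remember the stall fraction last observed at this PC.
        self.table[pc] = observed_stall_frac
```

In this toy model, a predicted stall fraction near 1 (memory-bound) selects the lowest V/f state, while a stall fraction near 0 (compute-bound) selects the highest. That illustrates why an accurate per-epoch prediction matters when the V/f state can change every microsecond.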