Intelligent Overclocking for Improved Cloud Efficiency

AIOps '24 workshop @ ASPLOS [5th International Workshop on Cloud Intelligence / AIOps]

Overclocking compute nodes in datacenters is attracting growing interest because it lets cloud providers temporarily boost performance and absorb transient demand surges more cost-effectively. However, indiscriminate overclocking raises risks such as power capping events that negate the performance gains, accelerated component wear, and spikes in power consumption. This work proposes an intelligent overclocking orchestration framework that continuously monitors workload characteristics, resource utilization, and power telemetry across the infrastructure. It uses modeling techniques to anticipate demand and applies adaptive policies that weigh the performance benefit of each overclocking decision against its associated risks. Preliminary evaluations on production cloud workload traces show that the system reduces power capping incidents by up to 100 times compared to naive approaches. The work also outlines ongoing research directions, including applying reinforcement learning to derive globally optimal overclocking policies that incorporate fairness and reliability objectives, and extending the framework to emerging accelerator technologies.
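
To make the idea of risk-aware overclocking decisions concrete, the following is a minimal sketch, not the framework's actual policy. It assumes a greedy rule that grants overclocking only while the forecast rack power stays below the cap minus a safety margin. The `Node` fields, the `select_overclock` function, and the 5% margin are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Node:
    # All fields are illustrative assumptions, not taken from the paper.
    name: str
    predicted_power_w: float   # forecast draw at nominal frequency
    overclock_delta_w: float   # extra draw expected if overclocked
    expected_benefit: float    # e.g. predicted speedup for the hosted workload


def select_overclock(nodes, rack_cap_w, safety_margin=0.05):
    """Greedily grant overclocking while forecast rack power stays under the cap."""
    # Keep a margin below the cap so forecast error is less likely to trigger capping.
    budget = rack_cap_w * (1.0 - safety_margin)
    headroom = budget - sum(n.predicted_power_w for n in nodes)

    # Prefer nodes with the most expected benefit per extra watt.
    ranked = sorted(
        nodes,
        key=lambda n: n.expected_benefit / max(n.overclock_delta_w, 1e-9),
        reverse=True,
    )

    granted = []
    for n in ranked:
        if n.expected_benefit <= 0:
            continue  # no predicted gain, so the risk is not worth taking
        if n.overclock_delta_w <= headroom:
            granted.append(n.name)
            headroom -= n.overclock_delta_w
    return granted


if __name__ == "__main__":
    rack = [
        Node("a", 300.0, 45.0, 0.10),
        Node("b", 300.0, 40.0, 0.02),
        Node("c", 300.0, 60.0, 0.15),
    ]
    print(select_overclock(rack, rack_cap_w=1000.0))
```

A naive approach would overclock every node that requests it. The sketch instead declines requests whose extra draw would exceed the remaining forecast headroom, which is one simple way to trade performance benefit against capping risk.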