Boosting Cloud Efficiency: Harnessing Data-Driven Decision-Making and Optimization Techniques

Si Qin, Principal Research Manager; Fangkai Yang, Senior Researcher; Rujia Wang, Principal Research PM; Qingwei Lin, Partner Research Manager; Saravan Rajmohan, Partner Director AI and Applied Research; and Dongmei Zhang, Distinguished Scientist and Vice President


Introduction

Microsoft’s cloud system serves as the backbone for the daily operations of hundreds of thousands of organizations, driving productivity and collaboration. The foundational infrastructure demands both high reliability and efficiency. In our last system innovation research blog post, we delved into our recent work on bringing AI into service reliability scenarios, where we aim to achieve continuous availability through AI-assisted automation tools. In this blog, we explore recent innovations that continually enhance hyper-scale cloud capacity efficiency, delivering substantial operational cost savings for our customers. 

Efficient cloud operation is pivotal for both Microsoft and our customers. On one hand, cloud engineers need to ensure that we maximize our capacity utilization without compromising reliability, availability, and sustainability; on the other hand, our customers seek performant and cost-effective cloud services to build scalable solutions atop our infrastructure.  However, optimizing cloud efficiency is a multifaceted challenge due to various factors: 

Fluctuation and Uncertainty. Workload and computing resource demands fluctuate over time due to daily and weekly cycles, overall cloud market trends, and sudden spikes or outages. These variations lead to utilization peaks and valleys, requiring the cloud platform to secure sufficient capacity for peak hours. Such dynamic signals make it very difficult for engineers and customers to optimize efficiency with traditional decision-making techniques.

Multi-Constraints, Multi-Objectives, and Multi-Parameters. Each workload deployment and operation must satisfy multiple constraints from customers and the platform. The platform needs to balance multiple competing objectives, such as capacity utilization, service performance, and availability. Management tasks involve tuning various parameters and thresholds for different objectives.  

North Star: Proactive data-driven decision-making for cloud capacity management 

We envision that highly efficient and reliable cloud platforms of the future will need a proactive design, one that takes the future status of the system into account in the decision-making process. This forward-looking approach relies on data-driven models to predict the future status of cloud platforms, laying the foundation for downstream proactive decision-making.

The CapacityInsider framework (Figure 1) shows the overall strategies to bring data-driven approaches into proactive cloud capacity management. This framework helps facilitate automated, informed, and forward-looking decision-making in cloud capacity management. The framework focuses on four key areas:  

  • Detecting Allocation Failure Issues: By identifying allocation failure issues, we fortify the system against potential disruptions, enhancing its resilience and reliability. 
  • Diagnosing Root Causes and Bottlenecks: The framework delves into diagnosing root causes and bottlenecks for allocation failures, providing insights crucial for optimizing performance and addressing inefficiencies. 
  • Predicting Future Workloads and System Status: Through advanced predictive modeling, the framework anticipates future workloads and system status, empowering the system to proactively adapt to changing demands. 
  • Optimizing Decision-Making: The capacity management system’s decision-making processes are fine-tuned and optimized, ensuring efficiency across varied objectives such as capacity utilization, service performance, and availability. 

By addressing these challenges, we aim to develop a comprehensive, automated, AI-driven, and proactive capacity management system. This system not only reacts to current demands but anticipates and aligns with future needs, elevating the efficiency and reliability of Microsoft’s cloud infrastructure to new heights. 

Figure 1: CapacityInsider: Data-driven Decision-making for Efficient Cloud Capacity Management 

In the subsequent sections, we present two cases where data-driven decision-making techniques help with cloud resource optimization and can potentially reduce our customers’ costs.

  • In the first study, we show how spot and on-demand VMs can be combined to improve overall utilization without impacting service reliability.
  • In the second study, we discuss how to effectively bin-pack container workloads through chance-constrained optimization algorithms.  

Case Study: Mixture of Spot and On-demand VMs for Low-cost Computing 

Cloud providers usually ensure service availability and reliability by allocating sufficient computing resources, so that it appears to customers that resources are never in short supply. This often leads to underutilized resources, and cloud providers are motivated to monetize the excess capacity by selling it at discounted prices that compensate for its reduced availability. Example offerings include spot virtual machines (VMs), spot block VMs, and harvest VMs. The trade-off between cost-effectiveness and resource availability often creates a dilemma for users. On the one hand, some workloads, such as machine learning training and inference, caching, and big data processing, can potentially run at reduced reliability for lower cost. On the other hand, cloud providers share some information about resource availability, such as estimated eviction rates for spot VMs at the region level, but do not guarantee that availability.

One natural way to protect services against spot evictions is to deploy them in a VM Scaling Group (VMSG) with both spot and on-demand VMs, so that there are still enough on-demand VMs to support the services even if all spot instances are evicted. However, existing solutions either support only a static mixture ratio of spot and on-demand VMs or allocate spot instances in a greedy fashion, which can be less cost-effective when the eviction rate is low.

In our recent ASPLOS’23 paper, “Snape: Reliable and Low-Cost Computing with Mixture of Spot and On-Demand VMs”, we propose an intelligent framework that optimizes customer cost while maintaining resource availability by dynamically mixing on-demand VMs with spot VMs. Snape (Spot On-demand Perfect Mixture) is composed of a reliable model for predicting the eviction rate of spot VMs from production traces and an intelligent constrained reinforcement learning (CRL) framework for learning the best mixture policy, given the predicted eviction rate and other service signals. We first characterize the eviction behaviors of spot VMs by examining traces collected over three months from a production cloud system. Then, to better learn the eviction behaviors and achieve optimal decisions in the long term, we employ the CRL framework: if the number of in-service VMs drops below the target value for a short period, the CRL agent takes more conservative actions in its future decisions to ensure that SLOs are not violated. Figure 2 shows the overall framework of Snape.

Figure 2: Overview of the Snape framework

This proactive design enables an online decision-making system that dynamically adjusts the mixture of on-demand and spot VMs and ensures that a more aggressive and cheaper policy is adopted only when reliability is high (that is, when the predicted eviction rate of spot VMs is low). Experiments across different configurations show that Snape achieves 44% savings compared to the policy of using only on-demand VMs while maintaining 99.96% availability, 2.77% higher than with a policy of using only spot VMs.
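To make the idea of an eviction-aware mixture concrete, here is a minimal Python sketch. It is not Snape's CRL policy: it swaps in a simple hand-written rule that shrinks the spot share linearly as the predicted eviction rate rises and over-provisions spot VMs to offset expected evictions. The names `choose_mixture`, `max_spot_fraction`, and `eviction_rate_cutoff`, along with their default values, are assumptions made for illustration only.

```python
import math
from dataclasses import dataclass


@dataclass
class Mixture:
    on_demand: int  # on-demand VMs to run
    spot: int       # spot VMs to request


def choose_mixture(target_vms: int,
                   predicted_eviction_rate: float,
                   max_spot_fraction: float = 0.8,      # assumed policy knob
                   eviction_rate_cutoff: float = 0.2    # assumed policy knob
                   ) -> Mixture:
    """Pick a spot/on-demand split for the next decision window.

    Hypothetical rule of thumb, not the learned policy from the paper:
    the share of the target served by spot shrinks linearly from
    max_spot_fraction (eviction rate 0) to zero (eviction rate >= cutoff).
    """
    if target_vms < 0:
        raise ValueError("target_vms must be non-negative")
    if not 0.0 <= predicted_eviction_rate <= 1.0:
        raise ValueError("predicted_eviction_rate must be in [0, 1]")
    if not 0.0 < eviction_rate_cutoff <= 1.0:
        raise ValueError("eviction_rate_cutoff must be in (0, 1]")

    # 1.0 when no evictions are predicted, 0.0 at or beyond the cutoff.
    safety = max(0.0, 1.0 - predicted_eviction_rate / eviction_rate_cutoff)
    spot_share = max_spot_fraction * safety

    # In-service VMs we want spot to contribute; on-demand covers the rest.
    spot_in_service = math.floor(target_vms * spot_share)
    on_demand = target_vms - spot_in_service

    # Over-provision spot so the expected number of survivors still covers
    # spot_in_service. When spot_in_service is 0 we skip the division, which
    # also avoids dividing by zero at an eviction rate of 1.0.
    spot = 0
    if spot_in_service > 0:
        spot = math.ceil(spot_in_service / (1.0 - predicted_eviction_rate))

    return Mixture(on_demand=on_demand, spot=spot)
```

With these assumed defaults, a low predicted eviction rate yields a spot-heavy and cheaper mixture, while any prediction at or above the cutoff falls back to on-demand VMs only. A learned policy such as the one described above would replace the fixed linear rule with decisions that also account for long-term SLO constraints.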

Case Study: Workload Bin-packing with a Chance-constrained Optimization Approach

In today’s digital landscape, many large companies operate their services using containers and employ Kubernetes-like systems for container orchestration and resource management on modern cloud platforms. One significant challenge these platforms face is container scheduling. To optimize utilization, the platform consolidates multiple containers onto a single machine, where the combined maximum resources required by the containers may surpass the machine’s capacity. However, this consolidation introduces the risk of machine resource violations, which can lead to container performance degradation or even service unavailability.  
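The risk introduced by consolidation can be expressed as a chance constraint: the probability that the combined usage exceeds the machine's capacity must stay below a small threshold. The sketch below checks that constraint under a simplifying assumption that is ours, not the source's: each container's usage is an independent Gaussian with a known mean and standard deviation. The function names are hypothetical.

```python
import math
from statistics import NormalDist


def violation_probability(means, stds, capacity):
    """P(total usage > capacity), assuming independent Gaussian usage."""
    total_mean = sum(means)
    # Variances of independent variables add; the standard deviation does not.
    total_std = math.sqrt(sum(s * s for s in stds))
    if total_std == 0.0:
        # Deterministic usage: the constraint is either always or never violated.
        return 0.0 if total_mean <= capacity else 1.0
    return 1.0 - NormalDist(total_mean, total_std).cdf(capacity)


def fits_with_confidence(means, stds, capacity, confidence=0.999):
    """True if the machine stays within capacity with at least `confidence`."""
    return violation_probability(means, stds, capacity) <= 1.0 - confidence
```

For example, three containers with mean usage 4 and standard deviation 1 on a 16-core machine have a violation probability of roughly 1%. That placement would be acceptable at a 95% confidence level but not at 99.9%, even though reserving for each container's peak might reject it outright.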

Container scheduling can be naturally modeled as the Stochastic Bin Packing Problem (SBPP), which optimizes resource utilization while keeping violations below a desired low level. Much research on SBPP assumes that all machines (also referred to as bins) are empty before allocation, and evaluates approaches by the number of bins used. In practice, however, the total resources required by a service change diurnally and weekly. For example, Figure 3 shows the diverse CPU usage of three different services and their patterns across one week. As a result, services often allocate and delete a batch of containers daily to increase resource utilization. On the platform side, most machines already host a few containers when new allocations arrive. In cases where non-empty machines can accommodate all or most of the requested containers, the previous metric, i.e., the number of bins used, fails to differentiate the effectiveness of allocation strategies.

Figure 3: Diverse CPU core usage across three services

In our recent SIGKDD’22 paper, “Solving the Batch Stochastic Bin Packing Problem in Cloud: A Chance-constrained Optimization Approach”, we introduce a new optimization metric, Used Capacity at Confidence (UCaC), and propose a unified problem formulation for the SBPP that accommodates both empty and non-empty machines. Furthermore, we design heuristic and cutting-stock-based exact algorithms to solve the problem. Extensive experiments on both synthetic and real cloud traces demonstrate that our UCaC-based optimization methodology outperforms existing approaches that focus on optimizing the number of machines used. Specifically, we have taken the first step towards addressing the stochastic bin packing problem on non-empty machines, which is a crucial issue in cloud resource scheduling.
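To illustrate why a capacity-at-confidence metric can separate allocations that a bin count cannot, the sketch below uses one plausible reading of UCaC, which is our assumption and not the paper's exact definition: for each used machine, take the capacity that its total usage stays within at the chosen confidence level, and sum that over machines. Container usage is again assumed to be independent and Gaussian, and `used_capacity_at_confidence` is a hypothetical name.

```python
import math
from statistics import NormalDist


def used_capacity_at_confidence(machines, confidence=0.999):
    """Sum over machines of the `confidence` quantile of total usage.

    `machines` is a list of machines; each machine is a list of
    (mean, std) pairs, one per hosted container. Assumed definition
    for illustration, under independent Gaussian usage.
    """
    z = NormalDist().inv_cdf(confidence)
    total = 0.0
    for containers in machines:
        if not containers:
            continue  # an empty machine uses no capacity
        mean = sum(m for m, _ in containers)
        std = math.sqrt(sum(s * s for _, s in containers))
        total += mean + z * std
    return total


# Two steady and two bursty containers, packed two different ways.
steady = (4.0, 0.5)
bursty = (4.0, 2.0)
grouped = [[steady, steady], [bursty, bursty]]
mixed = [[steady, bursty], [steady, bursty]]

# Both allocations use the same number of machines ...
same_bin_count = len(grouped) == len(mixed)
# ... but they differ in capacity used at the 99.9% confidence level.
grouped_ucac = used_capacity_at_confidence(grouped)
mixed_ucac = used_capacity_at_confidence(mixed)
```

Under these assumptions, the two allocations are indistinguishable by the number of bins used, yet `grouped_ucac` is lower than `mixed_ucac`, so a quantile-based metric gives the optimizer a signal that the bin count does not.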

We assessed the proposed methods using real trace data from a first-party application. The dataset comprises 17 primary services and over 10,000 containers during peak times. We set the UCaC confidence level to 99.9% and evaluated the methods using two metrics: average number of machines and total violations. Compared to the best fit method, our proposed heuristic and cutting-stock-based approaches achieved a 1.3% reduction in the number of machines used and up to a 23% reduction in violations.
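For readers unfamiliar with the best fit baseline, the following is a minimal sketch of a chance-constrained variant that works on non-empty machines. It is a generic illustration, not the implementation evaluated in the paper, and it assumes independent Gaussian container usage described by hypothetical (mean, std) pairs. Each new container goes to the feasible machine that is left with the least slack at the chosen confidence level; a new machine is opened only when no existing one fits.

```python
import math
from statistics import NormalDist


def quantile_usage(containers, z):
    """Usage level a machine stays within at the confidence implied by z."""
    mean = sum(m for m, _ in containers)
    std = math.sqrt(sum(s * s for _, s in containers))
    return mean + z * std


def best_fit(machines, new_containers, capacity, confidence=0.999):
    """Place containers one by one; returns the machine index for each.

    `machines` is a list of lists of (mean, std) pairs and is modified
    in place. Machines may already host containers when this is called.
    This sketch does not check that a container fits on an empty machine.
    """
    z = NormalDist().inv_cdf(confidence)
    placement = []
    for container in new_containers:
        best_index, best_slack = None, None
        for index, hosted in enumerate(machines):
            slack = capacity - quantile_usage(hosted + [container], z)
            # Feasible if the chance constraint holds; prefer the tightest fit.
            if slack >= 0 and (best_slack is None or slack < best_slack):
                best_index, best_slack = index, slack
        if best_index is None:
            machines.append([container])  # no machine fits: open a new one
            best_index = len(machines) - 1
        else:
            machines[best_index].append(container)
        placement.append(best_index)
    return placement


# Two machines that already host one container each, then three arrivals.
cluster = [[(6.0, 1.0)], [(2.0, 0.5)]]
arrivals = [(4.0, 1.0), (4.0, 1.0), (10.0, 1.0)]
result = best_fit(cluster, arrivals, capacity=16.0)
```

In this example the first arrival lands on the fuller machine, the second no longer fits there and goes to the other machine, and the third forces a new machine to be opened.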

Conclusions 

Microsoft Cloud hosts a wide range of workloads with varying characteristics, including size, configuration, usage patterns, resources, and SLA requirements. Specialized workloads, such as GPU-intensive AI tasks, require dedicated computing clusters and specialized SLA management. Evaluating infrastructure and workloads together is crucial for optimizing cloud efficiency and capacity decisions.  

By adopting advanced data-driven methodologies, our research can help throughout all phases of cloud capacity management and achieve a proactive and self-sustaining cloud environment.  


Acknowledgement

We would like to thank our collaborators and contributors to the research work: Yixin Fang, Silvia Yu, Terry Yang, Soumya Ram, Zhen Ma, Íñigo Goiri, Eli Cortez, Thomas Moscibroda, Ricardo Bianchini, Lu Wang, Jue Zhang, Liqun Li, Bo Qiao, Camille Couturier, Victor Rühle, and Chetan Bansal.