The motivation behind Systems Innovation research
Systems Innovation is a joint collaboration between Microsoft 365, MSR, and Azure. We leverage our deep workload understanding and combine algorithmic research with AI/ML techniques and hardware innovation to deliver a step-function improvement in operational efficiency and reliability, enabling us to provide best-in-class productivity experiences while meeting our sustainability goals.
In Microsoft 365, we operate one of the largest productivity clouds, and we need to keep pace with paradigm shifts such as the massive growth in AI workloads, the push for sustainability, the need for self-managing cloud environments, and the challenges posed by the end of Moore’s law and Dennard scaling. Hence, we believe that ramping up our investment in systems research and innovation is crucial to our long-term success.
Careers: We are always on the lookout for motivated and dedicated candidates for Researcher, PostDoc, and Internship positions on our team. If you are interested in doing cutting-edge research to make our cloud infrastructure more efficient and reliable, please email us your latest CV.
“Without ongoing innovation across the entire hardware/software stack it would be impossible for Microsoft 365 to meet the evolving communication, collaboration, and productivity needs of our ever-changing societies. That innovation must start with research and improvements in the basic sciences of computing, and that research is needed most at the hardware and systems layers because they are the foundation upon which everything else is anchored.”
Jim Kleewein, Technical Fellow, Microsoft
Latest news
- Career Opportunity: Senior Researcher – Systems and Cloud Intelligence
- Career Opportunity: Research Intern – AI Assisted Software Engineering
- Call for Papers: Cloud Intelligence/AIOps Workshop @ ASPLOS '24
- Career Opportunity: Principal Research Product Manager – Efficient AI
Key focus areas
Hardware: We are investing in novel hardware architectures to maximize performance, reduce costs, and enable new capabilities. Given our year-over-year growth in data and the proliferation of AI scenarios leveraging deep learning, this effort remains vital to our overall AI strategy and aims to offer an industry-best TCO advantage.
Algorithms: We are moving toward adaptive, continually learning systems that consider the historical workload patterns and system behaviors in our clusters. We want the system to dynamically adapt its policies to the workload, often optimizing across multiple constraints, rather than relying on defaults that discard the rich context available in our environments and frequently lead to suboptimal decisions affecting our cluster utilization and reliability. We employ context-aware modeling, prediction-based approaches, and advanced causal inference techniques to optimally manage our clusters' workloads and resources without degrading their health.
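As a toy illustration of the prediction-based idea (a minimal sketch, not the team's actual system; the class name, window size, and headroom factor below are invented for this example), a scaler can replace a static default capacity with a forecast derived from recent workload history:

```python
from collections import deque


class PredictiveScaler:
    """Toy sketch: recommend cluster capacity from recent demand
    history instead of a context-free static default."""

    def __init__(self, window: int = 4, headroom: float = 1.2):
        self.history = deque(maxlen=window)  # recent demand samples
        self.headroom = headroom             # safety margin over the forecast

    def observe(self, demand: float) -> None:
        """Record one observed demand sample."""
        self.history.append(demand)

    def recommend_capacity(self, default: float) -> float:
        """Return the forecast-driven capacity, falling back to the
        static default until any history is available."""
        if not self.history:
            return default
        forecast = sum(self.history) / len(self.history)
        return max(default, forecast * self.headroom)


scaler = PredictiveScaler()
for demand in [100, 120, 110, 130]:
    scaler.observe(demand)
print(scaler.recommend_capacity(default=50.0))  # 115 * 1.2 = 138.0
```

A production policy would use richer context (seasonality, per-tenant signals, reliability constraints), but even this moving-average forecast shows how historical patterns can replace a one-size-fits-all default.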
AI/ML techniques: We explore novel and efficient model architectures to meet both the massive growth in AI inferencing demand and the exponential increase in model capacity.
Active projects
Here is a glimpse of the research work happening under each of these areas:
Hardware
- Sustainable and cost-efficient cooling from liquid immersion (Zissou)
- Custom AI silicon for lowest-cost, energy efficient and performant AI inferencing at scale
- Analog-optical computing to improve efficiency & sustainability of AI and optimization workloads
- Low-cost, power-efficient, and highly-reliable short-reach optical transceivers for data centers and AI clusters
Algorithms
- Efficient bin-packing, workload shaping, and global optimization for cloud efficiency
- Accelerating developer velocity and enhancing reliability by automated code reviews, safe deployment, triage, and RCA
- Workload modeling and characterization to enable platform optimizations
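To make the bin-packing item above concrete, here is a minimal first-fit-decreasing sketch (a classic heuristic shown purely for illustration; it is an assumption for this example, not the team's production placement algorithm) that packs workload sizes onto the fewest servers of fixed capacity:

```python
def first_fit_decreasing(loads: list[float], capacity: float) -> int:
    """Pack workloads onto servers (bins) of the given capacity:
    sort loads in decreasing order, then place each on the first
    server with enough free room, opening a new server if none fits.
    Returns the number of servers used."""
    free = []  # remaining free capacity of each open server
    for load in sorted(loads, reverse=True):
        for i, room in enumerate(free):
            if load <= room:
                free[i] -= load
                break
        else:
            # No existing server fits; open a new one.
            free.append(capacity - load)
    return len(free)


print(first_fit_decreasing([5, 7, 5, 2, 4, 2, 5], capacity=10))  # 4
```

First-fit-decreasing is a simple baseline; cloud-scale placement additionally juggles multiple resource dimensions, failure domains, and churn, which is where the global-optimization work comes in.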
AI/ML techniques
- Automated incident root-causing and mitigation using large language models
- Efficient inference via multi-scenario fleet optimization, DNN Inference optimization, and model innovation
- Proactive hardware failure prediction and mitigation using RL and multi-task learning
We are barely scratching the surface of what is possible when combining cutting-edge algorithmic research, state-of-the-art AI/ML techniques, and hardware innovation. We strongly believe this multi-faceted approach will help our infrastructure and services adapt to these paradigm shifts and enable them to deliver best-in-class productivity experiences.