The motivation behind Systems Innovation research
Systems Innovation is a joint collaboration between Microsoft 365, MSR, and Azure. We leverage our deep workload understanding and combine algorithmic research with AI/ML techniques and hardware innovation to deliver a step-function improvement in operational efficiency and reliability, enabling us to provide best-in-class productivity experiences while meeting our sustainability goals.
In Microsoft 365, we operate one of the largest productivity clouds, and we need to keep pace with paradigm shifts such as the massive growth in AI workloads, the push for sustainability, the need for self-managing cloud environments, and the challenges posed by the end of Moore’s law and Dennard scaling. Hence, we believe that ramping up our investment in systems research and innovation is crucial to our long-term success.
Careers: We are always on the lookout for motivated and dedicated candidates for Researcher, PostDoc, and Internship positions on our team. If you are interested in doing cutting-edge research to make our cloud infrastructure more efficient and reliable, please email us your latest CV.
“Without ongoing innovation across the entire hardware/software stack it would be impossible for Microsoft 365 to meet the evolving communication, collaboration, and productivity needs of our ever-changing societies. That innovation must start with research and improvements in the basic sciences of computing, and that research is needed most at the hardware and systems layers because they are the foundation upon which everything else is anchored.”
Jim Kleewein, Technical Fellow, Microsoft
Latest news
- Career Opportunity: Senior Researcher – Systems and Cloud Intelligence
- Career Opportunity: Research Intern – AI Assisted Software Engineering
- Call for Papers: Cloud Intelligence/AIOps Workshop @ ASPLOS '24
- Career Opportunity: Principal Research Product Manager – Efficient AI
Key focus areas
Hardware: We are investing in novel hardware architectures to maximize performance, reduce costs, and enable new capabilities. Given our year-over-year growth in data and the proliferation of AI scenarios leveraging deep learning, this effort remains vital to our overall AI strategy and aims to offer an industry-best TCO advantage.
Algorithms: We are moving toward adaptive, continually learning systems that consider the historical workload patterns and system behaviors in our clusters. We want the system to dynamically adapt its policies to the workload, often optimizing across multiple constraints, rather than relying on defaults that discard the rich context available in our environments and frequently lead to suboptimal decisions affecting our cluster utilization and reliability. We employ context-aware modeling, prediction-based approaches, and advanced causal inference techniques to optimally manage our clusters' workloads and resources without degrading their health.
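As a toy illustration of the prediction-based idea (a minimal sketch, not the team's actual system; the class name, window size, and headroom factor below are invented for this example), a scaler can replace a static default capacity with a forecast derived from recent workload history:

```python
from collections import deque


class PredictiveScaler:
    """Toy sketch: recommend cluster capacity from recent demand
    history instead of a context-free static default."""

    def __init__(self, window: int = 4, headroom: float = 1.2):
        self.history = deque(maxlen=window)  # recent demand samples
        self.headroom = headroom             # safety margin over the forecast

    def observe(self, demand: float) -> None:
        """Record one observed demand sample."""
        self.history.append(demand)

    def recommend_capacity(self, default: float) -> float:
        """Return the forecast-driven capacity, falling back to the
        static default until any history is available."""
        if not self.history:
            return default
        forecast = sum(self.history) / len(self.history)
        return max(default, forecast * self.headroom)


scaler = PredictiveScaler()
for demand in [100, 120, 110, 130]:
    scaler.observe(demand)
print(scaler.recommend_capacity(default=50.0))  # 115 * 1.2 = 138.0
```

A production policy would use richer context (seasonality, per-tenant signals, reliability constraints), but even this moving-average forecast shows how historical patterns can replace a one-size-fits-all default.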
AI/ML techniques: We explore novel and efficient model architectures to meet both the massive growth in AI inferencing demand and the exponential increase in model capacity.
Active projects
Here is a glimpse of the research work happening under each of these areas:
Hardware
- Sustainable and cost-efficient cooling from liquid immersion (Zissou)
- Custom AI silicon for lowest-cost, energy efficient and performant AI inferencing at scale
- Analog-optical computing to improve efficiency & sustainability of AI and optimization workloads
- Low-cost, power-efficient, and highly-reliable short-reach optical transceivers for data centers and AI clusters
Algorithms
- Efficient bin-packing, workload shaping, and global optimization for cloud efficiency
- Accelerating developer velocity and enhancing reliability by automated code reviews, safe deployment, triage, and RCA
- Workload modeling and characterization to enable platform optimizations
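To make the bin-packing item above concrete, here is a minimal first-fit-decreasing sketch (a classic heuristic shown purely for illustration; it is an assumption for this example, not the team's production placement algorithm) that packs workload sizes onto the fewest servers of fixed capacity:

```python
def first_fit_decreasing(loads: list[float], capacity: float) -> int:
    """Pack workloads onto servers (bins) of the given capacity:
    sort loads in decreasing order, then place each on the first
    server with enough free room, opening a new server if none fits.
    Returns the number of servers used."""
    free = []  # remaining free capacity of each open server
    for load in sorted(loads, reverse=True):
        for i, room in enumerate(free):
            if load <= room:
                free[i] -= load
                break
        else:
            # No existing server fits; open a new one.
            free.append(capacity - load)
    return len(free)


print(first_fit_decreasing([5, 7, 5, 2, 4, 2, 5], capacity=10))  # 4
```

First-fit-decreasing is a simple baseline; cloud-scale placement additionally juggles multiple resource dimensions, failure domains, and churn, which is where the global-optimization work comes in.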
AI/ML techniques
- Automated incident root-causing and mitigation using large language models
- Efficient inference via multi-scenario fleet optimization, DNN Inference optimization, and model innovation
- Proactive hardware failure prediction and mitigation using RL and multi-task learning
We are barely scratching the surface of what is possible when combining cutting-edge algorithmic research, state-of-the-art AI/ML techniques, and hardware innovation. We strongly believe this multi-faceted approach will help our infrastructure and services adapt to these paradigm shifts and enable them to deliver best-in-class productivity experiences.