Microsoft at NSDI 2026: Advances in large-scale networked systems

Published

By , Partner Research Manager


Large-scale networked systems underpin cloud computing, AI, and distributed applications and services. The USENIX Symposium on Networked Systems Design and Implementation 2026 (NSDI ’26) is a leading forum where researchers and practitioners share new research, insights, and advances in the design and operation of these systems.

Microsoft is proud to support NSDI ’26 as a returning sponsor, reflecting our ongoing commitment to advancing systems and networking research and engaging with the broader community. Microsoft researchers and engineering leaders are also serving on the program committee and in other organizational roles.

This year, 11 papers by Microsoft authors and collaborators were accepted to the conference, spanning datacenter and wide-area networks, AI systems, and cloud infrastructure. Together, they highlight advances in building and operating large-scale networked systems.

Technical sessions

Monday, May 4, 2:00–3:20 PM

DroidSpeak: KV Cache Sharing Across Fine-tuned Model Variants

Yuhan Liu, Yuyang Huang, Jiayi Yao, Zhuohan Gu, Kuntai Du, Hanchen Li, Yihua Cheng, and Junchen Jiang, University of Chicago; Shan Lu, Madan Musuvathi, and Esha Choukse, Microsoft

DroidSpeak enables LLMs with the same architecture to share and partially reuse KV caches across models, delivering up to 4 times higher throughput and faster responses with minimal impact on output quality.

Monday, May 4, 3:50–5:30 PM

Eywa: Automating Model-Based Testing using LLMs

Rajdeep Mondal, Rathin Singha, Todd D. Millstein, and George Varghese, UCLA; Ryan Beckett and Siva Kesava Reddy Kakarla, Microsoft Research

Eywa uses LLMs to automatically build protocol models from natural language sources, enabling model-based testing. It uncovered 33 bugs, including 16 previously unknown, in widely used network protocol implementations.

Tuesday, May 5, 2:00–3:20 PM

Octopus: Enhancing CXL Memory Pods via Sparse Topology

Yuhong Zhong, Columbia University; Fiodar Kazhamiaka, Pantea Zardoshti, Shuwei Teng, and Rodrigo Fonseca, Microsoft Azure; Mark D. Hill, University of Wisconsin-Madison; Daniel S. Berger, Microsoft Azure and University of Washington

Octopus introduces a switch-free design for disaggregated memory pods that reduces cost and scales to multi-rack pods. On a three-server hardware prototype, Octopus RPCs are 3.2x faster than in-rack RDMA and 2.4x faster than CXL switches.

Tuesday, May 5, 3:50–5:30 PM

Arjun Devraj, Cornell University; Bill Owens, NYSERNet; Umesh Krishnaswamy, Microsoft; Ying Zhang, Meta; Rachee Singh, Cornell University

HEDGE mitigates wavelength-specific faults in optical networks by combining link-local and network-wide resilience mechanisms, which together maintain stable capacity and optimize traffic flow despite fluctuating link performance. It matches existing systems’ throughput while reducing network disruptions.

Wednesday, May 6, 9:00–10:20 AM

AVA: Towards Video Analytics with Vision Language Models

Yuxuan Yan, Zhejiang University; Shiqi Jiang, Microsoft Research; Ting Cao, Tsinghua University; Yifan Yang, Microsoft Research; Qianqian Yang and Yuanchao Shu, Zhejiang University; Yuqing Yang and Lili Qiu, Microsoft Research

AVA supports open-ended video analytics by combining event knowledge graphs with agentic retrieval over vision-language models. To evaluate video analytics in ultra-long, open-world scenarios, the authors also introduce AVA-100, a benchmark of eight videos, each exceeding 10 hours, with 120 manually annotated, diverse, and complex question–answer pairs. AVA achieves 75.8% accuracy on this benchmark.

Wednesday, May 6, 9:00–10:20 AM

SmartNIC-Enabled Live Migration for Storage-Optimized VMs with Pyrocumulus

Jiechen Zhao, University of Toronto and Microsoft Research Asia; Ran Shu, Lei Qu, Ziyue Yang, and Rui Ma, Microsoft Research Asia; Derek Chiou, Microsoft and UT Austin; Natalie Enright Jerger, University of Toronto; Peng Cheng and Yongqiang Xiong, Microsoft Research Asia

Pyrocumulus enables fast, low-overhead live migration (LM) for storage-optimized VMs. It exploits the hardware customizability and efficient network accessibility of the FPGA SmartNIC, together with new LM protocol, architecture, and algorithm designs.

Wednesday, May 6, 10:50 AM–12:30 PM

ForestColl: Throughput-Optimal Collective Communications on Heterogeneous Network Fabrics

Liangyu Zhao, University of Washington; Saeed Maleki, Independent Researcher; Yuanhong Wang, Tsinghua University; Zezhou Wang, University of Washington; Ziyue Yang, Microsoft Research; Hossein Pourreza, Microsoft; Arvind Krishnamurthy, University of Washington

ForestColl constructs broadcast/aggregation spanning trees as the communication schedule, achieving theoretically optimal throughput. Its schedule generation runs in polynomial time and is highly scalable. It supports any network fabric, including both switching fabrics and direct accelerator connections.

Wednesday, May 6, 10:50 AM–12:30 PM

Heuristic Analysis from Source Code via Symbolic-Guided Optimization

Pantea Karimi, MIT; Siva Kesava Reddy Kakarla and Ryan Beckett, Microsoft Research; Santiago Segarra, Rice University; Pooria Namyar, Microsoft Research; Mohammad Alizadeh, MIT; Behnaz Arzani, Microsoft Research

MetaEase analyzes heuristics directly from source code to uncover worst-case performance scenarios, eliminating the need for complex formal modeling. It matches or outperforms state-of-the-art analyzers across domains and reveals previously unknown performance gaps in real-world systems.

Wednesday, May 6, 2:00–3:20 PM

Harvesting Spare CPU Resources in Container Systems

Adam Hall and Anirudh Sarma, Georgia Institute of Technology; Esha Choukse, Microsoft Azure Research; Umakishore Ramachandran, Georgia Institute of Technology; Sameh Elnikety, Microsoft Research

HarvestContainers protects latency-sensitive containers from interference while using their spare CPU cores to run latency-tolerant workloads. It dynamically determines how many cores can be safely harvested and requires no changes to applications or the operating system. It enables up to 75% utilization of spare CPU while keeping tail latency within 4% of standalone performance.

Wednesday, May 6, 3:50–5:30 PM

Offloading Cloud Network Services at Production Scale with SONiC DASH SmartSwitch

Community Award Winner

Shaofeng Wu, The Chinese University of Hong Kong and Microsoft Research Asia; Zhixiong Niu, Microsoft Research Asia; Riff Jiang, Lawrence Lee, Junhua Zhai, Ze Gan, Vasundhara Volam, Prabhat Aravind, Prince Sunny, Prince George, Qi Luo, Evan Langlais, Soumya Tiwari, Venkat Satish Katta, Weixi Chen, Rishiraj Hazarika, Sachin Jain, Deven Jagasia, Michal Zygmunt, Avijit Gupta, Neeraj Motwani, and Pranjal Shrivastava, Microsoft; Qiang Su, The Chinese University of Hong Kong; Anil Reddy Pannala, Kristina Moore, James Grantham, Anupam Pandey, Xin Liu, Guohan Lu, Gerald De Grace, Rishabh Tewari, Lihua Yuan, Erica Lan, Deepak Bansal, and Dave Maltz, Microsoft; Yongqiang Xiong, Microsoft Research Asia; Hong Xu, The Chinese University of Hong Kong

SONiC DASH SmartSwitch redesigns cloud network offloading with a hardware-friendly pipeline, unified switch architecture, and open development model while addressing key scalability and deployment challenges. Deployed at scale in Azure, it delivers high throughput and connection capacity while significantly improving power and space efficiency.

Wednesday, May 6, 3:50–5:30 PM

KRAKENGUARD: Towards Fine-Grained eBPF Isolation

Jainil Patel, IIT Roorkee; Lucas Graeff Buhl-Nielsen, Quantco; Adrien Ghosn, Microsoft; Marios Kogias, Imperial College London

KRAKENGUARD enforces fine-grained, policy-based controls on eBPF programs at load time using symbolic execution, enabling safe use in multi-tenant environments without relying on coarse Linux capabilities. It prevents malicious behavior, detects vulnerabilities, and allows for secure execution of untrusted programs with strong isolation guarantees.

Symposium organizers from Microsoft

Program Committee

Ganesh Ananthanarayanan
Behnaz Arzani
Hitesh Ballani
Ryan Beckett
Ranveer Chandra
Paolo Costa
Rodrigo Fonseca
Xenofon Foukas
Kevin Hsieh
Umesh Krishnaswamy
Jing Liu
Jonathan Mace
Dave Maltz
Sathiya Mani
Dushyanth Narayanan
Suman Nath
Ram Ramjee
Stefan Saroiu

Steering Committee

Sujata Banerjee
Jay Lorch

Related publications