
Networking Infrastructure Group

Next-Gen Data Center Infrastructure

  • Hardware-assisted Virtualization Disaggregation
    • Terminus: The key idea of Terminus is to organize all kinds of resources into a unified resource pool in the cloud, with all resources partitioned and virtualized. In this way, cloud providers can improve resource utilization in their public clouds, while customers experience bare-metal performance with the highest security level.
    • DUA: DUA is a unified framework that lets FPGA applications access all resources in the data center without CPU involvement, by leveraging existing FPGA and networking infrastructure. With DUA, FPGA applications get a unified address format and a single set of communication APIs to access all resources, regardless of location (remote or local) or the type of the target device (CPU, GPU, FPGA DRAM, server DRAM, SSD, etc.).
    • S-Direct: S-Direct is a solution for performant and transparent flash storage disaggregation. Built on an FPGA-based SmartNIC, the system depends on no remote resources other than the storage itself and network bandwidth. It introduces a novel approach that offloads the NVMe data path to SmartNICs, providing remote access latency almost identical to local access. It also supports an efficient and precise QoS scheduling technique that minimizes the performance impact on shared flash. S-Direct is currently deployed in real-world applications.
    • SDR: Software-defined rack (SDR) is a research sandbox for rack-scale disaggregation. We designed and assembled the SDR hardware, which uses a commodity PCIe/CXL switch network to enable resource disaggregation at rack scale while ensuring microsecond or sub-microsecond access latencies. In addition, we aim to design and develop a system solution for the SDR hardware, called RackOS, to address many disaggregation challenges, such as runtime resource allocation/deallocation, transparent fail-over, and fine-grained QoS.
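The unified addressing idea behind DUA can be sketched as follows. This is a hypothetical illustration of the concept only — the field names, types, and routing logic are illustrative assumptions, not DUA's actual wire format or API:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class UnifiedAddress:
    """Hypothetical DUA-style address: one format for every resource in the data center."""
    server_id: int     # which physical server hosts the resource
    device_type: str   # e.g. "cpu_dram", "fpga_dram", "gpu", "ssd"
    offset: int        # address within the target device

def route(src_server: int, dst: UnifiedAddress) -> str:
    """Pick the communication path for an access, hiding locality from the application."""
    if dst.server_id == src_server:
        return "local"    # same server, e.g. over the local PCIe fabric
    return "network"      # crosses the data-center network

# The FPGA application issues the same kind of address either way:
local = UnifiedAddress(server_id=3, device_type="fpga_dram", offset=0x1000)
remote = UnifiedAddress(server_id=7, device_type="ssd", offset=0x2000)
```

The point of the sketch is that the application never branches on locality itself; the framework resolves the address and selects the transport.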
  • Programmable Networking
    • SmartToR: This project aims to make ToR switches smart and have them work together to offload cloud applications.
    • SmartNIC Service: SmartNICs are broadly deployed in today’s data centers due to their performance, energy efficiency, and programmability benefits. They support a wide variety of offloads to serve host applications, from network functions to distributed applications. However, the tight coupling between SmartNICs and host machines makes it difficult to scale SmartNIC resources and causes resource underutilization. To address these issues, we articulate a vision of providing SmartNIC as a service.
    • SONiC Chassis
    • SONiC Web App: We are designing a new application type for SONiC called SWA (SONiC Web Application). It has a JavaScript-like grammar with built-in SONiC library support and covers both the control plane and the data plane. It comprises a customized compiler, a run-time daemon inside the SONiC OS, and an IDE based on VS Code. It will dramatically reduce the learning curve for SONiC development and boost SONiC’s popularity in the NOS (Network Operating System) area.
  • Intelligent Host Networking
    • NetKernel: NetKernel decouples the network stack from the guest virtual machine and offers it as an independent module. It represents a new paradigm in which the network stack is managed as part of the virtualized infrastructure, with important efficiency benefits: by gaining control and visibility of the network stack, operators can perform network management more directly and flexibly, for example multiplexing VMs running different applications onto the same network stack module to save CPU. Users also benefit from simplified stack deployment and better performance.
    • PipeDevice: PipeDevice is a new hardware-software co-design approach for low-overhead intra-host container communication.

Next-Gen AI Infrastructure

  • Performance
    • Tutel: Tutel is a high-performance Mixture-of-Experts (MoE) library that facilitates the development of large-scale DNN models and aims to be the foundation for extremely large model training and inference. Tutel is the first MoE framework to enable 4,096-A100 MoE training on Azure. With the diverse and flexible MoE algorithms supported by Tutel, developers across AI domains can run MoE more easily and efficiently. Tutel has been open-sourced at https://github.com/microsoft/tutel, and some of its features have been integrated into PyTorch Fairseq, ORT MoE, DeepSpeed, etc.
    • MSCCL: MSCCL is an inter-accelerator communication framework built on top of NCCL that uses its building blocks to execute custom-written collective communication algorithms. MSCCL’s vision is to provide a unified, efficient, and scalable framework for executing collective communication algorithms across multiple accelerators.
    • ARK: It presents a software-hardware co-design to implement an AI framework that is autonomously executed by GPUs without CPU involvement.
    • MAGE: MAGE aims to provide deep insight into the memory-side bottlenecks of AI workloads and to propose hardware-agnostic metrics for evaluating the performance of memory subsystems in AI accelerators.
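The routing decision at the heart of a Mixture-of-Experts layer like those Tutel accelerates can be sketched in a few lines. Real MoE layers run this per token on the GPU and then all-to-all the tokens to their experts; this sketch shows only the top-k gating step, with illustrative gate logits and k:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of gate logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def top_k_gate(logits, k=2):
    """Return the k chosen expert indices and their renormalized routing weights."""
    probs = softmax(logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in chosen)
    return chosen, [probs[i] / total for i in chosen]  # weights over chosen experts sum to 1

# One token's gate logits over 4 experts: the two highest-scoring experts get the token.
experts, weights = top_k_gate([1.0, -0.5, 2.0, 0.3], k=2)
```

Because each token activates only k experts, the layer's compute stays roughly constant as the number of experts (and thus total parameters) grows — the property that makes MoE attractive for extremely large models.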
  • Reliability
    • SuperBench: SuperBench is an automation system for performance validation and diagnosis in Azure AI infrastructure. It is used to guarantee hardware quality before delivering VMs to Azure customers and to identify hardware failures and performance regressions, which has helped Azure ship qualified emerging GPU SKUs, including the A100, MI200, etc. SuperBench has been open-sourced at https://aka.ms/superbench, aiming to provide a performance contract for AI hardware among customers, cloud providers, and hardware vendors.
    • Moneo: Moneo is a non-intrusive, cloud-friendly monitoring system that intelligently collects key architecture-level metrics at fine granularity in real time, without instrumenting or tracing the workloads. Moneo has been open-sourced at https://aka.ms/moneo for Azure customers.
    • NPKit: NPKit is an online automatic detection and diagnosis framework for collective communication libraries (NVIDIA NCCL, AMD RCCL, Microsoft MSCCL, etc.) used in large-scale AI workloads. It aims to provide a non-intrusive, lightweight solution to quickly find issues on nodes and/or InfiniBand, including NIC regression, port flapping, latency inflation, etc. With NPKit, users can quickly identify the root cause and resume their large-scale jobs. NPKit outputs a detailed timeline in the Google Trace Event Format, so users can leverage a trace viewer to deeply understand and analyze their workflow’s behavior.
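The Google Trace Event Format that NPKit emits is a plain JSON schema that chrome://tracing and Perfetto can open directly. The sketch below shows its shape with hypothetical collective-communication events — the event names and timings are illustrative, not NPKit's actual output:

```python
import json

def trace_event(name, start_us, dur_us, pid=0, tid=0):
    """One complete ("X"-phase) trace event: name, start timestamp, and duration in microseconds."""
    return {"name": name, "ph": "X", "ts": start_us, "dur": dur_us,
            "pid": pid, "tid": tid}

# A tiny timeline: two overlapping events on different threads of one process.
events = {"traceEvents": [
    trace_event("allreduce_send", start_us=0, dur_us=120, tid=0),
    trace_event("allreduce_recv", start_us=40, dur_us=180, tid=1),
]}

# Writing this string to a .json file yields something a trace viewer can load.
trace_json = json.dumps(events)
```

Because the format is just JSON, tools like NPKit can stream events cheaply at runtime and leave all visualization to existing viewers.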
  • Prediction
    • AISim: AISim provides additional support throughout the life cycle of our AI products and services, especially from design to operation, to ensure that they comply with fast-growing carbon-neutrality requirements while maintaining strong business growth.

AI for System and Networking

  • AI for Security
    • Bot Detection: One of MSRA-NRG’s goals is to improve the precision and recall of web crawler (bot) detection so that our customers’ website data will not be compromised. We have been working on various bot-detection solutions, from traditional rule-based methods based on the User-Agent and client IP to advanced methods based on deep learning models such as LSTM, CNN, and GCN. Our techniques serve all consumers within Microsoft.
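The rule-based side of bot detection described above can be sketched as a User-Agent signature match. Real deployments combine rules like these with per-IP behavioral features and the deep-learning models mentioned (LSTM/CNN/GCN); the signature list here is a small illustrative subset, not the production rule set:

```python
import re

# A few well-known crawler signatures; production rule sets are far larger.
BOT_SIGNATURES = re.compile(
    r"(googlebot|bingbot|crawler|spider|curl|python-requests)", re.IGNORECASE)

def looks_like_bot(user_agent: str) -> bool:
    """Flag a request as a likely crawler from its User-Agent string alone."""
    if not user_agent:               # an empty User-Agent is itself suspicious
        return True
    return bool(BOT_SIGNATURES.search(user_agent))

# Usage: a declared crawler matches a signature; a normal browser UA does not.
looks_like_bot("Mozilla/5.0 (compatible; Googlebot/2.1)")
looks_like_bot("Mozilla/5.0 (Windows NT 10.0) Chrome/120")
```

Rules like this are cheap and precise for declared crawlers, but they miss bots that spoof browser User-Agents — which is exactly the gap the model-based methods are meant to close.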
  • AI for Networking
    • R3Net:
    • OpenNetLab: OpenNetLab aims to build and provide a distributed networking platform with many collaborative nodes and a common benchmarking dataset (i.e., an ImageNet for the networking area) for researchers to collect real networking data and train/evaluate their AI models across various networking environments, including the Internet/cloud and wireless and mobile networks. Website: https://opennetlab.org