Research Faculty Summit 2018

Co-Chairs: Dan Bohus, Debadeepta Dey, and Siddhartha Sen [Video]

As systems become increasingly complicated, cater to large geographical areas, must seamlessly utilize an incredibly diverse array of computational resources, and serve real-time, safety-critical, and mission-critical applications, there is an emerging need for them to be self-aware or self-tuning. Advances in machine learning and artificial intelligence have recently led to algorithms that can learn high-performance policies over extremely large state spaces (e.g., solving games such as Ms. Pac-Man, Go, and poker, or learning driving policies for autonomous cars, drones, etc.). Just as the growth of cheap, abundant computing and specialized systems (e.g., dedicated accelerators for deep learning) has led to rapid advances in machine learning and artificial intelligence, there is an emerging opportunity for machine learning to help systems in return. In this session, we want to explore the technical opportunities and unique challenges that surface when applying machine learning to optimize large-scale distributed systems. Specifically, we want to explore challenges in developing systems that are self-tunable and resource-aware and that use machine learning to dynamically optimize a running system to achieve desired latency, throughput, and other system-dependent utility functions. Making significant progress in this area requires multiple disciplines to come together, namely machine learning, decision-making, distributed systems, and optimization.

Wolong: A Backend Optimizer for Deep Learning Computation and Open PAI

Chair: Jilong Xue

[Video]

This talk presents Wolong, a backend optimization system for accelerating deep learning computation on graphics processing units (GPUs). By analyzing the dataflow graph, it automatically applies operator batching and kernel fusion to avoid system overheads such as scheduling and kernel launch. This optimization is transparent to the AI model developer and, by integrating with TensorFlow, it can accelerate deep learning computation by an order of magnitude.

Open PAI: Open Source Initiative for AI Platform in China

Chair: Fan Yang

Open Platform for AI (Open PAI) is an open source platform for graphics processing unit (GPU) cluster management and resource scheduling. The platform’s design has been proven in Microsoft’s large-scale production environment.

Chair: Vani Mandava [Video]

Today, many of the technical products and services we use are created by one largely homogeneous group of people: men. That doesn’t make today’s technology bad or unhelpful. However, it does mean our society is missing out on other perspectives. Considering how much technology drives our economy and our lives, it makes sense to consider this question: “What if women, and all groups underrepresented in computing, were at the technology design and research table alongside white males?” Research tells us we would both solve different problems and solve problems differently. This panel of computing innovators will explore the possibility for exciting outcomes if more perspectives were at the table.

Chair: Manuel Costa [Video]

Confidential computing allows users to upload encrypted code and data to the cloud and get encrypted results back with guaranteed privacy. Confidential computing means cloud providers can’t see customers’ secrets even if cloud administrators are malicious or hackers have exploited kernel bugs in hosts. This session discusses research on confidential computing, including secure hardware containers, operating systems, compilers for secure code generation, cryptography, and redesigning cloud services.

Chair: Brendan Murphy [Video]

This session will give academic researchers the opportunity to discuss their work on continuous deployment and Microsoft practitioners the opportunity to describe the current state of continuous deployment practices within Microsoft product groups. The discussion will cover the practical issues encountered when deploying software for different classes of products, such as understanding customer acceptance of continually changing interfaces. The speakers will also describe the future directions and challenges that they foresee from both a technical and an engineering perspective, and the transformative role AI can play in continuous deployment.

Co-Chairs: Stefan Saroiu and Alec Wolman [Video]

In the quest for higher performance, modern CPUs make heavy use of speculative execution. Unfortunately, speculative execution opens the possibility of side-channel attacks in which malicious parties can read the memory of co-located processes, OSes, and VMs (e.g., Meltdown, Spectre). Similarly, in the quest for higher memory capacities, modern DRAMs have drastically increased the density of memory cells on a chip. This high cell density opens the possibility of attacks that cause bit flips in DRAM (e.g., Rowhammer). A single bit flip is sufficient to lead to serious security breaches, such as privilege escalation, remote login, or factoring an RSA private key.

Unfortunately, no single silver bullet for stopping these types of attacks exists. These attacks all stem from hardware “bugs.” While fixing each particular bug is feasible, the hardware life cycle is very long, and the fixes often come with serious performance and cost overheads. Software-based fixes offer a faster response but may also impose significant overhead. The goal of this session is to discuss state-of-the-art techniques for performing such attacks and for defending against them using both hardware and software.

Chair: Arvind Arasu [Video]

Blockchains are an emerging technology that promises to transform contracts and transactions between mutually untrusted entities. The goal of this session is to explore current trends in blockchain systems with a set of presentations focusing on performance and security of blockchain systems and on connections between blockchains and traditional database systems.

Chair: Surajit Chaudhuri [Video]

The increasing availability of data, coupled with breakthroughs in machine learning and AI, has raised the ambitions of enterprises to exploit insights from data to transform their businesses. This has challenged data platform builders to architect platforms that support the exploration of insights in an efficient and cost-effective manner. Conversely, it has raised the hopes of data platform architects that the telemetry data captured by data platforms can be harnessed to bring data-driven innovation to the platforms themselves, customizing them and making them adapt to workload and data characteristics. In this session, we will explore this duality of “Data Platforms for AI” and “AI for Data Platforms”.

Co-Chairs: Matthai Philipose and Amar Phanishayee [Video]

The fact that many commonly used networks take hours to days to train has motivated recent research on reducing training time. On the other hand, networks, once trained, are heavyweight dense linear algebra computations, usually requiring expensive acceleration to execute in real time. However, recent advances in algorithms, hardware, and systems have broken through these barriers dramatically. Models that took days to train are now reported to be trainable in under an hour. Further, with model optimization techniques and emerging commodity silicon, these models can be executed on the edge or in the cloud at surprisingly low energy and dollar cost. This session will present the ideas and techniques underlying these breakthroughs and discuss the implications of this new regime of “free inference and instant training.”

Chair: Saikat Guha

DataMap is the underlying technology platform that powers much of Microsoft’s technology stack for complying with privacy legislation worldwide, including the European Union’s General Data Protection Regulation (GDPR). It grew out of a Microsoft Research project published at the IEEE Symposium on Security and Privacy. This talk chronicles key moments and insights from DataMap’s journey from a research project to a production system.

Chair: Ant Rowstron

What will cloud storage of the future look like? Building storage at cloud scale creates new opportunities as well as new challenges. In this session, our panel of cloud-storage leaders will discuss innovations at multiple levels of the storage stack. They’ll answer many of the basic questions, including:

What are the media of the future – glass, DNA, or both?
How do we push for more performance at lower storage costs?
What does tiering look like in the cloud era?

Cloud storage leaders will answer these questions and many more while sharing their visions, views, and insights.

Co-Chairs: Yibo Zhu and Hitesh Ballani [Video]

Emerging networked systems, e.g., distributed storage and machine learning platforms, demand high-performance networking. Advanced networking hardware, including optical switches, programmable switches, RDMA NICs, and smart NICs, can answer this need. However, extracting the required step change in performance and functionality poses many challenges and will necessitate further hardware innovation. Hardware innovations in isolation, however, are not sufficient: they enable, yet also require, new architectures for the network and for applications. For example, new ultra-fast optical switching technologies will likely necessitate a step away from the traditional packet-switched network model. More flexible programs on switches will require new management frameworks. Even commodity hardware such as RDMA over Ethernet creates new problems such as congestion spreading and deadlocks. This session will bring together thought leaders at Microsoft and in academia to rethink how we co-design networked systems and applications with advanced networking hardware to fuel the cloud of the future.

Chair: Victor Bahl [Video]

Edge computing is a natural evolution of cloud computing, where server resources, ranging from credit-card-sized computers to micro data centers, are placed closer to data and information generation sources. Application and system developers use these resources to enable a new class of latency- and bandwidth-sensitive applications that are not realizable with current mega-scale cloud computing architectures. Industries ranging from manufacturing to healthcare are eager to develop real-time control systems that use machine learning and artificial intelligence to improve efficiency and reduce cost, and the Intelligent Edge is at the center of it all. In this session, we will explore this new computing paradigm by discussing emerging challenges, technologies, and future business impact.

Chair: Muthian Sivathanu

The massive scale of cloud infrastructure services enables, and often necessitates, vertical co-design by cloud providers and other vendors of the infrastructure stack. Unlike the traditional model where different vendors built different layers of the stack with generic interfaces, cloud infrastructure providers often control the entire stack, which provides the ability for different companies to co-design. Software-defined networking is an example of a co-design that has resulted in significant efficiency. However, most efforts at co-design have been massive engineering efforts involving ground-up rebuilding of the system, which is not a very efficient process.

Micro co-design is a MInimally invasive, Cheap, and retRO-fittable approach to co-design that extracts efficiency out of existing software infrastructure layers. It does this by making lightweight changes to generic software interfaces. This talk will describe multiple systems we have built with this approach in the space of big data analytics and deep learning infrastructure, demonstrating efficiency and functionality improvements that are both pragmatic and low-cost.

Chair: Dan Ports

A new generation of programmable hardware in the data center (e.g., FPGAs, programmable dataplane switches, and smart NICs) provides reconfigurable processing capabilities that are exposed to systems designers. Many of these components come from the networking world, where they are applied to traditional packet-processing tasks.

In this session, we ask whether they can be applied to distributed systems in order to achieve higher performance and reliability. The key to this approach is looking at the entire stack in order to co-design systems and hardware. We will look at several recent systems that apply this approach to different application domains and underlying hardware platforms.

Chair: Matthias Troyer

Traditional technology scaling trends have slowed, motivating many to proclaim the end of Moore’s law and the end of CMOS process technology. Instead of a cataclysmic end to computer systems as we know them, an evolution of new technologies is about to happen. One such technology is quantum computing, which holds the promise to solve classically intractable problems – ones that are completely out of the realm of today’s classical computing capabilities.

Until recently, quantum computing was a field accessible only to physicists and mathematicians. Pioneering work in the quantum field is now accessible to software developers and classical computer architects who are looking to develop full-scale quantum systems.

This session will cover some of the methodologies, tools, and architectures that will be required to build, deploy, and use a quantum computer. These full-stack solutions cover quantum algorithm development, software tools, classical computer architecture, and the quantum plane. The quantum plane in particular poses challenges that are very different from those of today’s classical computers.

Co-Chairs: Ganesh Ananthanarayanan, Junchen Jiang, Venkat Padmanabhan, and Siddhartha Sen [Video]

We are witnessing a huge surge in efforts and interest in developing machine-learning (ML) based solutions for optimizing large-scale networked systems. With the unprecedented availability of data and computing power, the early evidence on using ML for systems has been promising.

Microsoft has been at the forefront of applying data at massive scales to address pressing issues in large-scale systems. At the same time, there is a growing voice in the systems and networking community for more “principled” solutions. Its proponents are concerned that building solutions we don’t fully understand will ultimately come back to bite us.

This session intends to wade into this storm! What are the networked systems problems for which ML-based techniques are appropriate? Does the fact that the network is often a black box, with much uncertainty with regard to its state, make ML more or less appropriate for networked systems than for systems in general? What is the right mix of ML and traditional modeling and algorithms? Do we really need a full understanding of solutions, as the “traditionalists” insist?

Leading researchers in the field will discuss how cutting-edge ML advances can be applied to networked systems and lay out principles for ML-based networking and systems research in the coming years.

Chair: Chris Hawblitzel [Video]

Bugs in security-critical system software already cost society billions of dollars, and the need for secure software is increasing as more devices are connected to the Internet. This session will outline the security needs of network-connected systems and explore how formal verification can help secure them.

We’ll present research on bringing high-value security to low-cost devices, particularly those powered by microcontrollers – a class of devices ill-prepared for the security challenges of Internet connectivity. We’ll also discuss advances in verification tools and techniques that lead to real, usable verified software, with an emphasis on critical systems such as distributed and operating systems, cloud infrastructure, networking protocols and cryptography. Finally, we’ll present an overview of the Azure Sphere product as part of Microsoft’s efforts to secure MCU-based devices.

AI for AI Systems

AI Infrastructure and Tools

Wolong: A Backend Optimizer for Deep Learning Computation and Open PAI

Open PAI: Open Source Initiative for AI Platform in China

Computing Innovation and Diversity of Thought

Confidential Computing

Continuous Deployment: Current and Future Challenges

CPU and DRAM Bugs: Attacks and Defenses

Current Trends in Blockchain Technology

Database and Data Analytic Systems

Free Inference and Instant Training: Breakthroughs and Implications

From Paper to Production – Privacy Compliance Systems at Scale

Future of Cloud Storage Systems

Hardware-accelerated Networked Systems

Intelligent Edge

Micro Co-design for Efficient Cloud Infrastructure

Programmable Hardware for Distributed Systems

Quantum Computers: Software and Hardware Architecture

The Good, the Bad, and the Ugly of ML for Networked Systems

Verification and Secure Systems