Microsoft Azure and Microsoft Research take giant step towards eliminating network downtime

By , Technical Fellow & Chief Technology Officer, Azure for Operators

At the 26th ACM Symposium on Operating Systems Principles, better known as SOSP 2017, my colleagues described a new technology called CrystalNet, a high-fidelity, cloud-scale emulator that helps network engineers nearly eliminate network downtime related to routine maintenance and upgrades as well as software bugs and human errors. A collaboration between Microsoft Azure and Microsoft Research teams, CrystalNet applies two years’ worth of research and was bulletproofed by the Azure network engineers who operate one of the largest networks on the planet. The result is a first-of-its-kind set of tools that helps significantly decrease network downtime and increase availability for Azure customers.

Cloud networks are constantly growing and evolving. They are immensely complex, cloud-scale production networks that interconnect hundreds of thousands of servers using millions of wires and tens of thousands of network devices, sourced from dozens of vendors and deployed across the globe with stringent needs for reliability, security and performance. It’s imperative to continually monitor the pulse of these networks, to detect anomalies and faults, and to drive recovery at the millisecond level, much like monitoring a living organism. Networks are essentially the cloud: they are the core infrastructure that holds up cloud services and helps deliver availability across other key services such as compute and storage. However, cloud providers currently have few tools at their disposal to proactively foresee the impact of planned updates and changes, or of a bug in the system. Before making any changes to the network, engineers can run the proposed changes through our verification tools, which check the impact of the configuration changes before green-lighting them for deployment in our production networks.

Emulating a Cloud-Scale Network
The idea of testing before deploying is age-old, but following a two-year study by Microsoft Research of all documented outages across all major cloud providers, we believed that we could find most potential problems proactively if we first validated changes on an identical copy of the production network. By identical copy, we literally meant using the same network topology, hardware, software and configurations as in our production network. Using the cloud to build the cloud is a common design pattern for creating and running an enterprise high-performance cloud, but replicating the physical hardware of the entire network is too expensive, and where would we put it? To solve this problem more cost-efficiently, we run the device software and network configurations on virtualized hardware interconnected exactly as in the production network architecture. In effect, we create a large-scale, high-fidelity network emulator that allows Azure engineers to validate planned changes and gauge the impact of various updates and failure scenarios. We call our network emulator “CrystalNet,” as in seeing the future of your network through a crystal ball.
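To make "gauge the impact of failure scenarios" concrete, here is a deliberately tiny sketch in Python. It is not how CrystalNet works (CrystalNet runs real device software and configurations, whereas this toy only models links as a graph), and every name in it (`reachable`, `broken_pairs`, the four-device topology) is hypothetical. It only illustrates the workflow of trying a failure on a copy of the topology and seeing which device pairs would lose connectivity.

```python
from collections import deque


def reachable(links, src, dst):
    """Breadth-first search over an undirected list of (a, b) links."""
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    seen, queue = {src}, deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return True
        for nxt in neighbors.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


def broken_pairs(links, failed_links, pairs):
    """Return the pairs that are connected today but not after failed_links go down."""
    failed = {frozenset(link) for link in failed_links}
    remaining = [link for link in links if frozenset(link) not in failed]
    return [p for p in pairs
            if reachable(links, *p) and not reachable(remaining, *p)]


# Hypothetical topology: two top-of-rack switches, each linked to two spines.
LINKS = [("tor1", "spine1"), ("tor1", "spine2"),
         ("tor2", "spine1"), ("tor2", "spine2")]
PAIRS = [("tor1", "tor2")]

# Losing one uplink is survivable; losing both uplinks of tor1 is not.
single_failure = broken_pairs(LINKS, [("tor1", "spine1")], PAIRS)
double_failure = broken_pairs(LINKS, [("tor1", "spine1"), ("tor1", "spine2")], PAIRS)
```

A real emulator answers a much richer question, because routing protocols, firmware behavior and configuration details all affect the outcome, but the shape of the check is the same: apply the scenario to the copy, then compare against the baseline.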

Figure 1: The architecture of CrystalNet

CrystalNet faithfully emulates large-scale production networks to significantly decrease network incidents caused by software bugs, configuration errors and human errors. Unique properties include:

  • Scalability: CrystalNet builds virtual network links among emulated devices across different virtual machines (VMs). By adding VMs to the emulation cluster, CrystalNet naturally scales to emulate larger and growing networks.
  • Flexibility: CrystalNet supports a diverse range of software images of network devices from different vendors. These software images come either as standalone VMs or as Docker containers. To accommodate and manage them uniformly, we mock up the physical network with containers and run the heterogeneous device software images on top of our containers’ network namespaces. Network engineers don’t have to learn new device management tools; instead, they can access these device images via Telnet or SSH. Furthermore, we can include real hardware devices in our emulated network transparently.
  • Real-world cost efficiency: An emulated network must have boundaries to avoid the unrealistic task of reproducing an entire network. CrystalNet automatically identifies boundaries to help minimize the number of devices that need to be emulated. In this way, CrystalNet saves cost while making sure that the behavior of the emulated network is identical to that of the real network.
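The scalability and boundary ideas in the list above can be illustrated with a short Python sketch. It is a simplification under stated assumptions, not CrystalNet's actual algorithm, which this post does not detail: the boundary here is simply "devices just outside the set we chose to emulate", VM placement is naive round-filling, and all names (`find_boundary`, `place_on_vms`, `cross_vm_links`, the device names) are hypothetical.

```python
def find_boundary(links, emulated):
    """Devices outside `emulated` that are directly linked to a device inside it."""
    boundary = set()
    for a, b in links:
        if a in emulated and b not in emulated:
            boundary.add(b)
        elif b in emulated and a not in emulated:
            boundary.add(a)
    return boundary


def place_on_vms(devices, devices_per_vm):
    """Assign devices to VM indices in sorted order, devices_per_vm at a time."""
    return {dev: i // devices_per_vm for i, dev in enumerate(sorted(devices))}


def cross_vm_links(links, placement):
    """Links whose endpoints are both placed, but on different VMs."""
    return [(a, b) for a, b in links
            if a in placement and b in placement and placement[a] != placement[b]]


# Hypothetical slice of a datacenter network.
LINKS = [("tor1", "leaf1"), ("tor2", "leaf1"), ("leaf1", "spine1"),
         ("spine1", "core1"), ("core1", "wan1")]
EMULATED = {"tor1", "tor2", "leaf1", "spine1"}

# core1 sits just outside the emulated set; wan1 is never touched at all.
boundary = find_boundary(LINKS, EMULATED)

# Place emulated and boundary devices two per VM, then list the links that
# would have to be carried as virtual links between VMs.
placement = place_on_vms(EMULATED | boundary, devices_per_vm=2)
tunnels = cross_vm_links(LINKS, placement)
```

In this toy, growing the network only means adding more entries to `placement` and, if needed, more VM indices, which mirrors the point that adding VMs lets the emulation scale. Everything beyond the boundary (here `wan1`) is left out entirely, which is where the cost saving comes from.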

Networking Azure regions

Microsoft Azure lets customers use server resources at locations within geographic areas it calls “regions.” Each Azure region is defined by a bandwidth and latency envelope and contains one or more datacenters within a geographic area. Customers can specify the region where their data will be stored. As of this post, Azure has announced 42 regions globally, more than any other cloud provider, and 69 compliance offerings, the most comprehensive compliance coverage in the industry. In addition, new Azure Availability Zones provide new levels of resiliency for high availability within a region and across regions.

Previously, traffic between datacenters in the same region was carried on Microsoft’s wide-area network. As demand grew for high-capacity, low-latency networks between datacenters within the same region, the overall network design was upgraded and improved. The new design includes regional backbone networks that interconnect the datacenters inside the same region, bypassing the wide-area network. A challenging part of the transition to regional backbones was evolving the existing network architecture into a different one without incurring any downtime. Azure engineers had to keep in mind hardware capacity limitations, the shortage of IPv4 addresses, the load on network switches, the impact of configuration changes on routers, and network security.

The Proof is in the Pudding
Azure engineers have been using CrystalNet to successfully emulate Azure’s production networks. Specifically, our colleagues in Azure Networking have used CrystalNet’s unique capability to validate and reduce the risk of new network designs, major network architecture changes, network firmware/hardware upgrades and network configuration updates. They have also been using it as a realistic test environment for developing network automation tools and for developing our in-house switch operating system, Software for Open Networking in the Cloud, called SONiC.

CrystalNet was a critical tool in enabling the migration of Microsoft Azure’s regional backbones from old architectures to a new standardized architecture with zero user-impacting incidents, even though production traffic flowed through the network continuously during the migrations. Making such major changes to an operational network is often fraught with peril. Operators must guarantee that no noticeable disruption or degradation of existing traffic in the region will happen during or after the migration. CrystalNet allowed Azure service engineers to intensively validate and refine their operational plans and software tools. They discovered and resolved several hard-to-find bugs in tools and scripts, thus averting potential outages. The final migration plan, verified and perfected on CrystalNet, did not trigger a single incident. There were no incidents of casual human error such as typos either, which the engineers attributed to intensive practice sessions on the emulator.

Figure 2: Traffic before migrating to regional backbones

Figure 3: Traffic after migrating to regional backbones

Looking into the Crystal Ball
CrystalNet’s ability to accurately emulate a large, complex network is powerful. Azure customers have expressed interest in our network emulator because they see it as a way to reduce downtime in their own enterprise networks. Network device vendors are equally keen to use CrystalNet to test their products before putting them on the market. Our colleagues in academia and industrial research labs tell us that they want to use CrystalNet to experiment with and explore new and groundbreaking ideas. This genuine interest in CrystalNet is driving us to keep working with Azure engineers to perfect the technology.
