Motivation
Future telecom network infrastructure will be built on top of global cloud infrastructure, such as Microsoft’s Azure. Telecom network infrastructure refers to the vast installations of telecom equipment and network wiring that make our communication networks work. This includes the cell towers, the cable head-ends, the wiring closets, the telco central offices, the switches and servers, the cables and fibers, the whole nine yards. It is the behind-the-scenes machinery that makes modern communications possible, from smartphones and TV to the Internet.
For decades this has been a very mature industry segment. Hardware is manufactured by specialized telecom vendors like Huawei and Ericsson. Operators like AT&T procure the hardware, install the network, and run the daily operations. It is a closed ecosystem, dominated by incumbents. But this is about to change. Operators are under tremendous pressure from over-the-top (OTT) providers (such as Facebook) to innovate, yet the traditional infrastructure is rigid and unable to support fast-moving new businesses. With 5G on the horizon, the world’s telecom network infrastructure is ripe for total renovation.
At the same time, the IT industry has seen an even bigger phenomenon: the cloud computing revolution. Cloud companies like Microsoft have built out a massive global compute infrastructure. Microsoft’s Azure, for example, has 60+ regions, hundreds of data centers and edge sites, millions of server cores, and hundreds of thousands of miles of fiber and subsea cable. As Microsoft’s CEO Satya Nadella puts it, “Azure is being built as the world’s computer.”
And Azure can also be the world’s core telecom network infrastructure. Azure is built with the world’s most advanced compute and networking technologies. The Azure global cloud network, consisting of more than 130,000 miles of lit fiber and undersea cable systems, runs the latest software-defined networking (SDN) and fiber-optic technologies, and it hosts the world’s biggest FPGA cloud. In capability and capacity, Azure has already surpassed the telecom industry’s own infrastructure. This is the technological prognosis of Project Arno: the future of telecom network infrastructure will be built on global cloud infrastructure like Azure’s.
Technical Approach
The technical approach is to “cloudify” the telecom network infrastructure: reimagine its elements as virtualized components that can reside in general cloud infrastructure, and redesign them as cloud-scale services that auto-deploy and self-scale. For example, telecom hardware like switches and routers can be realized as virtual machines or containers hosted in Azure’s cloud and edge data centers. The fibers and cables can be virtual links inside Azure’s cloud network. The operation and business support systems, known as OSS/BSS, can likewise be cloud software, much like business software in any other industry. Further, the specific telecom functions should be redesigned as cloud services or microservices. The benefit is that they can then self-scale. Traditionally, business expansion (such as adding new subscribers) often required operators to install new hardware. With cloudified infrastructure, it is simply a matter of allocating more resources to the relevant cloud services, which hyperscale clouds like Azure do automatically and intrinsically, as sketched below.
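To make the self-scaling idea concrete, here is a minimal Python sketch of a traffic-driven scaling rule. The per-instance capacity figure, the constants, and the function name are illustrative assumptions, not Azure APIs or Project Arno code.

    # Hypothetical sketch: traffic-driven scaling for a cloudified network function.
    import math

    SESSIONS_PER_INSTANCE = 50_000   # assumed capacity of one virtualized instance
    MIN_INSTANCES = 2                # keep redundancy even at low load

    def desired_instances(active_sessions: int) -> int:
        """How many instances the service should run for the current load."""
        needed = math.ceil(active_sessions / SESSIONS_PER_INSTANCE)
        return max(MIN_INSTANCES, needed)

    # Adding subscribers no longer means installing hardware: the orchestrator
    # just converges the running instance count toward the desired one.
    for sessions in (10_000, 120_000, 900_000):
        print(sessions, "sessions ->", desired_instances(sessions), "instances")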
Implementing this design is of course no small feat, and there are serious technical challenges. Azure was designed for general cloud computing, not specifically for telecom. Indeed, the prevailing opinion in the industry was that general-purpose public clouds were not suitable for telecom infrastructure. Could they be fast enough? Could they be reliable enough? What about “5 nines” (the requirement that a telecom service be available more than 99.999% of the time)? These were all good questions with no answers; it had never been done before.
Research Problems
To prove the feasibility of cloudifying telecom network functions, and to gain operational experience, Project Arno picked a complicated system, the Evolved Packet Core (EPC), and refactored it to be “cloud-native” on Azure. Hundreds of network functions typically make up the entire telecom network infrastructure, and EPC is one of the most important and complicated. EPC is the core of a 4G LTE network: it is where mobile data traffic is connected to services provided by the mobile operators or on the Internet. EPC also manages every aspect of the LTE network, including subscription, session, mobility, and QoS. Traditionally, it is a large and expensive piece of equipment built from high-speed switches and high-performance packet-processing boards. By early 2017, a few companies (like Affirmed Networks) had started producing virtualized versions of EPC (known as vEPC) in the form of software that runs on commodity PC servers. But no one had yet tried building EPC in the cloud as a cloud-native service.
Project Arno took the source code of a commercial vEPC and reconstructed it into a “cloud-native” service on Azure. This new MSR cloud EPC can run in any Azure data center, self-replicating on demand and self-load-balancing based on traffic, and it is deployable to any Azure region in five minutes. This, as Microsoft researchers have shown, is how telecom network functions should be built in the era of the cloud; the sketch below illustrates what such a decomposition implies.
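As a rough illustration of the decomposition, the cloud EPC can be thought of as a small set of independently deployable services that can be stood up in any region. The service list follows the standard EPC decomposition; the deploy() helper and the region name are placeholders, not the actual MSR cloud EPC pipeline or an Azure SDK call.

    # Hypothetical sketch: the cloud EPC as independently deployable services.
    EPC_SERVICES = [
        "mme",   # mobility management entity (control plane)
        "sgw",   # serving gateway (user-plane anchor)
        "pgw",   # packet data network gateway
        "hss",   # home subscriber server (subscription data)
    ]

    def deploy(region: str, service: str, replicas: int) -> None:
        # Placeholder: a real pipeline would create the instances, put them
        # behind a load balancer, and register health probes.
        print(f"deploying {service} x{replicas} to {region}")

    def deploy_cloud_epc(region: str, replicas_per_service: int = 2) -> None:
        for service in EPC_SERVICES:
            deploy(region, service, replicas_per_service)

    deploy_cloud_epc("westus2")   # region name is just an example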
Next came the question of performance. Many telecom functions need to process network packets within a fraction of a millisecond. Can a cloudified telecom function perform as fast as specialized hardware? Working with the Azure hardware team, Project Arno researchers developed a data-plane acceleration method that can speed up telecom network functions by up to 100 times. The solution is to leverage Azure’s massive FPGA cloud: Azure has built one of the largest FPGA clouds in the world, called Brainwave, and uses it to accelerate AI and machine learning. Accelerating telecom functions can serve as another application for Brainwave.
To prove this solution, Project Arno researchers optimized the MSR cloud EPC with Azure FPGAs. An EPC system typically needs to handle millions of packets per second (Mpps). By early 2018, Project Arno’s cloud EPC was able to achieve 55 Mpps with one Arria 10 FPGA and virtually no CPU utilization. By comparison, commercial vEPC products announced by various incumbent telecom vendors could only reach somewhere between 5 Mpps and 20 Mpps, on much more expensive commodity servers with many CPU cores. Project Arno succeeded in showing that a cloudified EPC can outperform the telecom industry’s own vEPC products. The telecom industry did understand the importance of FPGA and hardware acceleration; it just did not know how to do it well in a virtualized or cloud environment. Here we were already years ahead.
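The general pattern behind this kind of acceleration is a fast-path/slow-path split: the control plane installs per-session forwarding rules into the FPGA, matching packets are forwarded in hardware, and only unmatched or exceptional packets are punted to the CPU. The Python sketch below illustrates that split; the flow-table class and field names are illustrative assumptions, not Project Arno’s actual design or the Brainwave interface.

    # Hypothetical sketch of the fast-path/slow-path split behind FPGA offload.
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class FlowKey:
        tunnel_id: int          # e.g., a GTP-U tunnel endpoint identifier

    @dataclass
    class FlowAction:
        next_hop: str
        qos_class: int

    class FpgaFlowTable:
        """Stand-in for the hardware match-action table that the CPU programs."""
        def __init__(self):
            self.rules = {}

        def install(self, key: FlowKey, action: FlowAction) -> None:
            self.rules[key] = action       # in hardware: write a match-action entry

        def lookup(self, key: FlowKey):
            return self.rules.get(key)

    def handle_packet(table: FpgaFlowTable, key: FlowKey) -> str:
        action = table.lookup(key)
        if action is not None:
            return f"fast path: forwarded by FPGA via {action.next_hop}"
        return "slow path: punt to CPU for per-session handling"

    table = FpgaFlowTable()
    table.install(FlowKey(tunnel_id=0x1A2B), FlowAction(next_hop="sgw-3", qos_class=9))
    print(handle_packet(table, FlowKey(tunnel_id=0x1A2B)))   # fast path
    print(handle_packet(table, FlowKey(tunnel_id=0xFFFF)))   # slow path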
Another question is availability. Can public clouds, built from millions of commodity servers, match expensive, purpose-hardened telecom hardware and meet the industry doctrine of “5 nines” (99.999% uptime)? The traditional method widely used in telecom is redundancy: for a critical component, operators often install an extra identical unit as a “hot backup”, so that when the first unit fails, the second one takes over. Traditional hot-backup approaches typically require special hardware and low-level software provisioning for each deployment. Such custom provisioning is not possible in a hyperscale cloud because of its scale, so we had to look for more flexible and less expensive ways to achieve the same availability. If we implement a network function with Azure features like Availability Sets and Availability Zones, it will be deployed across fault domains in different parts of a data center, across multiple data centers, or even across different cloud regions to maximize survivability under system failures, operational errors, or even natural disasters.
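A back-of-the-envelope illustration of why this works: if replicas are placed in independent failure domains, the service is unavailable only when every replica is unavailable at once. The 99.9% per-replica figure below is an assumption for illustration, not a measured Azure number.

    # Illustrative availability arithmetic, assuming replica failures are
    # independent (which is what separate fault domains and zones aim for).
    def combined_availability(per_replica: float, replicas: int) -> float:
        # The service is down only if every replica is down at the same time.
        return 1.0 - (1.0 - per_replica) ** replicas

    PER_REPLICA = 0.999   # assumed availability of one commodity-server instance
    for n in (1, 2, 3):
        print(f"{n} replica(s): {combined_availability(PER_REPLICA, n):.9f}")
    # 1 replica  -> 0.999000000 (three nines)
    # 2 replicas -> 0.999999000 (six nines, already past the 5-nines doctrine)
    # 3 replicas -> 0.999999999 (nine nines)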
Traditional telecom functions have a design deficiency: they keep a lot of state in the network (in contrast to the modern Internet design, which keeps minimal state in the network). This works fine with expensive telecom hardware that seldom fails, but it becomes a big problem for cloud-scale availability, because that state must be kept in sync across the cloud, and the synchronization must be fast enough not to impede the telecom function’s performance. This is a serious distributed-systems challenge. Project Arno researchers spent months researching and developing a state-synchronization mechanism that meets the requirements of telecom functions.
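One generic pattern for keeping such state available is primary-backup replication with version numbers, sketched below in Python. This is only an illustration of the shape of the problem, not Project Arno’s actual mechanism; the class and field names are assumptions.

    # Hypothetical sketch: replicating per-subscriber session state to a backup
    # so that a replacement instance can take over after a failure.
    class SessionStore:
        """Holds per-subscriber session state with a version for ordering updates."""
        def __init__(self):
            self.state = {}   # subscriber id -> (version, session dict)

        def apply(self, subscriber: str, version: int, session: dict) -> None:
            current_version = self.state.get(subscriber, (0, None))[0]
            if version > current_version:        # ignore stale or duplicate updates
                self.state[subscriber] = (version, session)

    class Primary:
        """Applies each update locally and replicates it before moving on."""
        def __init__(self, backups):
            self.local = SessionStore()
            self.backups = backups
            self.version = 0

        def update_session(self, subscriber: str, session: dict) -> None:
            self.version += 1
            self.local.apply(subscriber, self.version, session)
            # Replicate synchronously so a failover to a backup loses no
            # acknowledged state; making this fast enough is the hard part.
            for backup in self.backups:
                backup.apply(subscriber, self.version, session)

    backup = SessionStore()
    primary = Primary([backup])
    primary.update_session("imsi-001010123456789", {"bearer_id": 5, "qos_class": 9})
    print(backup.state["imsi-001010123456789"])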
These are examples of the critical technologies needed to cloudify the telecom infrastructure. Project Arno researchers have worked hard over the past several years to make them ready.