To ensure that employees had a reliable hybrid work experience at the onset of the COVID-19 pandemic, Steve Means, principal cloud network engineering manager in Microsoft Digital Employee Experience, and his team set out to make sure that the company’s internal network would hold up.
They were cautiously optimistic—the team had just rebuilt the entire network, including the virtual private network (VPN). This network supports access to key internal servers with protected data, personnel information, and other critical assets that must be on lockdown.
“Our network has done very well for employees working remotely,” Means says. “So far, we’ve seen a really strong performance from our network and VPN, specifically.”
The strong response has been fueled by an earlier decision the team made to reduce the workload that the company pushes through its VPN pipes. The team did that by implementing split tunneling at most of its locations worldwide, which funnels the majority of the company’s mobile workload to the internet.
Split tunneling became possible because Microsoft is nearly 100 percent in the cloud, which allows its remote workers to access core applications and experiences over the internet via Microsoft Azure and Office 365. Before the company migrated to the cloud, everything would have been routed through VPN.
“It really helps us that most of our mobile workload—including traffic to high volume and performance sensitive Office 365 and Azure applications—is securely routed directly over the internet,” Means says.
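To make the idea concrete, here is a minimal sketch of the routing decision behind split tunneling: traffic bound for corporate address space goes through the VPN, and everything else goes straight to the internet. The prefixes and the `route_for` helper are hypothetical and chosen only for illustration; they are not Microsoft's actual address plan or client configuration.

```python
import ipaddress

# Hypothetical corporate prefixes that must stay on the VPN
# (an assumption for illustration, not a real address plan).
CORPORATE_PREFIXES = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
]


def route_for(destination: str) -> str:
    """Return "vpn" for corporate destinations and "internet" for all others."""
    address = ipaddress.ip_address(destination)
    if any(address in prefix for prefix in CORPORATE_PREFIXES):
        return "vpn"
    return "internet"


# 203.0.113.10 is a documentation-range address standing in for a public service.
for dest in ("10.20.30.40", "203.0.113.10"):
    print(dest, "->", route_for(dest))
```

In a real deployment this decision is usually expressed as routes pushed to the VPN client rather than as application code, but the logic is the same: only the listed prefixes consume VPN capacity.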
In retrospect, adopting split tunneling was a pivotal decision.
“It is allowing our employees to maintain their normal level of productivity even as they work remotely,” he says.
As an example, he points to how employees are now using Microsoft Teams.
“Our employees have significantly increased their usage of voice and video conferencing on Teams,” he says. “We’ve been able to sustain this massive spike in Teams usage without major issues because it’s being routed over the internet—leaving our VPN capacity for just necessary connections between users and our internal resources.”
There have been challenges, however, which began when Microsoft’s employees in China started working from home.
“Unlike here at our headquarters and other worldwide locations, when our employees in China work remotely, everything they do goes exclusively through our VPN pipe,” Means says.
That meant 100 percent of the workload of employees in Shanghai and Beijing was suddenly going through already heavily used VPN gateways.
“It was almost an overnight phenomenon,” Means says. “We were suddenly seeing usage of 85 to 95 percent of our network bandwidth and our VPN capacity.”
Already tight before the spread of COVID-19 began, VPN was quickly becoming a bottleneck in China.
“We started asking ourselves a lot of questions,” Means says. “Can we handle the expected number of concurrent VPN sessions? How is bandwidth holding up for employees? What’s their experience like? Are they all being successful?”
Quick action was needed.
“We had data to answer all the questions, but what we didn’t have was a single pane of glass where we could quickly look at everything to see what was happening across the company’s infrastructure,” Means says. “And company leaders were trying to figure out how to respond to the crisis—they needed data from us, and they needed it quickly.”
The answer was to identify the data that mattered the most and aggregate it into a Microsoft Power BI dashboard, which the company now uses to track all its VPN systems as the COVID-19 situation evolves.
As for the offices in Shanghai and Beijing, Means’s team worked with local internet providers to increase VPN capacity by 50 percent so they had enough headroom to handle the new usage safely.
“That was a budget decision,” Means says. All they had to do was sign some contracts—no new hardware was needed. “Once we agreed that it was the right thing to do, we were able to remove that bottleneck in less than a day.”
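As a rough illustration of the headroom that a 50 percent capacity increase buys, the sketch below recomputes utilization under the simplifying assumption that the load stays the same while capacity grows. The function name is hypothetical, and real traffic would not stay constant.

```python
def utilization_after_increase(current_utilization: float, capacity_increase: float) -> float:
    """Utilization after capacity grows by `capacity_increase` (0.5 means 50 percent).

    Assumes the offered load does not change.
    """
    return current_utilization / (1.0 + capacity_increase)


# The 85 to 95 percent range quoted above, with 50 percent more capacity.
for before in (0.85, 0.95):
    after = utilization_after_increase(before, 0.50)
    print(f"{before:.0%} -> {after:.0%}")
```

Under that assumption, the 85 to 95 percent utilization quoted above falls to roughly 57 to 63 percent.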
Investments in VPN infrastructure paying off
The notion of Microsoft’s employees and vendors frequently working remotely was daunting, but Means was confident that the company’s VPN infrastructure would support that sudden spike in demand.
Three years ago, he would not have been so optimistic.
“We were in a tough spot a few years ago,” Means says. “We had multiple and complex reasons for why our employees’ end-to-end VPN experience wasn’t very strong—it was a complicated stack that had multiple potential failure points.”
The team ran into issues on the Windows side, there were challenges with the network, and the company was using several different VPN clients at once, which created confusion and complexity for employees. Means’s team worked closely with the Windows team, and through direct partnership and engagement, helped drive significant stability improvements in the Windows native VPN client.
“We saw a connectivity success rate in the 60 to 65 percent range, which is very low,” Means says. “That meant that a third of people would run into an issue every time they tried to work remotely.”
A fix was needed.
“We knew this could become a problem if we had a situation where we needed many of our employees to work remotely,” Means says. “So, we invested heavily in strengthening our VPN service by focusing on the user experience and partnering closely with internal teams.”
“We built the new system so it could support over 200,000 concurrent sessions,” Means says. “In an extreme situation, we could support that many people on VPN at the same time.”
Microsoft has 221,000 employees and a large contingent of vendors who work on the company’s network. They don’t all work at the same time, but the goal was to cover the worst-case scenario and to future-proof the solution.
“Across the world, we normally have about 55,000 employees connect via VPN on a given day,” Means says. “With everyone working remotely, that has climbed as high as 128,000 employees and vendors per day, including about 45,000 per day at our headquarters in Redmond.”
Previously, employees used a large number of gateways to access the company’s internal network, but many of those gateways provided poor connectivity.
“We consolidated the gateways to data centers and locations with reliable and plentiful bandwidth,” Means says. “This shrunk the number of gateway sites, but increased overall reliability and made it so we could handle more concurrent connections.”
The hybrid design that the team put together uses Microsoft Azure Traffic Manager to geolocate VPN users. “That allowed us to send them to their nearest gateway and to meet scale demands,” he says. “We used Azure Active Directory (AAD) to authenticate our users and to validate the status of their device before allowing them on VPN.”
The team also began using servers that can handle 30,000 or 60,000 users each, much more than the old servers that could only handle 750 to 2,000 users. “Theoretically, we could now handle 500,000 concurrent VPN connections worldwide,” Means says.
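The per-server figures above make for a simple back-of-the-envelope comparison. The sketch below estimates how many servers of each size it would take to carry the 200,000-session design target, treating the quoted "users" per server as concurrent sessions. That reading, and the `servers_needed` helper, are assumptions for illustration; the calculation ignores redundancy, bandwidth limits, and geographic placement.

```python
def servers_needed(target_sessions: int, sessions_per_server: int) -> int:
    """Smallest number of servers whose combined capacity covers the target."""
    # Integer ceiling division avoids floating-point rounding.
    return (target_sessions + sessions_per_server - 1) // sessions_per_server


TARGET_SESSIONS = 200_000  # design target quoted in the article

for label, capacity in [
    ("old server, low end", 750),
    ("old server, high end", 2_000),
    ("new server, smaller", 30_000),
    ("new server, larger", 60_000),
]:
    print(f"{label}: {servers_needed(TARGET_SESSIONS, capacity)} servers")
```

On these assumptions, the old hardware would need between 100 and 267 servers to reach the target, while the new servers would need between 4 and 7.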
Means says the improvement in the company’s VPN service was substantial, so much so that employees forgot it was working behind the scenes when they worked remotely.
Despite being worked harder than ever before, the company’s VPN infrastructure is performing at a high level. “Knock on wood, there have been no major incidents,” Means says.
Importantly, VPN is allowing employees to get their work done.
“Today, even as many of our employees work remotely, our success rate is at 92 percent,” Means says. “That’s one of the highest rates we’ve ever recorded—the only reason it isn’t at 99 percent is because that number includes drops because of reboots during patch updates, getting disconnected from Wi-Fi, and home network or internet service provider issues.”
Employee productivity also has held strong.
“We measure employee productivity, and the productivity of our software engineers in particular,” Means says. “We look at pull requests, commits per day, and other indicators—so far, we haven’t seen any measurable drop in work performance.”
Means says the situation is creating a learning moment for his team.
“One thing that we’re learning is it’s really about the data,” he says. “There are so many things we can measure—finding the right things to measure so we can take the right actions is critical.”
The team’s data-centric approach to VPN and networking also has allowed it to make smart investments, like provisioning capacity only when required. It also helps the team respond quickly when needed—as was the case when Italy tightened its remote working restrictions.
“We doubled capacity in London, which is where we run the VPN connection for our employees in Italy,” Means says. “Having good data allows us to quickly take proactive action when needed and to stay ahead of the game at all times.”
The team also saw the potential for a bottleneck at its headquarters in Redmond, Washington, where the number of concurrent sessions that VPN needed to support was climbing close to capacity. The company addressed this concern by adding another VPN gateway.
“This has caused us to reflect on our readiness efforts overall,” Means says. “We’ve used this as an opportunity to improve how we do things.”
The team expects to keep learning and adding to the VPN capabilities.
Tips for retooling VPN at your company
For enterprises and organizations looking to optimize and scale out their VPN capabilities, some of the best practices shown above and recommended by Microsoft are:
- Consider reducing the load on your VPN infrastructure by using split tunnel VPN: send traffic for “known good,” well-defined SaaS services like Teams and other Office 365 services directly to the internet or, optimally, send all non-corporate traffic to the internet if your security rules allow.
- Collect user connection and traffic data for your VPN infrastructure in a central location, and use modern visualization services, like Power BI, to identify hot spots before they happen and to plan for growth.
- Utilize Azure Sentinel to organize log collections, including user connection and traffic data, in a central location for VPN infrastructure.
- If possible, use a dynamic and scalable authentication mechanism, like Azure Active Directory, to avoid the trouble of certificates and, if your VPN client is Active Directory aware (like the Azure OpenVPN client), to improve security with multi-factor authentication (MFA).
- Geographically distribute your VPN sites to match major user populations, and use a geo-load balancing solution, such as Azure Traffic Manager, to direct users to the closest VPN site and distribute traffic between your VPN sites.
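The last tip, directing users to the closest VPN site, can be sketched as a nearest-site lookup by great-circle distance. This is a simplified stand-in, not how Azure Traffic Manager works: Traffic Manager makes its routing decisions at the DNS level. The site list, the approximate coordinates, and the `nearest_gateway` helper are all hypothetical.

```python
import math

# Hypothetical gateway sites with approximate coordinates (latitude, longitude).
GATEWAYS = {
    "Redmond": (47.67, -122.12),
    "London": (51.51, -0.13),
}


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometers between two points on Earth."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = math.sin(d_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def nearest_gateway(user_lat: float, user_lon: float) -> str:
    """Name of the gateway site closest to the user's location."""
    return min(
        GATEWAYS,
        key=lambda name: haversine_km(user_lat, user_lon, *GATEWAYS[name]),
    )


print(nearest_gateway(45.46, 9.19))     # a user near Milan
print(nearest_gateway(47.61, -122.33))  # a user near Seattle
```

Distance alone is only a first approximation. A production setup would also weigh measured latency and how much capacity each site has left.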
Finally, and probably most important, know the limits of your VPN connection infrastructure and how to scale out in times of need. Limits such as total available bandwidth and maximum concurrent user connections per device will determine when you’ll need to add more VPN devices.
If your devices are physical hardware, having additional supply on hand or a rapid supply chain source will be critical. For cloud solutions, knowing ahead of time how and when to scale will make the difference.
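One way to act on those limits is a simple check that compares observed peaks against each device's ceilings and flags when it is time to add capacity. The sketch below is an assumption-laden illustration: the `GatewayLimits` type, the example numbers, and the 80 percent threshold are all hypothetical and should be replaced with your own device specifications and risk tolerance.

```python
from dataclasses import dataclass


@dataclass
class GatewayLimits:
    """Hypothetical per-device ceilings, taken from a vendor data sheet."""
    max_sessions: int
    max_bandwidth_mbps: float


def needs_scale_out(
    limits: GatewayLimits,
    peak_sessions: int,
    peak_bandwidth_mbps: float,
    threshold: float = 0.80,  # assumed trigger point, not a recommended value
) -> bool:
    """True when either sessions or bandwidth reaches the threshold share of its limit."""
    session_utilization = peak_sessions / limits.max_sessions
    bandwidth_utilization = peak_bandwidth_mbps / limits.max_bandwidth_mbps
    return max(session_utilization, bandwidth_utilization) >= threshold


limits = GatewayLimits(max_sessions=30_000, max_bandwidth_mbps=10_000.0)
print(needs_scale_out(limits, peak_sessions=27_000, peak_bandwidth_mbps=4_000.0))
print(needs_scale_out(limits, peak_sessions=12_000, peak_bandwidth_mbps=3_000.0))
```

The check takes the worse of the two utilizations because either limit, sessions or bandwidth, can become the bottleneck first.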
Azure offers a native, highly scalable VPN gateway, as well as the most common third-party VPN and SD-WAN network virtual appliances in the Azure Marketplace.
For more information on these and other Azure and Office network optimization practices, please see: