Optimizing network performance for Microsoft Office 365

Sep 20, 2018


As the earliest adopter of Microsoft products, Microsoft Digital deployed Microsoft Office 365 across the company. To optimize network capacity and performance, we took time to create and implement strategic plans for network-related technologies. We used industry-leading performance and migration approaches and adopted cloud infrastructure services to successfully move our complex global environment to Office 365.

As the earliest adopter of Microsoft products, Microsoft Digital, formerly Microsoft Core Services Engineering and Operations (CSEO), began deploying Microsoft Office 365 in 2011. To optimize for network capacity and performance, they implemented strategic plans for network-related technologies. Microsoft Digital has continued to evolve industry-leading performance and migration approaches, and they have adopted cloud infrastructure services to promote a successful transition to Office 365.

Situation

After many years of investment in the on-premises network, Microsoft Digital and its internal customers were accustomed to a highly reliable connectivity experience with Microsoft Office products. When Microsoft Digital began planning and testing the move to cloud-based Office 365, they analyzed network infrastructure and processes to find potential performance issues before beginning the migration. This analysis was important to learn whether the existing infrastructure would support the demands of moving a large enterprise service to the cloud. And it was critical to maintaining the quality of service necessary for employee productivity in Office 365.

Putting migration to the test

To test the Office 365 migration, Microsoft Digital identified about 2,000 datacenter-hosted mailboxes to migrate to the cloud starting on a Friday night. At that time, Microsoft Exchange used caching to compensate for latency in mailboxes that were geographically distant from their users, and regular email synchronization to local mailboxes provided optimal performance. The initial mailbox migration was completed successfully over the weekend. On Monday morning, users logged in and their client machines began to synchronize through the Internet to the cloud all at once. The sudden demand overloaded the gateway to the Internet and caused an outage.

Valuable lessons were gained from this test, which have been applied to migration planning processes since then. A key lesson was that collaboration and communication between network and migration teams, working together on more extensive modeling or smaller-scale tests, might have revealed that the infrastructure could not support a 2,000-user migration. Having tightly integrated teams that can identify issues on multiple levels is the best way to avoid migration missteps.
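The kind of modeling mentioned above can be sketched as a back-of-the-envelope calculation. The following is a minimal sketch, not the analysis Microsoft Digital performed: the average cache size per mailbox, the gateway capacity, and the share of the gateway available for synchronization are all hypothetical inputs chosen for illustration.

```python
import math


def resync_hours(mailboxes, avg_cache_gb, gateway_mbps, usable_fraction=0.5):
    """Hours for all clients to rebuild their local cache through one gateway.

    All inputs are hypothetical planning values, not figures from the test.
    """
    total_megabits = mailboxes * avg_cache_gb * 8 * 1000  # GB -> megabits (decimal units)
    usable_mbps = gateway_mbps * usable_fraction          # share not needed by other traffic
    return total_megabits / usable_mbps / 3600


def max_batch(window_hours, avg_cache_gb, gateway_mbps, usable_fraction=0.5):
    """Largest number of mailboxes whose resync fits inside a time window."""
    usable_mbps = gateway_mbps * usable_fraction
    megabits_in_window = usable_mbps * window_hours * 3600
    return math.floor(megabits_in_window / (avg_cache_gb * 8 * 1000))


# Assumed scenario: 2,000 mailboxes, 1 GB of cached mail each, 1 Gbps gateway.
hours_needed = resync_hours(2000, 1.0, 1000)
batch_for_two_hours = max_batch(2, 1.0, 1000)
print(f"Full resync: {hours_needed:.1f} h; batch that fits in 2 h: {batch_for_two_hours}")
```

Even with rough inputs, a calculation like this shows whether a planned batch can resynchronize within a morning or should be split into smaller waves.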

Fortunately, the technology used during the test was slated to be replaced by new technology that could support the traffic and significantly increase bandwidth. This test experience accelerated that replacement and eventually allowed for successful continued mailbox migration.

Planning cloud service performance

The Exchange cloud migration experiment was the foundation for a broad, ongoing cloud performance initiative. An essential first step was to closely engage the network and infrastructure teams, who could identify the tools and strategies necessary for a major migration from on-premises servers to the cloud and allow Microsoft to take full advantage of the benefits of cloud services.

Traditional on-premises server systems, despite lacking the scalability of the cloud, had one advantage: network connectivity had been optimized and provisioned over many years, and any bottlenecks had been addressed. Before moving to Office 365, Microsoft used remote datacenters for many user locations. And, like most IT organizations, Microsoft Digital already had experience with explicit planning for network capacity, beyond simply laying the largest available cable between users and servers.

For Microsoft Digital to maintain the performance that users expect while migrating powerful applications such as SharePoint and Exchange to their cloud-based versions, it needed to ensure availability and connectivity.

During the migration, Microsoft Digital managed and addressed performance issues that users may have experienced by:

  • Planning for and testing appropriate network connectivity to the cloud.
  • Implementing thorough migration and performance planning for services such as SharePoint Online and Skype for Business Online.
  • Embracing new cloud infrastructure services such as ExpressRoute for Office 365.

Solution

When Microsoft Digital began large-scale migration to Office 365, readiness efforts included performing high-level capacity analyses, adding redundancy to ensure Internet availability, and optimizing connectivity for all users. And each Office 365 service presented unique migration challenges that had to be considered and planned for. SharePoint Online and Skype for Business are two examples of the diversity of the performance optimization experiences, efforts made, and lessons learned as part of the Office 365 migration. Microsoft Digital continues to serve as the company’s first and best customer today, piloting new cloud solutions that will offer even better network performance for Office services in the future.

Driving availability and connectivity through optimization

Teams within Microsoft Digital make broad and continual performance optimization efforts across the Office 365 suite of applications to enable a high level of employee productivity during and after migrations. These efforts include performing capacity planning calculations, providing redundancy and resiliency where appropriate, and creating the shortest path possible between the client and the cloud.

Calculating network capacity requirements

Enterprise employees place many demands on a network. Information workers, salespeople, and engineers all have different network utilization patterns and productivity needs. When Microsoft Digital is preparing to bring a new site online or relocate a team, a carrier services manager uses a generalized calculation to determine how much service a given location will need.

For example, capacity guidelines within Microsoft Digital were formerly 110 Kbps per salesperson and 300 Kbps per developer. As services have become data-hungry and teams have become more widely dispersed, the typical user, regardless of job function, is now estimated to use about 400 Kbps of bandwidth during normal activity. Although this is a subjective guideline that may be affected by many factors (concentration of users, size of campus, remote access, non-user access, and so on), it is a practical starting point. Estimating initial capacity will ultimately reduce the level of investment needed to provide an acceptable level of service and satisfy business needs at that location.
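As a rough illustration of this generalized calculation, the sketch below multiplies headcount by the per-user guideline. The 110 Kbps and 400 Kbps figures come from the text; the site size of 250 users is a hypothetical example, and a real estimate would adjust for the other factors listed above.

```python
KBPS_PER_USER = 400  # current per-user guideline quoted in the text


def site_bandwidth_mbps(users, kbps_per_user=KBPS_PER_USER):
    """Starting-point bandwidth estimate for a site, in Mbps."""
    return users * kbps_per_user / 1000


# Hypothetical 250-person sales office, under the old and new guidelines.
old_estimate = site_bandwidth_mbps(250, kbps_per_user=110)
new_estimate = site_bandwidth_mbps(250)
print(f"Old guideline: {old_estimate} Mbps, current guideline: {new_estimate} Mbps")
```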

Microsoft Digital has a policy to deploy an Internet edge stamp that can sustain the expected capacity demand for the next 18 months. The design can scale to double the capacity during the useful life of the hardware, which is typically three to five years. This relatively simple policy provides the advantage of being able to size the circuit (which may be owned by an external provider) up or down as needed when a team moves or its size changes.

Provisioning in this manner is much easier and less expensive than deploying more equipment and increasing the size of the edge later. This practice provides a great degree of agility as well as the ability to optimize connectivity for both cost and performance, with minimal complexity and low risk of outages.
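The sizing policy can be checked with a short projection. This is a minimal sketch under stated assumptions: the 18-month horizon and the ability to double capacity come from the text, while the starting demand and the 30 percent annual growth rate are hypothetical values, and compound growth is only one possible demand model.

```python
import math


def projected_demand(current_mbps, annual_growth, months):
    """Demand after `months`, assuming compound annual growth (an assumption)."""
    return current_mbps * (1 + annual_growth) ** (months / 12)


def months_until_exceeded(current_mbps, annual_growth, ceiling_mbps):
    """Months until projected demand reaches the given capacity ceiling."""
    return 12 * math.log(ceiling_mbps / current_mbps) / math.log(1 + annual_growth)


current = 100.0   # hypothetical demand today, in Mbps
growth = 0.30     # hypothetical annual growth rate

initial_capacity = projected_demand(current, growth, 18)  # sized for the next 18 months
max_capacity = 2 * initial_capacity                       # design can scale to double
runway = months_until_exceeded(current, growth, max_capacity)
print(f"Deploy {initial_capacity:.0f} Mbps, scalable to {max_capacity:.0f} Mbps, "
      f"enough for about {runway / 12:.1f} years")
```

Under these assumed inputs the doubled capacity lasts roughly as long as the three-to-five-year hardware life mentioned above; a faster growth rate would shorten that runway.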

By investing in thorough migration preparation, Microsoft Digital has seen a positive effect on the speed of the migration, availability of the service, and quality of the user experience. When planning for Office 365 migration, Microsoft Digital recommends investing the time to create profiles, calculate capacity needs, and build the network out in anticipation of these needs. Office 365 has published capacity-planning tools to help customers size their own bandwidth needs (see the For More Information section).

Providing Internet redundancy for performance and availability

In locations where Internet connectivity is critical, such as operations centers where employees must work on site with no option for remote work, Microsoft Digital introduced circuit redundancy by providing more than one physical connection to the site via different carriers. If one carrier service fails, a secondary carrier can provide backup service. This redundancy is critical in business climates that rely heavily on cloud productivity services like Office 365, and where Internet connection failures result in reduced employee efficiency.
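The benefit of a second carrier can be estimated with a simple availability calculation. The sketch below assumes that carrier failures are independent, which may not hold if circuits share a conduit or an upstream provider, and the 99.5 percent availability figure is hypothetical.

```python
HOURS_PER_YEAR = 8760


def combined_availability(*availabilities):
    """Availability of a site that is up whenever at least one circuit is up.

    Assumes circuit failures are statistically independent.
    """
    probability_all_down = 1.0
    for availability in availabilities:
        probability_all_down *= (1.0 - availability)
    return 1.0 - probability_all_down


def expected_downtime_hours(availability):
    """Expected hours per year without connectivity."""
    return (1.0 - availability) * HOURS_PER_YEAR


single = 0.995                               # hypothetical single-carrier availability
dual = combined_availability(single, single)
print(f"One carrier: {expected_downtime_hours(single):.1f} h/year down; "
      f"two carriers: {expected_downtime_hours(dual):.2f} h/year down")
```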

Microsoft Digital also uses global network redundancies for alternative routing in case of disaster. This strategy was tested in 2011, when a 9.0-magnitude earthquake and subsequent tsunami in Japan brought down power and severed network connections with the west coast of the United States for several days. The existing network redundancy allowed Microsoft Digital to route around the severed connection to reach other worldwide destinations through unaffected redundant regional connections.

Most enterprise networks were not built to optimize the flow of traffic from local intranets to services on the Internet. In many IT organizations, intranet performance is still prioritized over Internet performance. For large organizations planning a migration to Office 365 cloud services, a prioritized focus on Internet performance and availability, with increased emphasis on Internet connectivity and redundancy, is important to a successful transition.

Optimizing remote connectivity to Office services

As the suite of Office-related services expanded and employees relied on it more heavily for productivity, Microsoft Digital did a networking “reality check,” comparing connectivity methods based on user location. This process involved closely examining the data traffic patterns of Exchange and SharePoint services. Although the traffic patterns varied significantly with each service, the fundamental connectivity optimization made by Microsoft Digital improved network performance across each service and improved overall user productivity.

During this optimization effort, Microsoft employees in major campuses and large office buildings were connected through the corporate intranet, which has reliable and robust private networking.

Similarly, smaller Microsoft branch offices (Internet-connected clients) previously connected to the corporate intranet via a leased line or a persistent VPN, so their local connectivity was an extension of the corporate intranet, with no on-site edge. This was a suboptimal experience for users accessing Internet-based Office services.

For example, the closest hub to a sales office in New York might be in North Carolina; to reach the Internet, traffic would first have to travel from New York to the corporate intranet in North Carolina. Microsoft Digital improved connectivity in such situations by creating an Internet edge at these branch sites, which gave them direct Internet access.

To further increase efficiency and improve the user experience, Microsoft Digital allowed users to use the Internet path even if they were simultaneously connected to the intranet via VPN or other remote access solutions. This was accomplished via a “split tunneling” configuration. All these measures set up a client connectivity model that was ready for the move to the Internet-delivered public cloud service that is Office 365.
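The routing decision behind split tunneling can be illustrated in a few lines. This is only a sketch of the logic: real VPN clients implement it with operating-system route tables, and the corporate prefix used here (10.0.0.0/8) is a hypothetical example rather than Microsoft's actual address space.

```python
import ipaddress

# Hypothetical prefixes that must stay on the VPN tunnel.
CORPORATE_PREFIXES = [ipaddress.ip_network("10.0.0.0/8")]


def choose_path(destination):
    """Return 'vpn' for corporate destinations, otherwise the direct Internet path."""
    address = ipaddress.ip_address(destination)
    if any(address in network for network in CORPORATE_PREFIXES):
        return "vpn"
    return "internet"


# An intranet host goes through the tunnel; a cloud endpoint
# (documentation-range address used as a stand-in) goes direct.
print(choose_path("10.1.2.3"))
print(choose_path("203.0.113.10"))
```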

Direct paths to the Internet typically required advanced data loss prevention measures. This usually involved integrating advanced client security protection, such as antivirus and antimalware safeguards, Windows Firewall, and cloud-based security. Before migrating to Office 365, Microsoft Digital had to examine all data as it left the internal network. With Office 365, however, the destination cloud services scan and analyze files to determine whether they violate any policies, and traffic can safely travel, for example, from a home office to the Internet to the service without the added security measure of sending it through the managed edge.

Microsoft Digital strives to further remove the dependency of apps on the corporate network and use Azure or Office 365 as the default access points. The corporate network serves a significantly reduced role in the modern app infrastructure, and a reduction in corporate network saves costs. Microsoft Digital calls this initiative Internet First.

The current network design routes traffic through central gateways, creating additional latency and a less optimal user experience. The future network design will route traffic directly to the internet, reducing latency and reliance on private networks.
Figure 1. Current and future network design

Optimizing SharePoint performance

Microsoft Digital focused its performance optimization efforts for SharePoint Online on two major areas: a gradual, staged migration plan that mitigated most impacts of migration on performance, and a SharePoint portal performance analysis that led to important configuration optimizations in caching, content rendering, and navigation. Because of these efforts, Microsoft Digital enjoyed an especially smooth migration of SharePoint content and portals to Office 365.

Optimizing portals with performance tuning

For on-premises SharePoint portals whose size and complexity require a complete redesign for optimal migration to Office 365, portal performance in the cloud may be affected by conditions that did not exist in on-premises environments. The migration and major redesign of the Microsoft internal employee portal, MSW, offers a real-world illustration of these challenges.

For example, when the new MSW portal first went into testing on Office 365 with the same web parts from on-premises, pages took about 20 seconds to load—too long. Microsoft Digital discovered that half of this delay was caused by navigation issues, and the other half was caused by content query work. (There were over seven content query web parts on the portal’s home page when initial testing began.)

The page-loading issue was quickly determined to be caused by expensive server-side rendering that did not benefit from the same cache profiles on Office 365 that existed on-premises. It was resolved by switching to metadata-managed navigation and using the new Content Search Web Part.

Figure 2. The new MSW portal running in SharePoint Online

Any large portal redesign in SharePoint Online requires performance tuning. A few of the performance considerations for the MSW portal redesign are highlighted in the following paragraphs. Many more considerations for page-loading optimization can be found in the article Tune SharePoint Online performance (see the For More Information section).

Caching. Moving portals to Office 365 involves transitioning from an on-premises model with a few dedicated machines hosting an entire service to a shared, multi-tenant model with many machines hosting many workloads. When MSW was hosted on-premises, four front-end servers were dedicated to handling user requests.

Generally, those servers all had MSW in cache, so users experienced good performance. SharePoint Online, however, uses orders of magnitude more front-end servers shared across all workloads and sites within the customer’s tenancy. The cache is also shared across many customers with different data to cache, so any cache is short-lived on any particular front-end server and is less likely to contain the specific desired portal pages.

Relying on object caching was, therefore, not an effective way to ensure an optimized user experience for MSW in SharePoint Online. Microsoft recommends avoiding dependency on SharePoint Online front-end server caches by using other approaches to performance optimization that do not rely on object caching, including use of the Content Search Web Part and metadata-managed navigation.
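A simplified model shows why front-end caches become unreliable as the server pool grows. The sketch assumes that requests arrive randomly (Poisson), are spread uniformly across front-end servers, and that a cached page stays warm only if the same server served it within the last `ttl_seconds`. The request rate, time to live, and the pool size of 4,000 are hypothetical; only the four dedicated on-premises servers come from the text.

```python
import math


def warm_cache_probability(requests_per_second, ttl_seconds, servers):
    """Chance that a request lands on a server that still has the page cached.

    With Poisson arrivals split evenly across servers, this is the probability
    that the chosen server saw at least one request in the previous TTL window.
    """
    per_server_rate = requests_per_second / servers
    return 1.0 - math.exp(-per_server_rate * ttl_seconds)


# Hypothetical portal traffic: 2 requests per second, 60-second cache lifetime.
dedicated = warm_cache_probability(2, 60, servers=4)
shared = warm_cache_probability(2, 60, servers=4000)
print(f"4 dedicated servers: {dedicated:.3f}, 4,000 shared servers: {shared:.3f}")
```

In this model the same traffic that keeps four dedicated servers almost permanently warm rarely finds a warm cache in a large shared pool, which is consistent with the recommendation to avoid depending on object caching.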

Content Search Web Part. In the on-premises implementation, MSW used the Content Query Web Part (CQWP) to write dynamically rendered content. However, with the reduced dependency on caching in Office 365 came reduced performance with use of the CQWP. Any server-side work that was necessary to generate a page would not be cached, causing a performance decrease in Office 365. To restore performance, MSW replaced the CQWP with the Content Search Web Part (CSWP), to quickly deliver results to the user by retrieving and rendering data independently of the server. Using the CSWP resulted in significantly better page loading performance in SharePoint Online and was a major factor in making the portal responsive in the cloud.

Navigation. Because front-end caching cannot be relied on in a shared front-end server model, structural navigation can be problematic for complex site structures in Office 365. When MSW was migrated to the cloud, it initially retained the structural navigation of its on-premises implementation. It quickly became apparent that this was affecting performance due to reliance on front-end server caching.

Because MSW did not require navigation security trimming (the ability to hide navigational links to restricted files), Microsoft Digital decided to switch to managed navigation, which provided a substantial performance benefit in SharePoint Online. Office 365 offers three navigation choices: structural, managed, and search-driven. If security trimming is required and the site has a simple structure, structural navigation is still a viable option.

For sites that require security trimming and have a more complex structure, search-driven navigation (which requires customization of the master page but provides a fast load time and locally cached navigation structure) may be considered. Simply put, choosing the appropriate navigation option for the needs of the site can greatly improve site performance.
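The navigation guidance in the preceding paragraphs can be summarized as a small decision function. This sketch encodes only the rules stated above; the function name and its two inputs are illustrative, and a real decision would weigh additional site requirements.

```python
def recommend_navigation(needs_security_trimming, complex_structure):
    """Suggest a SharePoint Online navigation type from two site traits."""
    if not needs_security_trimming:
        # No need to hide restricted links: managed navigation performed best for MSW.
        return "managed"
    if complex_structure:
        # Trimming plus a complex hierarchy: search-driven (needs master page customization).
        return "search-driven"
    # Trimming with a simple structure: structural navigation remains viable.
    return "structural"


print(recommend_navigation(needs_security_trimming=False, complex_structure=True))
```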

Optimizing Skype for Business performance

When Microsoft Digital began its transition to Office 365, the team responsible for Lync and Skype for Business services was already involved with a major performance improvement effort as part of the transition from Lync to Skype for Business. This work included categorizing service challenges and large-scale, long-term planning for improved performance and availability both on-premises and in the cloud. This improvement project expanded to include an intense evaluation of the cloud management service and strategic work to prepare the network environment and optimize for the cloud, as well as a cloud migration plan that took advantage of flexible hybrid opportunities.

Preparing the network environment

Knowing that a Skype for Business cloud migration would require changes to the network environment for optimal performance, Microsoft Digital took advantage of the Microsoft Click-to-Run technology to reduce complexity and IT overhead, allowing Office 365 to manage Office and Skype for Business client updates.

By moving to the cloud, Microsoft Digital was able to manage updates and ensure the most current versions of the client at all times, guaranteeing availability of the newest features and the greatest reliability.

Because real-time communication is extremely sensitive to network conditions, Microsoft Digital also prioritized a deep understanding of three key elements of capacity and traffic planning before beginning cloud migrations:

  • They analyzed federated traffic with external organizations in a hybrid environment to prevent potential bottlenecks at the network edge.
  • They developed a deep understanding of the traffic flows within the network to optimize routes for voice traffic.
  • They ensured that their private connectivity, which reduced complexity in the network integration for Skype for Business in Office 365, had the appropriate markings for quality of service and guaranteed prioritization to the Office 365 network.
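The quality-of-service markings mentioned in the last item are typically DSCP values carried in the IP header. The text does not say which markings Microsoft Digital used, so the sketch below assumes DSCP 46 (Expedited Forwarding), a value commonly used for voice. It shows how a DSCP value maps to the legacy TOS byte and how an application could request that marking on a UDP socket; some operating systems ignore or override this socket option and apply markings through policy instead.

```python
import socket

DSCP_EF = 46  # assumption: Expedited Forwarding, commonly used for voice traffic


def dscp_to_tos(dscp):
    """Convert a 6-bit DSCP value to the 8-bit TOS/Traffic Class byte."""
    if not 0 <= dscp <= 63:
        raise ValueError("DSCP must be between 0 and 63")
    return dscp << 2  # DSCP occupies the upper six bits of the byte


def open_marked_udp_socket(dscp=DSCP_EF):
    """Create a UDP socket that asks the OS to mark outgoing packets."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, dscp_to_tos(dscp))
    return sock


print(f"DSCP {DSCP_EF} corresponds to TOS byte {dscp_to_tos(DSCP_EF)}")
```

Markings only help if every network segment along the path, including the private connection, is configured to trust and prioritize them.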

Historically, major IT investments have included tools, systems, and personnel for managing infrastructure and applications; moving to the cloud shifted some of those burdens to Office 365 and enabled Microsoft Digital to focus more resources on adoption and improving control over the Skype for Business user experience. Microsoft Digital has seen fewer incidents caused by network changes, because dedicated network links now connect users directly to server farms in the cloud. In Office 365, the risk of user or service impact caused by internal network changes or configuration drift is greatly reduced.

Optimization for transition to the cloud

Because of the real-time nature of the Skype for Business service, optimizing performance is even more critical than with other Office 365 services; even a few seconds of lost voice, video, or data affect user productivity. Therefore, before Microsoft Digital could migrate Skype for Business to the cloud, it was crucial to evaluate change and develop new strategies for availability, reliability, and performance.

When Microsoft Digital began to transition Skype for Business to the cloud, the existing wireless networks were optimized for data, but not for real-time communications such as voice. With the increase in the number and variety of mobile devices in the workplace, use of wireless connections more than doubled during meetings in less than a year. Additionally, transitioning to open floor plans to reduce physical footprints and accommodate new working models resulted in increased user density and additional meeting spaces.

To accommodate channel overlap and improve signal optimization in this changing wireless environment, Microsoft Digital re-tuned their wireless access point placements and deployment configurations based on analysis of changing user behaviors, varied user density, and new floor plan trends.

At the same time, Microsoft Digital was seeing a widespread increase in Windows 8 machines that were optimized for Wireless N network hardware rather than wired connections. Microsoft Digital standardized the environment for Wireless N, ensuring clear communications by proactively keeping its wireless network drivers as current as possible and continuing to actively push driver updates.

Deploying private, managed connections with Azure ExpressRoute

Microsoft Digital uses Microsoft ExpressRoute for Office 365, the same technology used for Microsoft Azure. ExpressRoute provides Microsoft Digital with private network connectivity that offers more predictable performance and guaranteed service availability.

A standard public Internet connection can be an uncertain and unpredictable network path in which service quality depends on carriers, traffic, intermediaries, and proximity to cloud datacenters. With ExpressRoute, organizations contract with a Microsoft partner that is a network service provider or an exchange provider. These companies provide connectivity into the Microsoft network, which connects all Microsoft datacenters, offering predictable performance, data privacy, and guaranteed service availability.

Although ExpressRoute is being used by Microsoft Digital, ExpressRoute is not required or recommended for Office 365 customers except in a small number of situations.

These situations are a) regulatory requirements that mandate a direct network connection, or b) network deficiencies, discovered during a required customer network assessment for Skype for Business voice and video, that ExpressRoute can address. In the situations where ExpressRoute for Office 365 is implemented, Microsoft should be directly involved to ensure a successful implementation.

Best practices

  • Plan for Internet capacity requirements before migration.
  • Migrate one Office 365 service at a time.
  • Treat your Internet network connection as critically as you would treat your network connection to on-premises datacenters.
  • Deploy an Internet edge stamp that can scale to double the expected capacity during the life of the hardware.
  • Plan for pilot testing, troubleshooting, and optimization.
  • Use migration as an opportunity to carefully evaluate and prioritize what should be migrated. Mitigate risks by not migrating lower priority data.
  • Assess on-premises SharePoint page navigation models that rely on caching for performance and evaluate the appropriate navigation model for optimization.
  • Use hybrid environments where appropriate to manage migration.
  • Where the situation warrants it, use private network connections with Azure ExpressRoute for Office 365 to address the performance uncertainties of an Internet-connected network path.

Conclusion

Microsoft Digital has been planning and carrying out Office 365 migrations since 2011, and the stories shared here are just a few examples of the achievements made and lessons learned along the way. Together, these experiences support a single, essential message: investing the time and effort necessary to implement thorough and strategic planning for network connectivity results in fewer migration complications and better overall performance. Of course, IT organizations will inevitably need to perform some degree of additional optimization and troubleshooting work before, during, or after migration. Guidance is available for performance optimization and troubleshooting for all phases of Office 365 migration.