Boosting Microsoft’s migration to the cloud with Microsoft Azure

Oct 27, 2023

[Editor’s note: This content was written to highlight a particular event or moment in time. Although that moment has passed, we’re republishing it here so you can see what our thinking and experience were like at the time.]

When Microsoft set out to move its massive internal workload of 60,000 on-premises servers to the cloud and to shutter its handful of sprawling datacenters, there was just one order from company leaders looking to go all in on Microsoft Azure:

Please start our migration to the cloud, and quickly.

However, it was 2014, the early days of moving large, deeply rooted enterprises like Microsoft to the cloud. And the IT pros in charge of making it happen had few tools to do it and little guidance on how to go about it.

“As a team, we had a lot to learn,” says Pete Apple, a principal service engineer in Microsoft Digital. “We started with a few Azure subscriptions. We were kicking the tires, figuring things out, assessing how much work we had to do.”

As it turns out, quite a bit of work. More on that in a moment.

Now, seven years later, the company’s migration to the cloud is 96 percent complete and the list of lessons learned is long. Six IT datacenters are no more and there are fewer than 800 on-prem servers left to migrate. And that massive workload of 60,000 servers? By redesigning the company’s applications with modern engineering and pruning unused workloads, the team has shrunk that number considerably: Microsoft is now running on 7,474 virtual machines in Azure and 1,567 virtual machines on-premises.

“What we’ve learned along the way has been rolled into the product,” Apple says. “We did go through some fits and starts, but it’s very smooth now. Our bumpy experience is now helping other companies have an easier time of it (with their own migrations).”

The beauty of a decision framework

It didn’t start that way, but migrating a workload to Azure inside Microsoft is super smooth now, Apple says. He explains that everything started working better when they began using a decision tree like the one shown here.

Microsoft Digital’s migration to the cloud decision tree

A flow-chart graphic that takes the reader through the decisions the CSEO cloud migration team had to make each time it proposed moving an internal Microsoft workload to the cloud.
The cloud migration team used this decision tree to guide it through migrating the company’s 60,000 on-premises servers to the cloud. (Graphic by Marissa Stout | Inside Track)

First, the Microsoft Digital migration team members asked themselves, “Are we building an entirely new experience?” If the answer was “yes,” the decision was easy: build a modern application that takes full advantage of building natively in the cloud.

If the answer was “no, we need to move an existing application to the cloud,” the decision tree became more complex, and the team had to answer a couple of tough questions.

Do you want to take the Platform as a Service (PaaS) approach and rebuild your experience from the ground up to take full advantage of the cloud? (Not everyone can afford the time or has the budget to do this.) Or do you want to take the Infrastructure as a Service (IaaS) approach? This means lifting and shifting now, with a plan to rebuild later when it makes more sense to start fresh.

Tied to this question were two kinds of applications: those built for Microsoft by third-party vendors, and those built by Microsoft Digital or another team in Microsoft.

On the third-party side, flexibility was limited—the team would either take a PaaS approach and start fresh, or it would lift and shift to Azure IaaS.

“We had more choices with the internal applications,” Apple says, explaining that the team divvied those up between mission-critical and noncritical apps.

For the critical apps, the team first sought money and engineering time to start fresh and modernize. “That was the ideal scenario,” Apple says. If money wasn’t available, the team took an IaaS approach with a plan to modernize when feasible.

Noncritical apps, by contrast, were lifted and shifted and left as-is until they were no longer needed. The idea was that they would either be shut down once something new could be built to absorb their task, or die on the vine when they became irrelevant.

“In a lot of cases, we didn’t have the expertise to keep our noncritical apps going,” Apple says. “Many of the engineers who worked on them moved on to other teams and other projects. Our thinking was, if there is some part of the experience that became important again, we would build something new around that.”
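
The decision tree described above can be sketched in a few lines of Python. This is only an illustration of the branching logic, not the team's actual tooling: the `Workload` fields, the `Approach` names, and the use of a single `rebuild_funded` flag to stand in for "money and engineering time are available" are all assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Approach(Enum):
    BUILD_CLOUD_NATIVE = "build a new cloud-native application"
    REBUILD_ON_PAAS = "start fresh on PaaS"
    LIFT_AND_SHIFT = "lift and shift to IaaS"
    LIFT_AND_SHIFT_MODERNIZE_LATER = "lift and shift to IaaS, modernize when feasible"
    LIFT_AND_SHIFT_UNTIL_RETIRED = "lift and shift to IaaS, leave as-is until retired"


@dataclass
class Workload:
    # Hypothetical fields; the article does not define a data model.
    name: str
    is_new_experience: bool = False
    is_third_party: bool = False
    is_mission_critical: bool = False
    rebuild_funded: bool = False  # assumed proxy for budget and engineering time


def choose_approach(workload: Workload) -> Approach:
    """Walk the decision tree from the top question down."""
    if workload.is_new_experience:
        return Approach.BUILD_CLOUD_NATIVE
    if workload.is_third_party:
        # Third-party apps had only two options: start fresh or lift and shift.
        if workload.rebuild_funded:
            return Approach.REBUILD_ON_PAAS
        return Approach.LIFT_AND_SHIFT
    if workload.is_mission_critical:
        # Critical internal apps: modernize if funded, otherwise plan to later.
        if workload.rebuild_funded:
            return Approach.REBUILD_ON_PAAS
        return Approach.LIFT_AND_SHIFT_MODERNIZE_LATER
    # Noncritical internal apps were moved as-is and left until retired.
    return Approach.LIFT_AND_SHIFT_UNTIL_RETIRED


unfunded_critical_app = Workload("payroll", is_mission_critical=True)
print(choose_approach(unfunded_critical_app).value)
```

Writing the tree as a single function keeps the order of the questions explicit: the "new experience" check comes first, and the critical-versus-noncritical split only applies to internally built applications.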

Getting migration right

When Microsoft started its migration to the cloud, the company had a lot to learn, says Pete Apple, a principal service engineer in Microsoft Digital. That migration is nearly finished and those learnings? “They have been rolled into the product,” Apple says. (Photo by Jim Adams | Inside Track)

Apple says the Microsoft Digital migration team initially thought the migration to the cloud would be as simple as implementing one big lift-and-shift operation. It was a common mindset at the time: Take all your workloads and move them to the cloud as-is and figure out the rest later.

“That wasn’t the best way, for a number of reasons,” he says, adding that there was a myriad of interconnections and embedded systems to sort out first. “We quickly realized our migration to the cloud was going to be far more complex than we thought.”

After a lot of rushing around, the team realized it needed to step back and think more holistically.

The first step was to figure out exactly what they had on their hands—literally. Microsoft had workloads spread across more than 10 datacenters, and no one was tracking who owned all of them or what they were being used for (or if they were being used at all).

Longtime Microsoft culture dictated that you provision whatever you thought you might need, and that you go big to make sure you covered your worst-case scenario. Once the upfront cost was covered, teams would often forget about how much it cost to keep all those servers running. With teams spinning up production, development, and test environments, the amount of untracked capacity was large and always growing.

“Sometimes, they didn’t even know what servers they were using,” Apple says. “We found people who were using test environments to run their main services.”

And figuring out who was paying for what? Good luck.

“There was a little bit of cost understanding, of what folks were thinking they had versus what they were paying for, that we had to go through,” Apple says. “Once you move to Azure, every cost is accounted for—there is complete clarity around everything that you’re paying for.”

There were some surprising discoveries.

“Why are we running an entire Exchange Server with only eight people using it? That should be on Office 365,” Apple says. “There were a lot of ‘let’s find an alternative and just retire it’ situations that we were able to work through. It was like when you open your storage facility from three years ago and suddenly realize you don’t need all the stuff you thought you needed.”
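
The kind of inventory review described here can be sketched as a simple audit pass over server records. Everything in this example is an assumption made for illustration: the `ServerRecord` fields, the `min_users` threshold, and the wording of the findings are hypothetical and do not reflect the tooling the team actually used.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ServerRecord:
    # Hypothetical inventory fields for illustration only.
    name: str
    owner: Optional[str]
    environment: str  # "production", "development", or "test"
    active_users: int
    serves_production_traffic: bool


def audit(record: ServerRecord, min_users: int = 10) -> List[str]:
    """Return the findings that would prompt a follow-up for this server."""
    findings = []
    if record.owner is None:
        findings.append("no known owner")
    if record.active_users == 0:
        findings.append("unused: candidate for retirement")
    elif record.active_users < min_users:
        findings.append("low usage: look for a shared alternative")
    if record.serves_production_traffic and record.environment != "production":
        findings.append("main service running in a non-production environment")
    return findings


small_exchange = ServerRecord(
    name="exchange-small",
    owner="messaging-team",
    environment="production",
    active_users=8,
    serves_production_traffic=True,
)
print(audit(small_exchange))
```

Each check maps to one of the discoveries in the text: servers nobody claimed, servers nobody used, lightly used servers that could be retired in favor of a shared service, and test environments quietly running main services.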

Moving to the cloud created opportunities to do many things over.

“We were able to clean up many of our long-running sins and misdemeanors,” Apple says. “We were able to fix the way firewalls were set up, lock down our ExpressRoute networks, and (we) tightened up access to our Corpnet. Moving to the cloud allowed us to tighten up our security in a big way.”

Essentially, it was a greenfield do-over opportunity.

“We didn’t do it enough, but when we did it the right way, it was very powerful,” says Heather Pfluger, a partner group manager on Microsoft Digital’s Platform Engineering Team who had a front-row seat during the migration.

Not seizing that opportunity often enough led to many mistakes, which makes sense because the team was trying to both learn a new technology and change decades of ingrained thinking.

“We did dumb things,” Pfluger says. “We definitely lifted and shifted into some financial challenges, we didn’t redesign as we should have, and we didn’t optimize as we should have.”

All those were learning moments, she says. She points to how the team now uses an optimization dashboard to buy only what it needs. It’s a change that’s saving Microsoft millions of dollars.

Apple says those new understandings are making a big difference all over the company.

“We had to get people into the mindset that moving to the cloud creates new ways to do things,” he says. “We’re resetting how we run things in a lot of ways, and it’s changing how we run our businesses.”

He rattled off a long list of things the team is doing differently, including:

  • Sending events and alerts straight to DevOps teams versus to central IT operations
  • Spinning up resources in minutes for just the time needed, versus having to plan for long racking times or VMs that used to take a week to manually build out
  • Dynamically scaling resources up and down based upon load
  • Resizing month-to-month or week-to-week based upon cyclical business rhythms versus using the old “continually running” model
  • Having some solutions’ costs drop to zero or near zero when idle
  • Moving from custom Windows operating system images for builds to using Azure gallery images and Azure automation to update images
  • Creating software-defined networking configurations in the cloud versus physical networked firewalled configurations that required many manual steps
  • Managing on-premises environments with Azure tools
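
Two items on that list, scaling with load and letting costs fall to zero when idle, come down to the same small calculation. The sketch below is a generic target-tracking rule written for illustration; it is not an Azure API, and the function name, parameters, and default limits are all assumptions.

```python
import math


def desired_instances(
    requests_per_second: float,
    capacity_per_instance: float,
    min_instances: int = 0,
    max_instances: int = 20,
) -> int:
    """Return how many instances are needed for the current load.

    A floor of zero lets an idle workload scale down to nothing,
    while the ceiling caps spend during an unexpected spike.
    """
    if capacity_per_instance <= 0:
        raise ValueError("capacity_per_instance must be positive")
    needed = math.ceil(requests_per_second / capacity_per_instance)
    return max(min_instances, min(max_instances, needed))


# Idle, moderate, and spike loads for an instance that handles 100 requests/s.
for load in (0, 250, 10_000):
    print(load, desired_instances(load, capacity_per_instance=100))
```

Raising `min_instances` above zero trades some idle cost for faster response when traffic returns, which is the kind of choice the old “continually running” model never exposed.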

Pfluger’s team builds the telemetry tools Microsoft employees use every day.

“There is so much more we can do now,” she says, explaining that the goal is always to improve satisfaction. “We don’t want our internal users to find problems with our reporting. We want to find them ourselves and fix them so fast that our employee users never notice anything was wrong.”

And it’s starting to work.

“We’ve gotten to the point where our employee users discovering a problem is becoming more rare,” Pfluger says. “We’re getting better, but we still have a long way to go.”

Apple hopes everyone continues to learn, adjust, and find better ways to do things.

“All of our investments and innovations are now all occurring in the cloud,” he says. “The opportunity to do new and more powerful things is just immense. I’m looking forward to seeing where we go next.”
