Bringing Microsoft’s commerce platform to Microsoft Azure

Male employee holding a tablet with both hands. He is standing in front of a metal warehouse rack filled with packages.
Microsoft’s system for verifying and recording transactions received a major boost in performance and reliability after migrating to Microsoft Azure.

For almost 20 years, Microsoft’s Commerce Transaction Platform (CTP) processed online payments through an on-premises environment, verifying that all transactions had been processed, sales had been finalized, and revenue had been reported. Commerce & Ecosystems (C&E), the team that manages the CTP, had an important question to answer: should they continue refreshing and building out the on-premises infrastructure, or take the big step towards digital transformation and migrate the platform into Microsoft Azure?

The decision was made to bring CTP into the cloud, a change that meant we at Microsoft would see better performance, improved reliability, new monitoring capabilities, and an ability to scale in a cost-optimized way.


Microsoft’s eCommerce runs on our CTP

In order for our online business to grow, the CTP has to be available and reliably processing requests.

Most purchases of Microsoft Azure, Microsoft Office 365, Microsoft Dynamics 365, and several other consumer and commercial services, are powered by the CTP. If the platform goes offline, revenue loss is measured in thousands of dollars per second. In addition to recording online orders, the system is responsible for billing subscriptions.

Like most on-premises environments, our CTP follows a traditional refresh cycle, typically driven by warranty lifecycles. As machines and hardware are due to fall out of warranty, the C&E team evaluates their infrastructure, projects future needs, and researches replacement options before systematically changing out the machines. This refresh cycle takes a minimum of six months, with the C&E team being careful not to disturb or disrupt the commerce platform.

In keeping up with the processing and storage needs of the CTP, C&E ends up purchasing bigger, faster, and more expensive hardware with each refresh cycle. The CTP runs on over 700 machines and stores over six petabytes of data in over 100 databases, relying heavily on the use of the Microsoft Distributed Transaction Coordinator (MSDTC), which is responsible for coordinating transactions across multiple resources. This makes replacement a major task. However, each refresh is also an opportunity to identify a better path forward.

A diagram showing the relationship of active and passive data centers which store and verify transactions.
Our CTP includes a network of storage devices to record and verify transactions. This improves response time and availability of data, but also made it difficult to move away from an on-premises environment.

Time to move to the cloud

When C&E was considering Microsoft Azure, it was already very popular at an enterprise level. Microsoft Azure is highly robust, offers more flexibility and more computing options, and would have a lower maintenance cost for C&E. The team also had a vocal cadre of engineers throwing support behind the cloud platform, all of whom were eager to work on the latest technology.

Scaling was also on the table. In the on-premises environment, C&E had been required to procure enough machines to handle high volume surges, even though this capacity was an intermittent need. This meant a large number of physical machines would need to be procured to accommodate occasional spikes, only to remain dormant during low-traffic periods. Unlike an on-premises environment, Microsoft Azure can spin machines up and down as needed. This cost-efficient method for balancing out high and low system volume also meant C&E could procure and decommission virtual machines (VMs) in a matter of minutes, not months.
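The difference between provisioning for peak load and scaling on demand can be made concrete with a small calculation. The sketch below is illustrative only: the hourly demand profile, the per-machine capacity, and the function names are assumptions made for this example, not figures or tooling from C&E.

```python
import math


def machines_needed(requests_per_hour: int, capacity_per_machine: int) -> int:
    """Smallest number of machines that can serve the given hourly load."""
    return math.ceil(requests_per_hour / capacity_per_machine)


def machine_hours(demand: list[int], capacity_per_machine: int, elastic: bool) -> int:
    """Total machine-hours consumed over the demand profile.

    elastic=False models on-premises provisioning: enough machines for the
    peak hour stay powered for every hour.
    elastic=True models scaling up and down to match each hour's load.
    """
    per_hour = [machines_needed(d, capacity_per_machine) for d in demand]
    if elastic:
        return sum(per_hour)
    return max(per_hour) * len(demand)


# Hypothetical day: 22 quiet hours and a 2-hour surge at ten times the volume.
demand = [1_000] * 22 + [10_000] * 2
capacity = 500  # hypothetical requests per hour one machine can handle

fixed_hours = machine_hours(demand, capacity, elastic=False)   # 20 machines * 24 h = 480
elastic_hours = machine_hours(demand, capacity, elastic=True)  # 2 * 22 + 20 * 2 = 84

print(f"peak-provisioned: {fixed_hours} machine-hours")
print(f"scaled on demand: {elastic_hours} machine-hours")
```

With this made-up profile, the peak-provisioned fleet spends most of its machine-hours idle, which is the pattern the on-premises environment forced on the team.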

Considering these factors, the cost-benefit of renewing the on-premises machines versus moving to Microsoft Azure reached a tipping point, and Microsoft Azure was coming out on top.

Finding the right infrastructure for our CTP

For the migration to Microsoft Azure to be successful, C&E would need the cloud service to match or exceed the growing performance and storage needs of the transaction platform. This meant carefully examining the options available to the team, trying to identify the right approach while still being cost-aware.

Because of the need to scale up performance, the demand on Microsoft Azure machines would be high. Several brand-new virtual machine series had just been released, and they met those performance requirements, but C&E was reluctant to be one of the first customers. Time was not on their side, however; the warranties for CTP’s on-premises machines would be expiring soon. In the end, moving to Microsoft Azure was more important and C&E decided to act.

PaaS or IaaS?

Before C&E could migrate the CTP to Microsoft Azure, they had an important tech decision to make: would they use Platform-as-a-Service (PaaS) or Infrastructure-as-a-Service (IaaS)?

PaaS was the preferred option, especially after doing a feature analysis. In PaaS, our CTP would have more flexibility and an easy environment to operate in. Additionally, PaaS would require less maintenance, making it an improvement over the on-premises infrastructure.

But some of the legacy services needed for CTP to process transactions didn’t easily fit into PaaS. The team had to think through how our CTP’s specific needs would be addressed.

  • The CTP uses availability groups for providing high availability services
  • Transactional replication separates the front-end load from the back-end load
  • MSDTC provides consistency for transactions spanning across multiple databases
  • Some database instances are bigger than 30 terabytes

This pushed C&E towards IaaS, which was closer to the on-premises environment. With IaaS, the team could have more direct control over operating systems and utilize native SQL features to support our CTP. This also meant there would be more moving pieces to manage.

Having settled on IaaS, C&E began evaluating the various performance needs across CTP’s different functions. With this information in hand, the team could begin work on finding the right service tier for their needs. Several options existed, but the primary candidates were:

  • Microsoft Azure SQL – Managed Instance
  • Microsoft Azure SQL – Virtual Machines
  • Microsoft Azure SQL – Hyperscale

In the end, C&E decided on Microsoft Azure SQL – Virtual Machines. With huge processing needs and a major requirement for scaling, Microsoft Azure SQL – Virtual Machines proved to be the best fit.

Selecting the right machine

C&E would need high input/output operations per second (IOPS) and a fair amount of network throughput to support the CTP’s more demanding components.

The team immediately began testing different virtual machines against the on-premises environment, evaluating how each option performed compared to physical machines. Three tests were conducted, with each scenario representing a different process demand our CTP might require.

The tests sent different workloads through the virtual machines, starting with small block sizes, moving towards progressively larger requests. This benchmark helped C&E determine that the largest virtual machine, the M-series, would be needed to sustain some of their performance needs. However, the M-series was over-capacity for some of our CTP’s processes, making it an irresponsible choice.

Fortunately, being in IaaS gave them flexibility, allowing C&E to assign different processes to the appropriate virtual machine. The M-series would be used for anything that required high IOPS and throughput; the rest of CTP could function on the E-series.
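One way to picture this assignment is as a lookup that gives each workload the least expensive series that still meets its IOPS and throughput requirements. The following is a minimal sketch of that idea, not C&E’s actual tooling; the capacity figures, relative costs, and workload names are hypothetical placeholders rather than published Azure specifications or benchmark results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VmSeries:
    name: str
    max_iops: int
    max_throughput_mbps: int
    relative_cost: float  # hypothetical cost index, not a real price


@dataclass(frozen=True)
class Workload:
    name: str
    required_iops: int
    required_throughput_mbps: int


def pick_series(workload: Workload, candidates: list[VmSeries]) -> VmSeries:
    """Return the cheapest series that satisfies both requirements."""
    fitting = [
        s for s in candidates
        if s.max_iops >= workload.required_iops
        and s.max_throughput_mbps >= workload.required_throughput_mbps
    ]
    if not fitting:
        raise ValueError(f"no series can host {workload.name}")
    return min(fitting, key=lambda s: s.relative_cost)


# Placeholder numbers chosen only to illustrate the selection logic.
series = [
    VmSeries("E-series", max_iops=80_000, max_throughput_mbps=1_200, relative_cost=1.0),
    VmSeries("M-series", max_iops=160_000, max_throughput_mbps=2_000, relative_cost=3.0),
]
workloads = [
    Workload("backend_batch", required_iops=120_000, required_throughput_mbps=1_500),
    Workload("online_services", required_iops=30_000, required_throughput_mbps=400),
]

assignment = {w.name: pick_series(w, series).name for w in workloads}
print(assignment)
```

Choosing the cheapest fitting series, rather than the largest one, reflects the concern that the M-series was over-capacity for much of the platform.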

Two tables, one breaking down Microsoft Azure VM specifications, another showing how machines performed in three benchmarking scenarios.
Microsoft Azure presented several different options for C&E. In order to meet all of our CTP’s performance requirements, the team performed several tests against usage scenarios.

Storing data in Microsoft Azure

C&E had been using a storage area network (SAN) infrastructure for storage. This hardware network is expensive to purchase and replace, but the benefit is high performance specifications—such as less than a millisecond response time—and improved availability. The team needed an equivalent in Microsoft Azure and narrowed it down to two candidates: ultra disks or premium SSDs. The ultra disks were the fastest option, and closely resembled a SAN, but were far more expensive. After testing, however, the premium SSDs matched the patterns of C&E’s existing SAN.

Dedicated to the cloud

Before migration could begin, C&E had to determine if CTP would run on a dedicated host or on isolated virtual machines. In the end, they used both.

The M-series machines were needed for meeting a few core processing and throughput functions. However, they are only available as isolated virtual machines, which limited the types of machines available. Since the M-series were over-capacity for most of our CTP’s needs, C&E had to come up with a blended approach.

By running their virtual machines on a single-tenant server, Microsoft Azure Dedicated Host (ADH), C&E could mix and match the size of their virtual machines, a necessity for the custom virtual machine arrangement.

Being on ADH also posed some challenges. The C&E team would need to patch their own systems to align with their storage management approach. It also meant the team would have to select which regions and availability zones to deploy to. Fortunately, C&E understood how to configure the CTP in ADH without giving up availability or performance.

Splitting the platform between isolated and ADH, C&E could easily set up and manage the environment correctly, using the M-series to handle some of the high processing functions and a mix of machines on ADH for CTP’s other operating needs.

Managing a hybrid migration

With a new Microsoft Azure-based infrastructure in mind, C&E was able to begin work on moving our CTP over to the cloud.

C&E would take a hybrid approach to migration, relying on SQL to create a seamless transition between on-premises and Microsoft Azure. This side-by-side approach meant C&E could gradually shift our CTP away from a legacy environment without disturbing the growing business. It also enabled the team to compare customer experiences between the two environments and verify that Microsoft Azure was giving users the same results. This careful approach allowed C&E to strategically shut down on-premises machines and let the cloud take ownership of transaction processes.

Divide and conquer

Four primary components make up the CTP:

  • Online services. A layer exists between the UI/UX and the rest of the CTP. When a user clicks through their purchase, our CTP’s online systems interpret this signal as an input to verify the transaction.
  • Backend processing. The system responsible for handling subscription renewals in batches. This backend system triggers on a billing date, not a user input, to begin verification.
  • Data storage. Every transaction needs to be recorded somewhere. C&E relied on a powerful SAN with a very fast response time to reliably record transactions.
  • Revenue recognition. Without a way to recognize a transaction as reported revenue, our entire CTP fails. The revenue recognition system supports that crucial process.

In segmenting our CTP into four operational segments, C&E was able to develop their migration strategy around systematically moving the platform into Microsoft Azure one component at a time. It also allowed the team to evaluate different performance needs, configuring their new environment to meet or exceed requirements. Each on-premises component was mapped to a suitable corresponding service in Microsoft Azure.

Making the move

Migrating our CTP’s four components required careful coordination, with the difficulty varying based on visibility and the ease with which the feature could be tested. A few of the services had shared infrastructure and overlapping components, which helped ease the transfer from on-premises to the cloud. Online services and revenue recognition, for example, were straightforward lift-and-shifts that were easy to test, as the team had immediate feedback if something wasn’t working.

Before shutting down the on-premises components related to backend processing, C&E had to fully mimic what was happening in a pre-production environment. A verification path was built between the two, which allowed C&E to slowly move backend processing jobs from on-premises to Microsoft Azure. It was a simple process for C&E, but they were rigorous with testing and examination to ensure no side effects went unnoticed during the move. This ultimately led to revamping the validation infrastructure to be suitable for Microsoft Azure.
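A verification path of this kind boils down to running the same jobs in both environments and comparing the results record by record. The sketch below shows one simple way such a comparison could look; the function name, the transaction IDs, and the use of integer cents as the compared value are assumptions for illustration and do not describe C&E’s validation infrastructure.

```python
def compare_runs(on_prem: dict[str, int], azure: dict[str, int]) -> dict[str, list[str]]:
    """Compare per-transaction results produced by the two environments.

    Keys are transaction IDs; values are billed amounts in cents.
    """
    shared = on_prem.keys() & azure.keys()
    return {
        "missing_in_azure": sorted(on_prem.keys() - azure.keys()),
        "extra_in_azure": sorted(azure.keys() - on_prem.keys()),
        "mismatched": sorted(k for k in shared if on_prem[k] != azure[k]),
    }


# Hypothetical results from the same billing batch run in both environments.
on_prem_results = {"txn-001": 1999, "txn-002": 500, "txn-003": 1200}
azure_results = {"txn-001": 1999, "txn-002": 505, "txn-004": 100}

report = compare_runs(on_prem_results, azure_results)
print(report)

# A job would only be handed over once its report is empty.
safe_to_move = not any(report.values())
print("safe to move:", safe_to_move)
```

Reporting missing, extra, and mismatched records separately makes it easier to tell a dropped job from a calculation difference.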

Over six petabytes of data needed to be moved from on-premises SQL servers to the cloud. This was achieved by adding Microsoft Azure IaaS machines as secondary to existing SQL AlwaysOn Availability Group clusters, then migrating over components one by one. Initially, Microsoft Azure served as the primary during normal online transaction processing (OLTP) traffic, but once C&E was confident in the migration, Microsoft Azure became primary for backend job processing tasks requiring higher CPU, disk IOPS, and throughput as well.
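The order of operations matters here: the Azure replica joins as a secondary, catches up, and only then takes over as primary before the on-premises replica is retired. The toy model below walks through that sequence under stated assumptions. It is a plain Python simulation, not SQL Server or Azure API code, and the class, method, and replica names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    name: str
    role: str  # "primary" or "secondary"
    synchronized: bool = False


class AvailabilityGroup:
    """Toy model of an availability group with exactly one primary."""

    def __init__(self, primary_name: str) -> None:
        self.replicas = {primary_name: Replica(primary_name, "primary", True)}

    @property
    def primary(self) -> Replica:
        return next(r for r in self.replicas.values() if r.role == "primary")

    def add_secondary(self, name: str) -> None:
        # A new replica starts out behind the primary.
        self.replicas[name] = Replica(name, "secondary", False)

    def mark_synchronized(self, name: str) -> None:
        self.replicas[name].synchronized = True

    def fail_over_to(self, name: str) -> None:
        target = self.replicas[name]
        if not target.synchronized:
            raise RuntimeError(f"{name} is not synchronized; refusing to fail over")
        self.primary.role = "secondary"
        target.role = "primary"

    def remove(self, name: str) -> None:
        if self.replicas[name].role == "primary":
            raise RuntimeError("cannot remove the primary replica")
        del self.replicas[name]


ag = AvailabilityGroup("onprem-sql-01")   # hypothetical replica names
ag.add_secondary("azure-sql-01")          # step 1: Azure joins as a secondary

try:
    ag.fail_over_to("azure-sql-01")       # too early: data has not caught up
except RuntimeError as err:
    print("blocked:", err)

ag.mark_synchronized("azure-sql-01")      # step 2: replica catches up
ag.fail_over_to("azure-sql-01")           # step 3: Azure becomes primary
ag.remove("onprem-sql-01")                # step 4: retire the on-premises replica

print(ag.primary.name, list(ag.replicas))
```

Guarding the failover on synchronization state mirrors the caution described above: nothing is switched over until the cloud copy is known to be current.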

Microsoft Azure Resource Manager (ARM) templates were used to carefully control how objects moved from on-premises to the cloud. ARM enabled the team to easily provision, modify, and delete resources. We also copied backups from on-premises to Microsoft Azure, restoring them within virtual machines to establish a data sync. This enabled a seamless failover approach. When it was time to turn off the on-premises systems, C&E was confident that they had made the right decision.
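An ARM template is a JSON document that declares the resources to deploy, so it can be generated and reviewed like any other structured data. The sketch below builds a skeleton template for a single virtual machine. It is deliberately incomplete and not deployable as written: a real virtual machine resource also needs storage, OS, and network profiles. The VM size, the API version, and the function name are illustrative assumptions, not values taken from C&E’s templates.

```python
import json


def build_vm_template(default_vm_size: str) -> dict:
    """Return a skeleton ARM template that declares one virtual machine."""
    return {
        "$schema": "https://schema.management.azure.com/schemas/2019-04-01/deploymentTemplate.json#",
        "contentVersion": "1.0.0.0",
        "parameters": {
            "vmName": {"type": "string"},
            "vmSize": {"type": "string", "defaultValue": default_vm_size},
        },
        "resources": [
            {
                "type": "Microsoft.Compute/virtualMachines",
                # Assumed API version; check the current one before use.
                "apiVersion": "2023-03-01",
                "name": "[parameters('vmName')]",
                "location": "[resourceGroup().location]",
                "properties": {
                    "hardwareProfile": {"vmSize": "[parameters('vmSize')]"},
                    # storageProfile, osProfile, and networkProfile are
                    # omitted here to keep the sketch short.
                },
            }
        ],
    }


# "Standard_E64s_v3" is only an example size, not the one C&E selected.
template = build_vm_template("Standard_E64s_v3")
template_json = json.dumps(template, indent=2)
print(template_json)
```

Keeping the VM size as a template parameter means the same definition can be reused for differently sized machines by changing one value at deployment time.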

Life in the cloud

For almost 20 years, C&E had relied on a variety of on-premises machines to run our CTP. If something went wrong, there was someone in their organization dedicated to solving the problem. By moving to Microsoft Azure, C&E doesn’t need to dedicate time and resources to troubleshooting; the cloud team does that for them. Still, the paradigm shift took some time to get used to.

Pressure testing in Microsoft Azure demonstrated that there was no data loss or inaccurate figures when users engaged the new cloud-based infrastructure. The Microsoft Azure team was responsive, engaging closely with C&E, carefully scrutinizing details to make sure the migration was a success.

The complexity of a hybrid environment was ultimately the biggest challenge, but it was a requirement of the migration. Now that our CTP is in Microsoft Azure, those issues are a thing of the past.

Cost-efficiencies for CTP

In addition to offloading infrastructure management costs to the Microsoft Azure team, C&E has realized savings through system optimization and the elimination of hardware maintenance. On-premises servers and hardware continue to increase in price, but that burden has been offloaded.

Additionally, C&E is seeing savings in operational costs. While the team initially opted out of some upfront savings to get a better system, they’re finding ways to optimize processes to introduce cost savings.

It’s also important to note that services available through the native platform have reduced C&E’s dependence on third-party platforms.

Better performance and reliability

With the on-premises environment, C&E would experience a few issues each month. Automation and self-healing functions inside Microsoft Azure have reduced the frequency of disruptions significantly.

Microsoft Azure’s strong SLAs have better guarantees than C&E’s on-premises equivalent, giving our CTP a reliable foundation to operate on. The platform also benefits from improved monitoring capabilities made available through Microsoft Azure, giving the team greater visibility into what’s happening inside our CTP.

New features are on their way

Thanks to the native service features available to Microsoft Azure, C&E now has access to new features that can be quickly deployed, working right out of the box.

A path to PaaS

While the initial migration required C&E to utilize IaaS, the seamlessness of SQL means that our CTP can eventually be moved to a PaaS environment, as the team initially envisioned. This will introduce more flexibility, giving the team an even easier service to manage.

An easy way to scale

C&E can now double or reduce the number of virtual machines in a matter of minutes. Not only does this speed their response to high volume loads, it does so in a cost-optimized way.

Importantly, the C&E team didn’t need to downscale their needs. Microsoft Azure matched them.

Key takeaways

In the end, C&E was able to secure better performance, improved reliability, scalability, and much more for our CTP by migrating to Microsoft Azure. A refresh cycle used to take six months or more; now it’s a matter of weeks. Decisions can be made quickly and with confidence, as the native environment allows features to work out of the box. As our CTP becomes more optimized within Microsoft Azure, new savings will be uncovered along with more performance opportunities.
