The Code That No One in the Cloud Can Live Without

Posted by Rob Knies

Parikshit Gopalan, Jin Li, Sergey Yekhanin, and Cheng Huang

A couple of years ago, a few Microsoft researchers published a couple of interesting papers on storage efficiencies. Now, with breathtaking speed, the concepts in those papers have been embraced across the cloud-computing world.

Technological change can occur at lightning speed. Parikshit Gopalan, Cheng Huang, and Sergey Yekhanin can testify to that.

In November 2012, Gopalan, Huang, and Yekhanin, along with Huseyin Simitci of Windows Azure Storage (now Microsoft Azure Storage), had their paper On the Locality of Codeword Symbols published in IEEE Transactions on Information Theory.

Erasure-Coding Theory Paper Gains Acclaim

During ISIT 2014, the IEEE International Symposium on Information Theory, being held June 29-July 4 in Honolulu, the authors of that paper received the IEEE Communications Society & Information Theory Society Joint Paper Award. The honor goes to outstanding papers published in a publication of the Communications Society or the Information Theory Society within the previous three calendar years.

The winning paper is an in-depth theoretical study of relations between code parameters needed for data-storage applications. Erasure Coding in Windows Azure Storage (an earlier systems paper written by Microsoft’s Huang, Simitci, Yikang Xu, Aaron Ogus, Brad Calder, Gopalan, Jin Li, and Yekhanin) pointed the way for a new method to achieve more efficient storage in the cloud. It, too, garnered plenty of attention, winning a best-paper award during the 2012 USENIX Annual Technical Conference. The co-authors of the Erasure Coding paper also earned a Microsoft Technical Community Network Storage Technical Achievement Award in 2013 for outstanding achievement and contribution to Microsoft software technology.

Code Requirements

“The project started with Cheng and Jin Li having the idea that Azure might benefit from adopting some new kind of erasure codes,” Yekhanin says. “Parikshit and I joined the team. Together, we developed an abstract mathematical framework that captures the requirements for codes that arise in distributed storage applications.

“The key difference from the classical coding-theory setup is that here, we want codes that provide ‘locality’: the ability to recover lost data quickly in typical failure scenarios. We designed codes, later adopted by Azure, and also proved that our codes are optimal in a certain strict mathematical sense.”

Gopalan, who will be traveling to Honolulu to accept the Joint Paper Award, provides a bit more detail.

“This is very much a theory paper,” he says. “Its main contribution is a lower bound, saying that any code with certain properties needs to have a certain length. But it arose from a very practical setting: We were trying to show that a particular code we had suggested to Azure was optimal.

“It is gratifying that a lower bound—a result saying that you cannot do any better than something—should have such practical significance. The results there turn out to have real predictive power in telling us what properties are achievable by code and what are not. It certainly helped in our interactions with product groups to be able to say that our constructions are provably optimal.”
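For readers who want the shape of that lower bound, it can be written in one line. The notation here is an assumption of this note rather than something the article defines: n is the total number of stored symbols, k the number of data symbols, d the minimum distance (any d - 1 lost symbols can be recovered), and r the locality (each data symbol can be rebuilt from at most r other symbols).

```latex
\[
  n - k \;\ge\; \left\lceil \frac{k}{r} \right\rceil + d - 2
\]
```

As a worked check, take the configuration described in the Azure systems paper: 12 data fragments, 2 local parities, and 2 global parities, so n = 16, k = 12, and r = 6, with any 3 failures recoverable (d = 4). The right-hand side is 12/6 + 4 - 2 = 4, which equals n - k = 4, so that code uses no more redundancy than the bound requires.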

Local Reconstruction Codes

Erasure coding, a powerful math tool that reduces the space required to store data, relies on shortened descriptions of data for reassembly and delivery to users. Local Reconstruction Codes (LRC) enable quicker data reconstructions, and the result is reduced time and costs for data retrieval.
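The idea of local reconstruction can be illustrated with a minimal Python sketch. It is a toy, not the construction used in Azure: it keeps only one XOR parity per group of data fragments and omits the global parities that a real LRC adds to survive multiple failures. The function names and the group size are hypothetical, chosen for illustration.

```python
from functools import reduce


def xor_bytes(blocks):
    """XOR a list of equal-length byte strings together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def encode_local_parities(fragments, group_size):
    """Return one XOR parity per consecutive group of data fragments."""
    groups = [fragments[i:i + group_size]
              for i in range(0, len(fragments), group_size)]
    return [xor_bytes(group) for group in groups]


def repair_fragment(fragments, local_parities, lost_index, group_size):
    """Rebuild one lost fragment by reading only its own group.

    Returns the rebuilt bytes and the number of fragments read.
    The lost fragment itself is never read.
    """
    group = lost_index // group_size
    start = group * group_size
    end = min(start + group_size, len(fragments))
    survivors = [fragments[i] for i in range(start, end) if i != lost_index]
    reads = survivors + [local_parities[group]]
    return xor_bytes(reads), len(reads)


# Six 4-byte data fragments, split into two local groups of three.
data = [bytes([i] * 4) for i in range(1, 7)]
parities = encode_local_parities(data, group_size=3)
rebuilt, reads = repair_fragment(data, parities, lost_index=4, group_size=3)
```

In this sketch, repairing one fragment reads 3 fragments (two surviving group members plus the group parity) rather than all 6 data fragments, which is the kind of saving that makes reconstruction quicker in the common single-failure case.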

Local Reconstruction Codes have been adopted throughout all Microsoft storage production lines, from the cloud to enterprise and the desktop. LRC was first deployed in Azure Storage in 2012, to great acclaim. In 2013, LRC also shipped with Windows Server 2012 R2 and Windows 8.1.

Huang cites great collaboration with partners from Windows storage teams.

“We are very lucky to work with fantastic business partners, the Azure Storage team and the Windows Storage Spaces team. LRC wouldn’t have gone anywhere without them taking a leap of faith and making their contributions.”

At first, erasure coding was a solution for a problem that didn’t exist. But when it eventually did …

“It took a long time to bear fruit, from research to production,” Huang says. “When we started exploring this direction and published earlier papers in 2007, there was very little interest from business groups. Literally, every team we talked to told us that disks were getting bigger and cheaper every day. Redundancy could be easily achieved with replication, and there was no need to bother with erasure coding.

“It is very telling to see how cloud computing has completely turned the world around in several years. Now, the industry is at a point that no one in the cloud business—not only Microsoft, but also Amazon, Facebook, Google, and others—can be competitive or even survive without erasure coding.”
