Editor’s note: Doug Burger, a processor architect by training, is a Microsoft researcher focused on disrupting the very fabric of datacenter processing power in a mobile-first, cloud-first world.
I’m excited to highlight a breakthrough in high-performance machine learning from Microsoft researchers.
Before describing our results, some background may be helpful. The high-level architecture of datacenter servers has been generally stable for many years, based on some combination of CPUs, DRAM, Ethernet, and disks (with solid-state drives a more recent addition). While the capacities and speeds of the components—and the datacenter scale—have grown, the basic server architecture has evolved slowly. This slow evolution is likely to change, however, as the decelerating gains from silicon scaling are opening the door to more radical changes in datacenter architecture.
The opportunities for exciting changes will only grow when Moore’s Law ends and the industry experiences successively larger waves of disruption. My personal view is that the end of Moore’s Law is less than one product design cycle away, perhaps three years. In 2011, we started a project (Catapult) to begin migrating key portions of our software cloud services onto programmable hardware (i.e., FPGAs), hoping that such a platform would allow cloud service performance to keep improving, once silicon scaling hits a wall, by migrating successively larger portions from software into programmable hardware. It took us three iterations of prototyping our architecture (building custom boards each time) to find one that worked across our cloud.
As many of you know, last June we unveiled our Catapult platform at ISCA 2014, showing that we successfully accelerated Bing’s web search ranking algorithms with a novel FPGA fabric running across 1,632 servers in one of our datacenters. This design placed one Microsoft-designed FPGA board in each server, and tightly coupled the FPGAs within a rack using a specialized, low-latency network arranged in a 6×8 2-D torus. The platform enabled Bing to run web ranking with roughly half the number of servers previously required. As a result, Bing subsequently announced that it will take Catapult acceleration to production later this year.
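For readers unfamiliar with the topology: a 2-D torus links each node to its four nearest neighbors, with the edges of the grid wrapping around. The minimal Python sketch below illustrates neighbor lookup in a 6×8 torus; it is an illustration of the topology only, not the actual Catapult routing logic, and the coordinate scheme is an assumption for the example.

```python
# Illustrative neighbor lookup in a 6x8 2-D torus (NOT the Catapult
# routing implementation; the coordinate scheme is hypothetical).
# Each FPGA at grid position (x, y) links to four neighbors, with
# edges wrapping around both axes.

ROWS, COLS = 6, 8  # 48 FPGAs per rack

def torus_neighbors(x: int, y: int) -> list[tuple[int, int]]:
    """Return the coordinates of the four torus neighbors of (x, y)."""
    return [
        ((x - 1) % ROWS, y),  # wraps from row 0 back to row 5
        ((x + 1) % ROWS, y),
        (x, (y - 1) % COLS),  # wraps from column 0 back to column 7
        (x, (y + 1) % COLS),
    ]

# Even a corner node has exactly four neighbors thanks to wraparound:
print(torus_neighbors(0, 0))  # [(5, 0), (1, 0), (0, 7), (0, 1)]
```

The wraparound links bound the worst-case path at 3 + 4 = 7 hops for a 6×8 grid, which helps keep rack-level FPGA-to-FPGA latency low and predictable.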
Since then, in addition to our Bing production efforts, our team within Microsoft Research has been working on accelerating a number of other key strategic workloads for the company using reconfigurable logic. Microsoft’s efforts in machine learning have created exciting new capabilities for our products and customers, including Bing, Cortana, OneDrive, Skype Translator, and Microsoft Band, to name a few. For instance, convolutional neural network (CNN) based techniques have been widely used by our colleagues to push the boundaries of computer vision processing, such as image recognition. Today, I’m delighted to highlight an exciting white paper we just released. The paper describes work by Eric Chung and his MSR colleagues on an ultra-efficient CNN accelerator.
Eric and the team (including Kalin Ovtcharov, Olatunji Ruwase, Joo-Young Kim, Jeremy Fowers, and Karin Strauss) hand-crafted a CNN design in reconfigurable logic using a Stratix-V FPGA. On well-known image classification tests such as ImageNet 1K and ImageNet 22K, we exceeded the performance of previous FPGA designs roughly threefold (classifying at rates of 134 and 91 images/second, respectively). Additionally, we provided a significant boost in images/joule over medium- and high-end GPUs optimized for the same problem. This result enables our datacenter servers to offer image classification at lower cost and higher energy efficiency than medium- to high-end GPUs can provide.
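The images/joule metric is simply throughput divided by sustained power draw (one joule is one watt-second). The sketch below works through the arithmetic; note that the wattage figures are hypothetical placeholders for illustration, not measurements from the white paper.

```python
# Worked example of the images/joule metric. The ImageNet 1K throughput
# (134 images/sec) comes from the text; the power figures below are
# HYPOTHETICAL placeholders, not numbers from the white paper.

def images_per_joule(images_per_sec: float, watts: float) -> float:
    """images/J = (images/s) / (J/s) = (images/s) / W."""
    return images_per_sec / watts

fpga_eff = images_per_joule(134.0, 25.0)   # assumed ~25 W FPGA board
gpu_eff  = images_per_joule(134.0, 235.0)  # assumed ~235 W GPU, same rate

print(f"FPGA: {fpga_eff:.2f} images/J")  # ~5.36 images/J
print(f"GPU:  {gpu_eff:.2f} images/J")   # ~0.57 images/J
```

Under these assumed numbers, matching a GPU’s throughput at roughly a tenth of the power is what yields an order-of-magnitude advantage in images/joule.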
We are currently mapping our engine to Altera’s new Arria 10 FPGA. The Arria 10 is significant because it has hardened support for floating-point operations and can sustain over a teraflop with high energy efficiency; Altera estimates that its floating-point throughput will deliver 3X the energy efficiency of a comparable GPU. We expect substantial performance and efficiency gains from scaling our CNN engine to Arria 10, conservatively estimated at a 70% throughput increase at comparable energy. We thus anticipate that the new Arria 10 parts will enable an even higher level of efficiency and performance for image classification within Microsoft’s datacenter infrastructure.
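As a back-of-the-envelope check, applying the conservative 70% estimate to the Stratix-V throughput figures quoted above gives the projected Arria 10 rates; since energy is expected to be comparable, images/joule would scale by roughly the same factor.

```python
# Back-of-the-envelope projection for the Arria 10 port, using the
# conservative 70% throughput-increase estimate from the text and the
# Stratix-V baselines quoted above.

SPEEDUP = 1.70  # conservative estimate

baselines = {"ImageNet 1K": 134.0, "ImageNet 22K": 91.0}  # images/sec

for workload, rate in baselines.items():
    print(f"{workload}: {rate:.0f} -> ~{rate * SPEEDUP:.0f} images/sec")
# ImageNet 1K: 134 -> ~228 images/sec
# ImageNet 22K: 91 -> ~155 images/sec
```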