February 19, 2018 - February 20, 2018

Swiss Joint Research Center Workshop 2018

Location: Lausanne, Switzerland

  • PI: Gustavo Alonso, ETH Zurich; Co-PI: Ken Eguro, Microsoft Research

    While in the first phase of the project we explored the efficient implementation of data processing operators on FPGAs, as well as the architectural issues involved in integrating FPGAs as co-processors in commodity servers, in this new proposal we intend to focus on architectural aspects of in-network data processing. The choice is motivated by the growing gap between the high bandwidth and very low latency that modern networks support and the overhead of ingress to and egress from VMs and applications running on conventional CPUs. A first goal is to explore the types of problems and algorithms that are best run as data flows through the network, so as to exploit bare wire speed and allow expensive computations to be off-loaded to the FPGA. A second, no less important, goal is to explore how best to operate FPGA-based accelerators that are directly connected to the network and operate independently from the software part of the application. In terms of applications, the focus will remain on data processing (relational, NoSQL, data warehouses, etc.), with the intention of starting to move towards machine learning algorithms at the end of the two-year project. On the network side, the project will work on developing networking protocols suitable for this new configuration and on combining the network stack with the data processing stack.
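
    As a purely illustrative sketch (our toy model, not the project's design), the following Python snippet contrasts host-side filtering, where every tuple crosses into the VM, with an in-network filter that discards non-matching tuples before they reach the host; the tuple format and predicate are invented for the example.

      def host_side(tuples, predicate):
          # Conventional path: every tuple crosses the network into the VM,
          # and filtering happens late, on the CPU.
          delivered = list(tuples)
          return [t for t in delivered if predicate(t)], len(delivered)

      def in_network(tuples, predicate):
          # In-network path: the predicate is evaluated on the wire (e.g.,
          # by an FPGA on the NIC path); only matches reach the host.
          delivered = [t for t in tuples if predicate(t)]
          return delivered, len(delivered)

      tuples = [{"key": k, "value": k * 7 % 100} for k in range(10_000)]
      pred = lambda t: t["value"] < 10          # 10% selectivity

      _, host_ingress = host_side(tuples, pred)
      _, net_ingress = in_network(tuples, pred)
      print(f"tuples entering the VM: host={host_ingress}, in-network={net_ingress}")
      # tuples entering the VM: host=10000, in-network=1000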

  • PI: Otmar Hilliges, ETH Zurich; Co-PI: Marc Pollefeys, Microsoft and ETH Zurich

    Micro-aerial vehicles (MAVs) have been made accessible to end-users by the emergence of simple-to-use hardware and programmable software platforms, and have seen a surge in consumer and research interest as a consequence. Clearly there is a desire to use such platforms in a variety of application scenarios, but manually flying quadcopters remains a surprisingly hard task even for expert users. More importantly, state-of-the-art technologies offer only very limited support for users who want to employ MAVs to reach a high-level goal. This is perhaps best illustrated by the currently most successful application area, aerial videography. While manual flight is hard, piloting and controlling a camera simultaneously is practically impossible. An alternative to manual control is waypoint-based control of MAVs, which shields novices from the underlying complexities. However, this simplicity comes at the cost of flexibility, and existing flight-planning tools are not designed with high-level user goals in mind.

    Building on our own (MSR JRC funded) prior work, we propose an alternative approach to robotic motion planning. The key idea is to let the user work in the solution space: instead of defining trajectories, the user defines what the resulting output should be (e.g., shot composition, transitions, area to reconstruct). We propose an optimization-based approach that takes such high-level goals as input and automatically generates the trajectories and control inputs for a gimbal-mounted camera. We call this solution-space driven, inverse kinematic motion planning. Defining the problem directly in the solution space removes several layers of indirection and allows users to operate in a more natural way, focusing only on the application-specific goals and the quality of the final result, while the control aspects are entirely hidden.
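
    As a rough illustration of what such a formulation could look like (the notation below is ours, not the proposal's), the planner can be viewed as solving a constrained trajectory optimization whose cost is expressed directly on the camera output:

      \min_{x(\cdot),\,u(\cdot)} \; \int_0^T c_{\mathrm{shot}}\!\big(\pi(x(t))\big)\,dt
      \quad \text{subject to} \quad \dot{x}(t) = f\big(x(t),u(t)\big),\;\;
      u(t) \in \mathcal{U},\;\; x(t) \in \mathcal{X}_{\mathrm{free}},

    where x(t) is the joint quadrotor-gimbal state, u(t) the control inputs, f the vehicle dynamics, \pi the projection of the state onto image-space quantities (framing, composition), c_shot the penalty for deviating from the user-specified shot, and \mathcal{X}_free the obstacle-free states. Solving for x and u jointly is what hides the control layer from the user.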

  • PIs: Thomas Hofmann and Aurélien Lucchi, ETH Zurich; Co-PI: Sebastian Nowozin, Microsoft Research

    The past decade has seen enormous growth in the application of big data and machine learning systems. Probabilistic models of data are theoretically well understood and in principle provide an optimal approach to inference and learning from data. However, for richly structured data domains such as natural language and images, probabilistic models are often computationally intractable and/or have to make strong conditional independence assumptions to retain computational as well as statistical efficiency. As a consequence, they are often inferior in predictive performance to current state-of-the-art deep learning approaches. It is natural to ask whether one can combine the benefits of deep learning with those of probabilistic models. The major conceptual challenge is to define deep models that are generative, i.e., models that can be thought of as capturing the underlying data-generating mechanism.

    We thus propose to leverage and extend recent advances in generative neural networks to build rich probabilistic models for structured domains such as text and images. Extending efficient probabilistic neural models will allow us to represent complex and multimodal uncertainty efficiently. To demonstrate the usefulness of the developed probabilistic neural models, we plan to apply them to challenging multimodal applications such as creating textual descriptions for images or database records.
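
    For concreteness, here is a minimal sketch of one popular family of generative neural models, a variational autoencoder, in PyTorch; the architecture, sizes, and dummy data are illustrative placeholders, not the models the project will develop.

      import torch
      import torch.nn as nn
      import torch.nn.functional as F

      class VAE(nn.Module):
          """Minimal VAE: the encoder maps data to a latent Gaussian, the
          decoder maps latent samples back to data space (sizes illustrative)."""
          def __init__(self, x_dim=784, h_dim=400, z_dim=20):
              super().__init__()
              self.enc = nn.Linear(x_dim, h_dim)
              self.mu = nn.Linear(h_dim, z_dim)
              self.logvar = nn.Linear(h_dim, z_dim)
              self.dec1 = nn.Linear(z_dim, h_dim)
              self.dec2 = nn.Linear(h_dim, x_dim)

          def forward(self, x):
              h = F.relu(self.enc(x))
              mu, logvar = self.mu(h), self.logvar(h)
              # Reparameterization trick keeps sampling differentiable.
              z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
              x_hat = torch.sigmoid(self.dec2(F.relu(self.dec1(z))))
              return x_hat, mu, logvar

      def elbo_loss(x, x_hat, mu, logvar):
          # Reconstruction term plus KL divergence to a standard normal prior.
          rec = F.binary_cross_entropy(x_hat, x, reduction="sum")
          kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
          return rec + kl

      model = VAE()
      x = torch.rand(64, 784)   # dummy batch standing in for image data
      x_hat, mu, logvar = model(x)
      print(elbo_loss(x, x_hat, mu, logvar).item())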

  • PIs: Onur Mutlu and Luca Benini, ETH Zurich; Co-PI: Derek Chiou, Microsoft

    Today’s systems are overwhelmingly designed to move data to computation. This design choice goes directly against key trends in systems and technology that cause performance, scalability and energy bottlenecks:

    • data access from memory is a key bottleneck as applications become more data-intensive while memory bandwidth and energy efficiency do not scale at the same rate,
    • energy consumption is a key constraint, especially in mobile and server systems,
    • data movement is very costly in terms of bandwidth, energy and latency, much more so than computation.

    Our goal is to comprehensively examine the premise of adaptively performing computation near where the data resides, when it makes sense to do so, in an implementable manner and considering multiple new memory technologies, including 3D-stacked memory and non-volatile memory (NVM). We will examine (1) practical hardware substrates and software interfaces to accelerate, in memory, key computational primitives of modern data-intensive applications, and (2) runtime and software techniques that can take advantage of such substrates and interfaces. Our special focus will be on key data-intensive applications, including deep learning, neural networks, graph processing, bioinformatics (DNA sequence analysis and assembly), and in-memory data stores. Our approach is software/hardware cooperative, breaking the barriers between the two and melding applications, systems and hardware substrates for extremely efficient execution, while still providing efficient interfaces for the software programmer.
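
    A back-of-the-envelope model (our own illustration with invented numbers) shows why near-data execution pays off for a selective scan: when the filter runs near memory, only matching rows cross the off-chip link.

      # Toy data-movement model for a selective scan (all numbers are
      # invented placeholders, not measurements of any real system).
      ROWS = 100_000_000        # rows in the scanned table
      ROW_BYTES = 32            # bytes per row
      SELECTIVITY = 0.02        # fraction of rows matching the predicate

      def bytes_moved_to_cpu():
          # Conventional system: every row crosses the memory bus to the CPU.
          return ROWS * ROW_BYTES

      def bytes_moved_near_memory():
          # Near-memory filter: only matching rows leave the memory stack.
          return int(ROWS * SELECTIVITY) * ROW_BYTES

      gb = 1 << 30
      print(f"to CPU:      {bytes_moved_to_cpu() / gb:.1f} GiB")
      print(f"near-memory: {bytes_moved_near_memory() / gb:.2f} GiB")
      # 3.0 GiB vs 0.06 GiB: a 50x cut in off-chip traffic at 2% selectivity.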

  • PI: Florin Dinu, EPFL; Co-PIs: Christos Gkantsidis and Sergey Legtchenko, Microsoft Research

    The goal of our project is to improve the utilization of server resources in data centers. Our approach is to attain a better understanding of the resource requirements of data-parallel applications and then to incorporate this understanding into the design of more informed and efficient data center (cluster) schedulers. While pursuing these directions we have identified two related challenges that we believe hold the key to significant additional improvements in both application performance and cluster-wide resource utilization, and we will explore them as a continuation of our project: resource inter-dependency and time-varying resource requirements. Resource inter-dependency refers to the impact that a change in the allocation of one server resource (memory, CPU, network bandwidth, disk bandwidth) to an application has on that application's need for the other resources. Time-varying resource requirements refers to the fact that an application's resource requirements may vary over its lifetime. Studying these two challenges together holds the potential for improving resource utilization by aggressively but safely co-locating applications on servers.
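
    A simplified sketch (ours, with invented profiles and capacities) of why time-varying requirements matter for co-location: two applications whose peaks do not coincide can safely share a server that a peak-based scheduler would consider over-committed.

      # Toy co-location check over time-varying resource profiles. A real
      # scheduler would also model resource inter-dependency (e.g., less
      # memory leading to more disk bandwidth).
      CAPACITY = {"cpu": 16, "mem_gb": 64}

      # Per-hour demand profiles for two applications (hypothetical numbers).
      app_a = {"cpu": [12, 4, 4, 12], "mem_gb": [40, 40, 20, 20]}
      app_b = {"cpu": [2, 10, 10, 2], "mem_gb": [16, 16, 30, 30]}

      def fits_together(a, b, cap):
          # Safe to co-locate iff summed demand fits capacity at every step.
          return all(a[r][t] + b[r][t] <= cap[r]
                     for r in cap for t in range(len(a[r])))

      def fits_by_peak(a, b, cap):
          # A peak-based scheduler compares the sum of per-app peaks instead.
          return all(max(a[r]) + max(b[r]) <= cap[r] for r in cap)

      print(fits_by_peak(app_a, app_b, CAPACITY))   # False: 12+10 > 16 CPUs
      print(fits_together(app_a, app_b, CAPACITY))  # True: peaks never coincide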

  • PI: Babak Falsafi, EPFL; Co-PI: Stavros Volos, Microsoft Research

    Near-memory processing (NMP) is a promising approach to satisfying the performance requirements of modern datacenter services at a fraction of modern infrastructure's power. NMP leverages emerging die-stacked DRAM technology, which (a) delivers high-bandwidth memory access, and (b) features a logic die that provides the opportunity for a dramatic reduction in data movement, and consequently for energy savings, by pushing computation closer to the data. In the precursor to this project (the MSR European PhD Scholarship), we evaluated algorithms for running database join operators near memory. We showed that, while sort join has conventionally been considered inferior to hash join in performance on CPUs, near-memory processing favors sequential over random memory access, making sort join superior in performance and efficiency as a near-memory service (a toy sketch contrasting the two access patterns follows the list below). In this project, we propose to answer the following questions:

    • What data-specific functionality should be implemented near memory (e.g., data filtering, data reorganization, data fetch)?
    • What ubiquitous, yet simple system-level functionality should be implemented near memory (e.g., security, compression, remote memory access)?
    • How should the services be integrated with the system (e.g., how does the software use them)?
    • How do we employ near-threshold logic in near-memory processing?
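
    The intuition from the precursor project can be made concrete with a toy access-pattern count (our own illustration; real engines are far more involved): a hash join issues one random probe per probe-side row, while a sort-merge join, once its inputs are sorted by sequential passes, only advances two cursors sequentially, which is exactly the pattern the text above notes NMP favors.

      def hash_join(r, s):
          table = {}
          for row in r:                     # build: sequential read of R
              table[row] = table.get(row, 0) + 1
          random_probes, out = 0, []
          for row in s:                     # probe: one random access per S row
              random_probes += 1
              if row in table:
                  out.extend([row] * table[row])
          return out, random_probes

      def sort_merge_join(r, s):
          r, s = sorted(r), sorted(s)       # sort passes are sequential sweeps
          i = j = 0
          out = []
          while i < len(r) and j < len(s):  # merge: two sequential cursors
              if r[i] < s[j]:
                  i += 1
              elif r[i] > s[j]:
                  j += 1
              else:
                  out.append(r[i])
                  i += 1                    # simplified: keys unique in R
          return out, 0                     # no random probes in the merge

      r = list(range(0, 1000, 2))           # even keys
      s = list(range(0, 1000, 3))           # multiples of 3
      print(hash_join(r, s)[1])             # 334 random probes
      print(sorted(hash_join(r, s)[0]) == sort_merge_join(r, s)[0])  # True
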
  • PIs: Babak Falsafi and Martin Jaggi, EPFL; Co-PI: Eric Chung, Microsoft Research

    Deep Neural Networks (DNNs) have emerged as the algorithms of choice for many prominent machine learning tasks, including image analysis and speech recognition. In datacenters, DNNs are trained on massive datasets to improve prediction accuracy. While the computational demands of performing online inference in an already trained DNN can be met by commodity servers, training DNNs often requires computational density that is orders of magnitude higher than that provided by modern servers. As such, operators often use dedicated clusters of GPUs for training DNNs. Unfortunately, dedicated GPU clusters introduce significant additional acquisition costs, break the continuity and homogeneity of datacenters, and are inherently not scalable.

    FPGAs are appearing in server nodes either as daughter cards (e.g., Catapult) or as coherent sockets (e.g., Intel HARP), providing a great opportunity to co-locate inference and training on the same platform. While these designs enable natural continuity for platforms, co-locating inference and training on a single node faces a number of key challenges. First, FPGAs inherently suffer from low computational density. Second, conventional training algorithms do not scale due to their inherently high communication requirements. Finally, co-location may lead to contention, requiring mechanisms to prioritize inference over training.

    In this project, we will address these fundamental challenges in DNN inference/training co-location on servers with integrated FPGAs. Our goals are:

    • Redesign training and inference algorithms to take advantage of DNNs' inherent tolerance for low-precision operations (a toy sketch of such quantization follows this list).
    • Identify good candidates for hard-logic blocks for the next generations of FPGAs.
    • Redesign DNN training algorithms to aggressively approximate and compress intermediate results, in order to remove communication bottlenecks and scale the training of a single network to an arbitrary number of nodes.
    • Implement FPGA-based load balancing techniques in order to provide latency guarantees for inference tasks under heavy loads and enable the use of idle accelerator cycles to train networks when operating under lower loads.
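
    As a toy illustration of the low-precision and compression directions in the goals above (our sketch, not the project's algorithms): uniform quantization shrinks the gradient traffic exchanged between nodes at the cost of a bounded rounding error.

      import numpy as np

      def quantize(grad, bits=4):
          """Uniformly quantize a gradient tensor to the given bit width.
          Returns integer codes plus the offset/scale needed to dequantize."""
          levels = (1 << bits) - 1
          lo, hi = grad.min(), grad.max()
          scale = (hi - lo) / levels if hi > lo else 1.0
          codes = np.round((grad - lo) / scale).astype(np.uint8)
          return codes, lo, scale

      def dequantize(codes, lo, scale):
          return codes.astype(np.float32) * scale + lo

      rng = np.random.default_rng(0)
      grad = rng.normal(scale=0.01, size=1_000_000).astype(np.float32)

      codes, lo, scale = quantize(grad, bits=4)
      approx = dequantize(codes, lo, scale)

      # Packed two per byte, 4-bit codes cut communication ~8x vs float32,
      # while the rounding error stays within half a quantization step.
      print(f"max error: {np.abs(grad - approx).max():.2e}")
      print(f"half step: {scale / 2:.2e}")
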
  • PIs: Pascal Fua and Mathieu Salzmann, EPFL; Co-PIs: Debadeepta Dey, Ashish Kapoor, Sudipta Sinha, Microsoft Research

    Several companies are now launching drones that autonomously follow and film their owners, often by tracking a GPS device the owners carry. This holds the promise of fundamentally changing the way drones are used, by allowing them to bring back videos of their owners performing activities, such as playing sports, unimpeded by the need to control the drone. In this project, we propose to go one step further and turn the drone into a personal trainer that will not only film but also analyze the video sequences and provide advice on how to improve performance. For example, a golfer could be followed by such a drone that detects when he swings and offers advice on how to improve the motion. Similarly, a skier coming down a slope could be given advice on how to better turn and carve. In short, the drone would replace the GoPro-style action cameras that many people now carry when exercising. Instead of recording what they see, it would film them and augment the resulting sequences with useful advice. To make this solution as lightweight as possible, we will strive to achieve this goal using the on-board camera as the sole sensor, freeing the user from the need to carry a special device for the drone to lock onto. This will require:

    • Detecting the subject in the video sequences acquired by the drone so as to keep him in the middle of its field of view. This must be done in real time and integrated into the drone's control system (see the control-loop sketch after this list).
    • Recovering, from the drone's videos, the subject's 3D pose as he moves. This can be done with a slight delay, since the critique only has to be provided once the motion has been performed.
    • Providing feedback. In both the golf and ski cases, this would mean quantifying the leg, hip, shoulder, and head positions during a swing or a turn, offering practical suggestions on how to change them, and showing how an expert would have performed the same action.
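
    For the first item, here is a minimal sketch (ours; the gains, resolution, and detector output are placeholders) of a visual-servoing loop that turns the detected subject's pixel offset into commands that re-center the camera.

      # Toy proportional controller mapping the pixel offset of a detected
      # subject to yaw/pitch rate commands (all constants are invented).
      WIDTH, HEIGHT = 1280, 720
      K_YAW, K_PITCH = 0.8, 0.6      # proportional gains (rad/s per unit error)

      def centering_command(bbox):
          """bbox = (x_min, y_min, x_max, y_max) in pixels from the detector."""
          cx = (bbox[0] + bbox[2]) / 2
          cy = (bbox[1] + bbox[3]) / 2
          # Normalized offset from the image center, in [-1, 1] on each axis.
          ex = (cx - WIDTH / 2) / (WIDTH / 2)
          ey = (cy - HEIGHT / 2) / (HEIGHT / 2)
          yaw_rate = K_YAW * ex        # subject right of center -> yaw right
          pitch_rate = -K_PITCH * ey   # subject below center -> pitch gimbal down
          return yaw_rate, pitch_rate

      print(centering_command((900, 400, 1000, 500)))  # subject right and low
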
  • PIs: Rachid Guerraoui and Georgios Chatzopoulos, EPFL; Co-PI: Aleksandar Dragojevic, Microsoft Research

    Modern hardware trends have changed the way we build systems and applications. Increasing memory (DRAM) capacities at reduced prices make keeping all data in memory cost-effective, presenting opportunities for high-performance applications such as in-memory graphs with billions of edges (e.g., Facebook's TAO). Non-Volatile RAM (NVRAM) promises durability in the presence of failures without the high price of disk accesses. Yet, even with this abundance of inexpensive memory, storing all the data in the memory of one machine is still not possible for applications that operate on terabytes of data, and systems need to distribute the data and synchronize accesses among machines.

    This project proposes to design and build support for high-level transactions on top of modern hardware platforms, using the Structured Query Language (SQL). The important question to be answered is whether transactions can extract the maximum benefit from these modern networking and hardware capabilities while offering a significantly easier interface for developers to work with. This will require research both into the transactional support to be offered, including the operations that can be efficiently supported, and into execution plans for transactions in this distributed setting.
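
    As a rough sketch of the kind of transactional machinery involved (our simplification; an in-process dict stands in for the distributed store, whose reads and writes would in practice traverse the network), optimistic concurrency control reads object versions, buffers writes, and validates the versions at commit time.

      # Toy optimistic transaction over versioned in-memory objects.
      store = {"acct:alice": (100, 0), "acct:bob": (50, 0)}  # key -> (value, version)

      class Transaction:
          def __init__(self):
              self.read_set = {}    # key -> version observed
              self.write_set = {}   # key -> new value (buffered until commit)

          def read(self, key):
              value, version = store[key]
              self.read_set[key] = version
              return self.write_set.get(key, value)

          def write(self, key, value):
              self.write_set[key] = value

          def commit(self):
              # Validate: abort if any object we read changed concurrently.
              for key, seen in self.read_set.items():
                  if store[key][1] != seen:
                      return False
              # Install writes, bumping versions (single-threaded toy; a real
              # system would lock the write set before validating).
              for key, value in self.write_set.items():
                  _, version = store.get(key, (None, -1))
                  store[key] = (value, version + 1)
              return True

      t = Transaction()
      t.write("acct:alice", t.read("acct:alice") - 30)
      t.write("acct:bob", t.read("acct:bob") + 30)
      print(t.commit(), store)   # True, balances 70 and 80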

  • PIs: Michael Kapralov and Ola Svensson, EPFL; Co-PIs: Yuval Peres, Nikhil Devanur and Sebastien Bubeck, Microsoft Research

    Grouping data according to similarity is a basic computational task with numerous applications. The right notion of similarity often depends on the application, and different measures yield different algorithmic problems.

    The goal of this project is to design faster and more accurate algorithms for fundamental clustering problems such as k-means, correlation clustering, and hierarchical clustering. We propose to perform a fine-grained study of these problems and to design algorithms that achieve optimal trade-offs between approximation quality, runtime, and space/communication complexity, making our algorithms well-suited to modern computational models such as streaming and MapReduce.
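
    To make one of the target problems concrete, here is a minimal sketch of Lloyd's heuristic, the textbook baseline for k-means; the proposed algorithms aim for better trade-offs than such batch passes allow, especially in streaming settings.

      import numpy as np

      def kmeans(points, k, iters=20, seed=0):
          """Plain Lloyd's heuristic: alternate assignment and centroid
          update. Each pass is O(n * k * d) and needs all points in memory;
          streaming variants trade accuracy for one pass and small space."""
          rng = np.random.default_rng(seed)
          centers = points[rng.choice(len(points), k, replace=False)]
          for _ in range(iters):
              # Assign every point to its nearest center.
              d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
              labels = d.argmin(axis=1)
              # Move each center to the mean of its cluster.
              for j in range(k):
                  if (labels == j).any():
                      centers[j] = points[labels == j].mean(axis=0)
          cost = ((points - centers[labels]) ** 2).sum()   # k-means objective
          return centers, labels, cost

      rng = np.random.default_rng(1)
      pts = np.vstack([rng.normal(c, 0.3, (100, 2))
                       for c in ((0, 0), (4, 4), (0, 4))])
      centers, labels, cost = kmeans(pts, k=3)
      print(cost)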