RAMBDA: RDMA-driven Acceleration Framework for Memory-intensive us-scale Datacenter Applications

Yifan Yuan; Jinghan Huang; Yan Sun; Tianchen Wang; Jacob Nelson; Dan R. K. Ports; Yipeng Wang; Ren Wang; Charlie Tai; Nam Sung Kim

International Symposium on High Performance Computer Architecture (HPCA '23) | February 2023

Diverse solutions, including SmartNIC-based ones, have been proposed in response to the “datacenter tax” and “killer microseconds” problems of memory-intensive datacenter applications. Nonetheless, they often suffer from the high overhead of communication over network and/or PCIe links. To tackle the limitations of current solutions, this paper proposes RAMBDA (RDMA-driven Acceleration Framework for Boosting Performance of Memory-intensive us-scale Datacenter Applications), a holistic network and architecture co-design solution. RAMBDA leverages current RDMA and emerging cache-coherent off-chip interconnect technologies and consists of the following four hardware and software components: (1) a unified abstraction of inter- and intra-machine communications, synergistically managed by one-sided RDMA write and cache-coherent memory write; (2) efficient notification of requests to accelerators, assisted by cache coherence; (3) a cache-coherent accelerator architecture that interacts directly with the NIC; and (4) adaptive device-to-host data transfer for modern server memory systems comprising both DRAM and NVM, exploiting state-of-the-art features in CPUs and PCIe. We prototype RAMBDA with a commercial system and evaluate three popular datacenter applications: (1) an in-memory key-value store, (2) a chain replication-based distributed transaction system, and (3) deep learning recommendation model inference. The evaluation shows that RAMBDA provides 30.1-69.1% lower latency, up to 2.5x higher throughput, and ~3x higher energy efficiency than the current state-of-the-art solutions.
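
Components (1) and (2) describe requests that arrive through one-sided writes and are then noticed by an accelerator. The sketch below is a purely software illustration of that general pattern, not RAMBDA's actual design. It assumes a fixed-size ring of slots in which a writer stores a payload and then sets a valid flag, and a consumer polls that flag. The names `RequestRing`, `Slot`, `one_sided_write`, and `poll` are hypothetical, and plain Python objects stand in for RDMA-registered or cache-coherent memory.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Slot:
    valid: bool = False   # set last by the writer, cleared by the consumer
    payload: bytes = b""


class RequestRing:
    """Hypothetical fixed-size ring shared by one writer and one consumer."""

    def __init__(self, capacity: int) -> None:
        self.slots: List[Slot] = [Slot() for _ in range(capacity)]
        self.head = 0  # next slot the writer fills
        self.tail = 0  # next slot the consumer reads

    def one_sided_write(self, payload: bytes) -> bool:
        """Stand-in for a remote one-sided write or a local coherent store."""
        slot = self.slots[self.head]
        if slot.valid:            # ring is full; the writer must retry later
            return False
        slot.payload = payload    # write the data first...
        slot.valid = True         # ...then publish it by setting the flag
        self.head = (self.head + 1) % len(self.slots)
        return True

    def poll(self) -> Optional[bytes]:
        """Stand-in for the accelerator observing a newly written slot."""
        slot = self.slots[self.tail]
        if not slot.valid:        # nothing new has been published
            return None
        payload = slot.payload
        slot.valid = False        # hand the slot back to the writer
        self.tail = (self.tail + 1) % len(self.slots)
        return payload


ring = RequestRing(capacity=2)
# The third write is rejected because both slots are still occupied.
accepted = [ring.one_sided_write(p) for p in (b"GET k1", b"GET k2", b"GET k3")]
# The consumer drains the two accepted requests in order, then finds nothing.
drained = [ring.poll(), ring.poll(), ring.poll()]
```

In this toy version the consumer has to poll the flag in a loop. The abstract indicates that RAMBDA instead relies on cache coherence to make notification efficient, which the sketch does not model.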