A Disaggregate Data Collecting Approach for Loss-Tolerant Applications

6th Asia-Pacific Workshop on Networking (APNet '22) |

Datacenter generates operation data at an extremely high rate, and data center operators collect and analyze them for problem diagnosis, resource utilization improvement, and performance optimization. However, existing data collection methods fail to efficiently aggregate and store data at extremely high speed and scale. In this paper, we explore a new approach that leverages programmable switches to aggregate data and directly write data to the destination storage. Our proposed data collection system, ALT, uses programmable switches to control NVMe SSDs on remote hosts without the involvement of a remote CPU. To tolerate loss, ALT uses an elegant data structure to enable efficient data recovery when retrieving the collected data. We implement our system on a Tofino-based programmable switch for a prototype. Our evaluation shows that ALT can saturate SSD’s peak performance without any CPU involvement.