Understanding and Mitigating Packet Corruption in Data Center Networks

  • Danyang Zhuo
  • Manya Ghobadi
  • Ratul Mahajan
  • Klaus-Tycho Forster
  • Arvind Krishnamurthy
  • Thomas Anderson

Proceedings of SIGCOMM ’17, Los Angeles, CA, USA, August 21–25, 2017

Published by ACM SIGCOMM

We take a comprehensive look at packet corruption in data center networks, which leads to packet losses and application performance degradation. By studying 350K links across 15 production data centers, we find that the extent of corruption losses is significant and that its characteristics differ markedly from congestion losses. Corruption impacts fewer links than congestion, but imposes a heavier loss rate; and unlike congestion, the corruption rate on a link is stable over time and is not correlated with the link's utilization. Based on these observations, we developed CorrOpt, a system to mitigate corruption. To minimize corruption losses, it intelligently selects which corrupting links can be safely disabled, while ensuring that each top-of-rack switch has a minimum number of paths to reach other switches. CorrOpt also recommends specific actions (e.g., replace cables, clean connectors) to repair disabled links, based on our analysis of common symptoms of different root causes of corruption. Our recommendation engine has been deployed in over seventy data centers of a large cloud provider. Our analysis shows that, compared to the current state of the art, CorrOpt can reduce corruption losses by three to six orders of magnitude and improve repair accuracy by 60%.
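
The link-selection idea in the abstract (disable corrupting links only while every top-of-rack switch keeps a minimum number of paths) can be illustrated with a small sketch. This is not CorrOpt's actual algorithm, which the abstract does not specify; it is a simple greedy pass under assumed conditions: a layered topology where every link goes "up" from a lower switch to a higher one, a per-link corruption loss rate, and an absolute `min_paths` threshold. All names here (`count_paths`, `select_links_to_disable`, `min_paths`) are hypothetical.

```python
def count_paths(tor, up_links, spines):
    """Count distinct upward paths from `tor` to any spine switch.

    `up_links` maps a switch to the list of switches directly above it.
    Assumes a layered (acyclic) topology.
    """
    memo = {}

    def paths(node):
        if node in spines:
            return 1
        if node not in memo:
            memo[node] = sum(paths(up) for up in up_links.get(node, ()))
        return memo[node]

    return paths(tor)


def select_links_to_disable(links, corruption, tors, spines, min_paths):
    """Greedily disable corrupting links, worst loss rate first.

    links:      iterable of (lower_switch, upper_switch) tuples
    corruption: dict mapping a corrupting link to its loss rate
    A link is disabled only if every ToR still has >= min_paths paths.
    """
    active = set(links)
    disabled = []
    for link in sorted(corruption, key=corruption.get, reverse=True):
        trial = active - {link}
        up_links = {}
        for lower, upper in trial:
            up_links.setdefault(lower, []).append(upper)
        if all(count_paths(t, up_links, spines) >= min_paths for t in tors):
            active = trial
            disabled.append(link)
    return disabled


# Hypothetical two-ToR, two-aggregation, two-spine topology.
example_links = [
    ("t1", "a1"), ("t1", "a2"), ("t2", "a1"), ("t2", "a2"),
    ("a1", "s1"), ("a1", "s2"), ("a2", "s1"), ("a2", "s2"),
]
example_corruption = {("t1", "a1"): 1e-3, ("t1", "a2"): 1e-4, ("a1", "s1"): 1e-5}
example_disabled = select_links_to_disable(
    example_links, example_corruption, {"t1", "t2"}, {"s1", "s2"}, min_paths=2
)
```

In this example the second corrupting link, `("t1", "a2")`, stays active: disabling it after `("t1", "a1")` would leave `t1` with no path to the spine. A greedy order like this is only one possible design choice; it does not guarantee the minimum total corruption loss.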