Measuring the Impact of Memory Errors on Application Performance

  • Mark Gottscho
  • Bikash Sharma
  • Puneet Gupta
  • Shuayb Zarar
  • Sriram Govindan
  • Di Wang

IEEE Computer Architecture Letters (CAL)

Memory reliability is a key factor in the design of warehouse-scale computers. Prior work has focused on the performance overheads of memory fault-tolerance schemes in two scenarios: when no errors occur at all, and when detected but uncorrectable errors occur, which result in machine downtime and loss of availability. We focus on a common third scenario, in which hard but correctable faults exist in memory; these may cause an “avalanche” of errors on the affected hardware. We show how the hardware/software mechanisms for managing and reporting memory errors can cause severe performance degradation in systems suffering from hardware faults. We inject faults in DRAM on a real cloud server and quantify the single-machine performance degradation for both batch and interactive workloads. For SPEC CPU2006 benchmarks, we observe that memory errors can slow down average execution time by up to 2.5×. For an interactive web-search workload, average query latency degrades by up to 2.3× under a light traffic load, and by up to an extreme 3746× under peak load. Our analyses of the memory error-reporting stack reveal architecture, firmware, and software opportunities to improve performance consistency by mitigating the worst-case behavior on faulty hardware.