Measuring the Impact of Memory Errors on Application Performance

Mark Gottscho; Bikash Sharma; Puneet Gupta; Shuayb Zarar; Sriram Govindan; Di Wang

IEEE Computer Architecture Letters (CAL) | August 2016

Memory reliability is a key factor in the design of warehouse-scale computers. Prior work has focused on the performance overheads of memory fault-tolerance schemes in two scenarios: when no errors occur at all, and when detected but uncorrectable errors occur, which result in machine downtime and loss of availability. We focus on a common third scenario, namely, situations when hard but correctable faults exist in memory; these may cause an “avalanche” of errors to occur on affected hardware. We expose how the hardware/software mechanisms for managing and reporting memory errors can cause severe performance degradation in systems suffering from hardware faults. We inject faults in DRAM on a real cloud server and quantify the single-machine performance degradation for both batch and interactive workloads. We observe that for SPEC CPU2006 benchmarks, memory errors can slow down average execution time by up to 2.5×. For an interactive web-search workload, average query latency degrades by up to 2.3× for a light traffic load, and up to an extreme 3746× under peak load. Our analyses of the memory error-reporting stack reveal architecture, firmware, and software opportunities to improve performance consistency by mitigating the worst-case behavior on faulty hardware.

© IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works for resale or redistribution to servers or lists, or reuse of any copyrighted components of this work in other works.