Revealing DRAM Operating GuardBands Through Workload-Aware Error Predictive Modeling

Bibliographic Details
Published in: IEEE Transactions on Computers, Vol. 70, no. 11, pp. 1976-1987
Main Authors: Mukhanov, Lev; Tovletoglou, Konstantinos; Vandierendonck, Hans; Nikolopoulos, Dimitrios S.; Karakonstantis, Georgios
Format: Journal Article
Language: English
Published: New York: IEEE, 01.11.2021
The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Summary: Improving the energy efficiency of DRAMs is becoming very challenging due to the growing demand for storage capacity and failures induced by the manufacturing process. To protect against failures, vendors adopt conservative margins in the refresh period and supply voltage. Previously, it was shown that these margins are too pessimistic and will become impractical due to high power costs, especially in future DRAM technologies. In this article, we present a new technique for automatically scaling the DRAM refresh period under reduced supply voltage while minimizing the probability of failures. The main idea behind the proposed approach is that DRAM error behavior is workload-dependent and can be predicted from particular inherent program features. We use a Machine Learning (ML) method to build a workload-aware DRAM error behavior model based on program features that we extract from real workloads during our DRAM error characterization campaign. With such a model, we identify the marginal value of the DRAM refresh period under relaxed voltage for each DRAM module of a server, which enables us to reduce the DRAM power. We implement a temperature-driven OS governor that automatically sets the module-specific marginal DRAM parameters discovered by the ML model. Our governor reduces the DRAM power by 24 percent on average while minimizing the probability of failures. Unlike previous studies, our technique: i) does not require intrusive changes to hardware; ii) is implemented on a real server; iii) uses a mechanism that prevents any abnormal DRAM error behavior; and iv) can be easily deployed in data centers.
ISSN: 0018-9340, 1557-9956
DOI: 10.1109/TC.2020.3033627