Revealing DRAM Operating GuardBands Through Workload-Aware Error Predictive Modeling

Bibliographic Details
Published in	IEEE Transactions on Computers, Vol. 70, no. 11, pp. 1976-1987
Main Authors	Mukhanov, Lev; Tovletoglou, Konstantinos; Vandierendonck, Hans; Nikolopoulos, Dimitrios S.; Karakonstantis, Georgios
Format	Journal Article
Language	English
Published	New York: IEEE, 01.11.2021; The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Subjects	Benchmark testing; Data centers; DRAM; Dynamic random access memory; Energy consumption; Errors; Failure; Feature extraction; GuardBands; Integrated circuit reliability; Low-power electronics; Machine learning; Memory management; Modules; Prediction models; Predictive models; Random access memory; Reliability; Servers; Storage capacity; Voltage; Workload
Summary:	Improving the energy efficiency of DRAMs is becoming very challenging due to the growing demand for storage capacity and failures induced by the manufacturing process. To protect against failures, vendors adopt conservative margins in the refresh period and supply voltage. Previously, it was shown that these margins are too pessimistic and will become impractical due to high power costs, especially in future DRAM technologies. In this article, we present a new technique for automatically scaling the DRAM refresh period under reduced supply voltage while minimizing the probability of failures. The main idea behind the proposed approach is that DRAM error behavior is workload-dependent and can be predicted from particular program-inherent features. We use a Machine Learning (ML) method to build a workload-aware DRAM error behavior model based on program features that we extract from real workloads during our DRAM error characterization campaign. With such a model, we identify the marginal value of the DRAM refresh period under relaxed voltage for each DRAM module of a server, which enables us to reduce DRAM power. We implement a temperature-driven OS governor that automatically sets the module-specific marginal DRAM parameters discovered by the ML model. Our governor reduces DRAM power by 24 percent on average while minimizing the probability of failures. Unlike previous studies, our technique: i) does not require intrusive changes to hardware; ii) is implemented on a real server; iii) uses a mechanism that prevents any abnormal DRAM error behavior; and iv) can be easily deployed in data centers.
ISSN:	0018-9340, 1557-9956
DOI:	10.1109/TC.2020.3033627