Revealing DRAM Operating GuardBands Through Workload-Aware Error Predictive Modeling
Published in | IEEE transactions on computers Vol. 70; no. 11; pp. 1976 - 1987 |
---|---|
Main Authors | , , , , |
Format | Journal Article |
Language | English |
Published | New York, IEEE, 01.11.2021, The Institute of Electrical and Electronics Engineers, Inc. (IEEE) |
Subjects | |
Summary: | Improving the energy efficiency of DRAMs is becoming very challenging due to the growing demand for storage capacity and failures induced by the manufacturing process. To protect against failures, vendors adopt conservative margins in the refresh period and supply voltage. Previously, it was shown that these margins are too pessimistic and will become impractical due to high power costs, especially in future DRAM technologies. In this article, we present a new technique for automatically scaling the DRAM refresh period under reduced supply voltage while minimizing the probability of failures. The main idea behind the proposed approach is that DRAM error behavior is workload-dependent and can be predicted from particular inherent program features. We use a Machine Learning (ML) method to build a workload-aware DRAM error behavior model based on program features that we extract from real workloads during our DRAM error characterization campaign. With such a model, we identify the marginal value of the DRAM refresh period under relaxed voltage for each DRAM module of a server, which enables us to reduce the DRAM power. We implement a temperature-driven OS governor that automatically sets the module-specific marginal DRAM parameters discovered by the ML model. Our governor reduces the DRAM power by 24 percent on average while minimizing the probability of failures. Unlike previous studies, our technique: i) does not require intrusive changes to hardware; ii) is implemented on a real server; iii) uses a mechanism that prevents any abnormal DRAM error behavior; and iv) can be easily deployed in data centers. |
---|---|
ISSN: | 0018-9340 1557-9956 |
DOI: | 10.1109/TC.2020.3033627 |