RecNMP: Accelerating Personalized Recommendation with Near-Memory Processing
Published in | 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA), pp. 790-803 |
---|---|
Format | Conference Proceeding |
Language | English |
Published | IEEE, 01.05.2020 |
Summary | Personalized recommendation systems leverage deep learning models and account for the majority of data center AI cycles. Their performance is dominated by memory-bound sparse embedding operations with unique irregular memory access patterns that pose a fundamental challenge to accelerate. This paper proposes a lightweight, commodity DRAM compliant, near-memory processing solution to accelerate personalized recommendation inference. The in-depth characterization of production-grade recommendation models shows that embedding operations with high model-, operator- and data-level parallelism lead to memory bandwidth saturation, limiting recommendation inference performance. We propose RecNMP which provides a scalable solution to improve system throughput, supporting a broad range of sparse embedding models. RecNMP is specifically tailored to production environments with heavy co-location of operators on a single server. Several hardware/software co-optimization techniques such as memory-side caching, table-aware packet scheduling, and hot entry profiling are studied, providing up to 9.8× memory latency speedup over a highly-optimized baseline. Overall, RecNMP offers 4.2× throughput improvement and 45.8% memory energy savings. |
DOI: | 10.1109/ISCA45697.2020.00070 |
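The summary attributes recommendation inference cost to memory-bound sparse embedding operations. As a minimal illustration of what such an operation looks like, here is a sketch of a SparseLengthsSum-style gather-and-pool operator in pure Python; the function name, shapes, and data are illustrative assumptions, not taken from the paper's production models:

```python
# Hypothetical sketch of sparse embedding pooling (illustrative only).
def embedding_pooling(table, indices, lengths):
    """Gather rows of `table` at `indices` and sum-pool them per bag.

    `lengths[i]` gives how many consecutive entries of `indices` belong
    to bag i. Each lookup is an irregular, data-dependent memory access
    followed by a cheap elementwise add, which is why this operator is
    bandwidth-bound rather than compute-bound.
    """
    dim = len(table[0])
    out = []
    pos = 0
    for n in lengths:
        pooled = [0.0] * dim
        for idx in indices[pos:pos + n]:
            row = table[idx]            # irregular gather from the table
            for d in range(dim):
                pooled[d] += row[d]     # small reduction per fetched row
        out.append(pooled)
        pos += n
    return out

# Tiny example: a 4-row, 2-dim table, two bags of sizes 2 and 1.
table = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
result = embedding_pooling(table, indices=[0, 2, 3], lengths=[2, 1])
# bag 0: row 0 + row 2 = [6.0, 8.0]; bag 1: row 3 = [7.0, 8.0]
```

Because the arithmetic per fetched row is trivial, performing the gather and reduction near the DRAM (as RecNMP proposes) avoids moving every embedding row across the memory bus.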