High-Performance Tensor-Train Primitives Using GPU Tensor Cores

Bibliographic Details
Published in: IEEE Transactions on Computers, Vol. 73, No. 11, pp. 2634-2648
Main Authors: Liu, Xiao-Yang; Hong, Hao; Zhang, Zeliang; Tong, Weiqin; Kossaifi, Jean; Wang, Xiaodong; Walid, Anwar
Format: Journal Article
Language: English
Published: IEEE, 01.11.2024
Summary: Learning the tensor-train (TT) structure (a.k.a. matrix product state (MPS) representation) from large-scale, high-dimensional data is a common task in big data analysis, deep learning, and quantum machine learning. However, tensor-train algorithms are compute-intensive, which hinders their real-world applications. In this paper, we present high-performance tensor-train primitives using GPU tensor cores and demonstrate three applications. First, we use GPU tensor cores to optimize tensor-train primitives, including tensor contraction, singular value decomposition, and data transfer and computing. Second, we utilize the optimized primitives to accelerate tensor-train decomposition algorithms for big data analysis. Further, we propose a shard mode for high-order tensor computations on multiple GPUs. Third, we apply the optimized primitives to accelerate the tensor-train layer for compressing deep neural networks. Last, we utilize the optimized primitives to accelerate a quantum machine learning algorithm, the Density Matrix Renormalization Group (DMRG). In performance evaluations, our third-order TT decomposition achieves up to 3.34× and 6.91× speedups over two popular libraries (namely T3F and tntorch) on an A100 GPU, respectively. The proposed sixth-order tensor-train decomposition achieves up to a 5.01× speedup over T3F on multiple A100 GPUs. Our tensor-train layer for a fully connected neural network achieves a compression ratio of 65.3× at the cost of a 0.3% drop in accuracy and a 1.53× speedup over a PyTorch implementation on CUDA cores. The optimized DMRG algorithm achieves up to a 14.0× speedup over TensorNetwork, indicating the potential of the optimized tensor primitives for the classical simulation of quantum machine learning algorithms.
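For context on the decomposition primitive the abstract refers to, the sketch below illustrates a plain TT-SVD: a dense tensor is factored into a chain of third-order cores by sequential truncated SVDs. This is a minimal PyTorch sketch for illustration only, not the paper's tensor-core-optimized implementation; the function name tt_svd and the fixed max_rank truncation are assumptions made for this example.

```python
# Minimal TT-SVD sketch (illustrative only; not the paper's tensor-core
# implementation). The name `tt_svd` and the fixed `max_rank` cap are
# assumptions made for this example.
import torch

def tt_svd(tensor: torch.Tensor, max_rank: int):
    """Factor `tensor` of shape (n1, ..., nd) into TT cores
    G_k of shape (r_{k-1}, n_k, r_k), with r_0 = r_d = 1."""
    dims = tensor.shape
    d = len(dims)
    cores = []
    rank_prev = 1
    # Mode-1 unfolding: rows indexed by the first mode.
    unfolding = tensor.reshape(dims[0], -1)
    for k in range(d - 1):
        # Truncated SVD of the current unfolding.
        u, s, vh = torch.linalg.svd(unfolding, full_matrices=False)
        rank = min(max_rank, s.numel())
        u, s, vh = u[:, :rank], s[:rank], vh[:rank, :]
        cores.append(u.reshape(rank_prev, dims[k], rank))
        # Push the remaining factor into the next unfolding.
        unfolding = (s[:, None] * vh).reshape(rank * dims[k + 1], -1)
        rank_prev = rank
    cores.append(unfolding.reshape(rank_prev, dims[-1], 1))
    return cores

# Usage: decompose a third-order tensor and check the reconstruction error.
x = torch.randn(16, 16, 16)
cores = tt_svd(x, max_rank=8)
recon = cores[0]
for core in cores[1:]:
    recon = torch.tensordot(recon, core, dims=([recon.dim() - 1], [0]))
recon = recon.squeeze(0).squeeze(-1)
print(torch.norm(x - recon) / torch.norm(x))
```

The truncated SVDs and the core contractions in this procedure are the matrix-multiplication-heavy steps that the paper's primitives target with GPU tensor cores.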
ISSN: 0018-9340, 1557-9956
DOI: 10.1109/TC.2024.3441831