High-Performance Tensor-Train Primitives Using GPU Tensor Cores

Bibliographic Details
Published in: IEEE Transactions on Computers, Vol. 73, No. 11, pp. 2634-2648
Main Authors: Liu, Xiao-Yang; Hong, Hao; Zhang, Zeliang; Tong, Weiqin; Kossaifi, Jean; Wang, Xiaodong; Walid, Anwar
Format: Journal Article
Language: English
Published: IEEE, 01.11.2024
Summary: Learning the tensor-train (TT) structure (a.k.a. matrix product state (MPS) representation) from large-scale, high-dimensional data is a common task in big data analysis, deep learning, and quantum machine learning. However, tensor-train algorithms are compute-intensive, which hinders their real-world applications. In this paper, we present high-performance tensor-train primitives using GPU tensor cores and demonstrate three applications. First, we use GPU tensor cores to optimize tensor-train primitives, including tensor contraction, singular value decomposition, and data transfer and computing. Second, we utilize the optimized primitives to accelerate tensor-train decomposition algorithms for big data analysis. Further, we propose a shard mode for high-order tensor computations on multiple GPUs. Third, we apply the optimized primitives to accelerate the tensor-train layer for compressing deep neural networks. Last, we utilize the optimized primitives to accelerate a quantum machine learning algorithm, the Density Matrix Renormalization Group (DMRG). In performance evaluations, our third-order TT decomposition achieves up to 3.34× and 6.91× speedups over two popular libraries (namely T3F and tntorch) on an A100 GPU, respectively. The proposed sixth-order tensor-train decomposition achieves up to a 5.01× speedup over T3F on multiple A100 GPUs. Our tensor-train layer for a fully connected neural network achieves a compression ratio of 65.3× at the cost of a 0.3% drop in accuracy and a 1.53× speedup over a PyTorch implementation on CUDA cores. The optimized DMRG algorithm achieves up to a 14.0× speedup over TensorNetwork, indicating the potential of the optimized tensor primitives for the classical simulation of quantum machine learning algorithms.
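For context on the decomposition primitive the abstract refers to, the sketch below illustrates a plain TT-SVD: a dense tensor is factored into a chain of third-order cores by sequential truncated SVDs. This is a minimal PyTorch sketch for illustration only, not the paper's tensor-core-optimized implementation; the function name tt_svd and the fixed max_rank truncation are assumptions made for this example.

```python
# Minimal TT-SVD sketch (illustrative only; not the paper's tensor-core
# implementation). The name `tt_svd` and the fixed `max_rank` cap are
# assumptions made for this example.
import torch

def tt_svd(tensor: torch.Tensor, max_rank: int):
    """Factor `tensor` of shape (n1, ..., nd) into TT cores
    G_k of shape (r_{k-1}, n_k, r_k), with r_0 = r_d = 1."""
    dims = tensor.shape
    d = len(dims)
    cores = []
    rank_prev = 1
    # Mode-1 unfolding: rows indexed by the first mode.
    unfolding = tensor.reshape(dims[0], -1)
    for k in range(d - 1):
        # Truncated SVD of the current unfolding.
        u, s, vh = torch.linalg.svd(unfolding, full_matrices=False)
        rank = min(max_rank, s.numel())
        u, s, vh = u[:, :rank], s[:rank], vh[:rank, :]
        cores.append(u.reshape(rank_prev, dims[k], rank))
        # Push the remaining factor into the next unfolding.
        unfolding = (s[:, None] * vh).reshape(rank * dims[k + 1], -1)
        rank_prev = rank
    cores.append(unfolding.reshape(rank_prev, dims[-1], 1))
    return cores

# Usage: decompose a third-order tensor and check the reconstruction error.
x = torch.randn(16, 16, 16)
cores = tt_svd(x, max_rank=8)
recon = cores[0]
for core in cores[1:]:
    recon = torch.tensordot(recon, core, dims=([recon.dim() - 1], [0]))
recon = recon.squeeze(0).squeeze(-1)
print(torch.norm(x - recon) / torch.norm(x))
```

The truncated SVDs and the core contractions in this procedure are the matrix-multiplication-heavy steps that the paper's primitives target with GPU tensor cores.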
ISSN: 0018-9340, 1557-9956
DOI: 10.1109/TC.2024.3441831