TetriX: Flexible Architecture and Optimal Mapping for Tensorized Neural Network Processing

Bibliographic Details
Published in: IEEE Transactions on Computers, Vol. 73, No. 5, pp. 1219-1232
Main Authors: Zhang, Jie-Fang; Lu, Cheng-Hsun; Zhang, Zhengya
Format: Journal Article
Language: English
Published: New York: IEEE, 01.05.2024
Publisher: The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Summary: The continuous growth of deep neural network model size and complexity hinders the adoption of large models on resource-constrained platforms. Tensor decomposition has been shown to be effective in reducing model size by large compression ratios, but the resulting tensorized neural networks (TNNs) require complex and versatile tensor shaping for tensor contraction, causing low processing efficiency on existing hardware architectures. This work presents TetriX, a co-design of a flexible architecture and optimal workload mapping for efficient and flexible TNN processing. TetriX adopts a unified processing architecture that supports both inner and outer products. A hybrid mapping scheme is proposed to eliminate complex tensor shaping by alternating between inner and outer products in a sequence of tensor contractions. Finally, a mapping-aware contraction sequence search (MCSS) is proposed to identify the contraction sequence and workload mapping that achieve the optimal latency on TetriX. Remarkably, combining TetriX with MCSS outperforms the single-mode inner-product and outer-product baselines by up to 46.8× in performance across the collected TNN workloads. TetriX is the first work to support all existing tensor decomposition methods.
Compared to a TNN accelerator designed for the hierarchical Tucker method, TetriX achieves improvements of 6.5× and 1.1× in inference throughput and efficiency, respectively.
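To make the contraction-ordering idea in the abstract concrete, here is a minimal, hypothetical sketch (not the paper's implementation, and the tensor shapes are invented for illustration): a chain of tensor-train-style factors can be contracted in different orders that yield the same result but very different intermediate sizes and FLOP counts, which is the property a contraction-sequence search such as MCSS exploits.

```python
import numpy as np

# Hypothetical tensor-train-style factors of a small weight tensor.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 16))       # mode x rank
B = rng.standard_normal((16, 4, 16))   # rank x mode x rank
C = rng.standard_normal((16, 4))       # rank x mode

# Order 1: contract left-to-right; intermediate t1 has shape (4, 4, 16).
t1 = np.einsum('ar,rbs->abs', A, B)
left = np.einsum('abs,sc->abc', t1, C)

# Order 2: contract right-to-left; intermediate t2 has shape (16, 4, 4).
t2 = np.einsum('rbs,sc->rbc', B, C)
right = np.einsum('ar,rbc->abc', A, t2)

# Both orders produce the same tensor, but their costs differ in general.
assert np.allclose(left, right)

# numpy's own contraction-order search is loosely analogous in spirit to a
# contraction sequence search: it picks a low-cost pairwise ordering.
path, info = np.einsum_path('ar,rbs,sc->abc', A, B, C, optimize='optimal')
```

The search in the paper additionally accounts for how each contraction maps onto the hardware (inner- vs. outer-product mode), whereas `np.einsum_path` only minimizes an arithmetic-cost model.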
ISSN: 0018-9340, 1557-9956
DOI: 10.1109/TC.2024.3365936