TetriX: Flexible Architecture and Optimal Mapping for Tensorized Neural Network Processing

Bibliographic Details
Published in: IEEE Transactions on Computers, Vol. 73, No. 5, pp. 1219-1232
Main Authors: Zhang, Jie-Fang; Lu, Cheng-Hsun; Zhang, Zhengya
Format: Journal Article
Language: English
Published: New York: IEEE, 01.05.2024
Publisher: The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Summary: The continuous growth of deep neural network model size and complexity hinders the adoption of large models on resource-constrained platforms. Tensor decomposition has been shown to be effective in reducing model size by large compression ratios, but the resulting tensorized neural networks (TNNs) require complex and versatile tensor shaping for tensor contraction, causing low processing efficiency on existing hardware architectures. This work presents TetriX, a co-design of a flexible architecture and optimal workload mapping for efficient and flexible TNN processing. TetriX adopts a unified processing architecture that supports both inner and outer products. A hybrid mapping scheme is proposed to eliminate complex tensor shaping by alternating between inner and outer products in a sequence of tensor contractions. Finally, a mapping-aware contraction sequence search (MCSS) is proposed to identify the contraction sequence and workload mapping that achieve the optimal latency on TetriX. Remarkably, combining TetriX with MCSS outperforms the single-mode inner-product and outer-product baselines by up to 46.8× in performance across the collected TNN workloads. TetriX is the first work to support all existing tensor decomposition methods.
Compared to a TNN accelerator designed for the hierarchical Tucker method, TetriX achieves improvements of 6.5× and 1.1× in inference throughput and efficiency, respectively.
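To make the contraction-ordering idea in the abstract concrete, here is a minimal, hypothetical sketch (not the paper's implementation, and the tensor shapes are invented for illustration): a chain of tensor-train-style factors can be contracted in different orders that yield the same result but very different intermediate sizes and FLOP counts, which is the property a contraction-sequence search such as MCSS exploits.

```python
import numpy as np

# Hypothetical tensor-train-style factors of a small weight tensor.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 16))       # mode x rank
B = rng.standard_normal((16, 4, 16))   # rank x mode x rank
C = rng.standard_normal((16, 4))       # rank x mode

# Order 1: contract left-to-right; intermediate t1 has shape (4, 4, 16).
t1 = np.einsum('ar,rbs->abs', A, B)
left = np.einsum('abs,sc->abc', t1, C)

# Order 2: contract right-to-left; intermediate t2 has shape (16, 4, 4).
t2 = np.einsum('rbs,sc->rbc', B, C)
right = np.einsum('ar,rbc->abc', A, t2)

# Both orders produce the same tensor, but their costs differ in general.
assert np.allclose(left, right)

# numpy's own contraction-order search is loosely analogous in spirit to a
# contraction sequence search: it picks a low-cost pairwise ordering.
path, info = np.einsum_path('ar,rbs,sc->abc', A, B, C, optimize='optimal')
```

The search in the paper additionally accounts for how each contraction maps onto the hardware (inner- vs. outer-product mode), whereas `np.einsum_path` only minimizes an arithmetic-cost model.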
ISSN: 0018-9340, 1557-9956
DOI: 10.1109/TC.2024.3365936