TSCompiler: efficient compilation framework for dynamic-shape models

Bibliographic Details
Published in	Science China. Information sciences Vol. 67; no. 10; p. 200403
Main Authors	Luo, Xiang; Zhang, Chen; Geng, Chenbo; Yi, Yanzhi; Hu, Jiahui; Zhang, Renwei; Zhang, Zhen; Consolaro, Gianpietro; Yang, Fan; Lu, Tun; Gu, Ning; Shang, Li
Format	Journal Article
Language	English
Published	Beijing: Science China Press, 01.10.2024; Springer Nature B.V.
Subjects	Algorithms; Computation; Computer Science; Cost analysis; Deep learning; Efficiency; Graph theory; Information Systems and Communication Service; Large language models; Machine learning; Research Paper; Run time (computers); Schedules; State-of-the-art reviews; Static models; Tensors; autotuning; code generation; machine learning; tensor compilers; operator fusion; dynamic shape

Summary:	Today’s deep learning models face an increasing demand to handle dynamic-shape tensors and computation whose shape information remains unknown at compile time and varies in a nearly infinite range at runtime. This shape dynamism brings tremendous challenges for existing compilation pipelines designed for static models, which optimize tensor programs by relying on exact shape values. This paper presents TSCompiler, an end-to-end compilation framework for dynamic-shape models. TSCompiler first proposes a symbolic shape propagation algorithm to recover symbolic shape information at compile time and enable subsequent optimizations. TSCompiler then partitions the shape-annotated computation graph into multiple subgraphs and fine-tunes the backbone operators of each subgraph within a hardware-aligned search space to find a collection of high-performance schedules. Based on dependence analysis, TSCompiler can propagate the explored backbone schedule to other fusion groups within the same subgraph to generate a set of parameterized tensor programs for fused cases. At runtime, TSCompiler uses an occupancy-targeted cost model to select from the pre-compiled tensor programs for varied tensor shapes. Extensive evaluations show that TSCompiler can achieve state-of-the-art speedups for dynamic-shape models. For example, it improves kernel efficiency by up to 3.97× on NVIDIA RTX3090 and 10.30× on NVIDIA A100, and achieves up to five orders of magnitude speedup in end-to-end latency.
ISSN:	1674-733X 1869-1919
DOI:	10.1007/s11432-024-4071-6
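
Illustrative sketch (not from the paper): the summary describes two steps that a small example may make more concrete, namely propagating symbolic shapes at compile time and picking a pre-compiled program once the real shape is known at runtime. The Python below is a minimal sketch under stated assumptions. All names (matmul_shape, bind, wave_fill, select_tile) are hypothetical, and the wave-fill score is a simple stand-in heuristic, not TSCompiler's occupancy-targeted cost model.

```python
from math import ceil
from typing import Dict, List, Tuple, Union

# Assumption: a dimension is either a concrete int or a symbol name such as "seq".
Dim = Union[int, str]
Shape = Tuple[Dim, ...]


def matmul_shape(a: Shape, b: Shape) -> Shape:
    """Propagate shapes through a 2-D matmul: (m, k) @ (k, n) -> (m, n)."""
    (m, k1), (k2, n) = a, b
    if k1 != k2:
        raise ValueError(f"contraction dims differ: {k1} vs {k2}")
    return (m, n)


def bind(shape: Shape, env: Dict[str, int]) -> Tuple[int, ...]:
    """Replace each symbol with its runtime value; ints pass through unchanged."""
    return tuple(env[d] if isinstance(d, str) else d for d in shape)


def wave_fill(m: int, n: int, tile: Tuple[int, int], num_sms: int) -> float:
    """Stand-in score: fraction of SM slots used if one block runs per SM per wave."""
    tm, tn = tile
    blocks = ceil(m / tm) * ceil(n / tn)
    waves = ceil(blocks / num_sms)
    return blocks / (waves * num_sms)


def select_tile(m: int, n: int, tiles: List[Tuple[int, int]], num_sms: int) -> Tuple[int, int]:
    """Pick the pre-compiled tile configuration with the highest stand-in score."""
    return max(tiles, key=lambda t: wave_fill(m, n, t, num_sms))


# Compile time: the sequence length is unknown, so it stays symbolic.
out_shape = matmul_shape(("seq", 768), (768, 3072))

# Runtime: the symbol is bound and a tile configuration is selected.
m, n = bind(out_shape, {"seq": 128})
best = select_tile(m, n, [(128, 128), (64, 64), (32, 32)], num_sms=82)  # RTX 3090 has 82 SMs
print(out_shape, (m, n), best)
```

A realistic cost model would also weigh per-tile efficiency (data reuse, register and shared-memory pressure), since a score based only on how fully the SMs are filled tends to favor small tiles.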