Compressed Vision Transformer for Scene Text Recognition

Bibliographic Details
Published in: 2024 7th International Conference on Algorithms, Computing and Artificial Intelligence (ACAI), pp. 01 - 05
Main Authors: Ren, Jinbiao; Deng, Tao; Huang, Yanlin; Qu, Da; Su, Jianqiu; Li, Bingen
Format: Conference Proceeding
Language: English
Published: IEEE, 20.12.2024
DOI: 10.1109/ACAI63924.2024.10899477

Summary: With the advancement of scene text recognition (STR) and deep learning, an increasing number of models have been proposed and applied to scene text recognition tasks. However, deploying these powerful yet computationally intensive models on resource-constrained devices is challenging. Model pruning is one of the most effective methods for compressing and accelerating such models: it reduces the parameter count and computational load by removing less critical parameters or structures. In ViT models, each parameter influences its neighboring parameters locally. Therefore, rather than pruning solely based on parameter magnitude, we propose selecting parameters for removal based on their local influence. By calculating the combined impact of each parameter together with its neighbors, we identify and prune those with minimal overall influence on the model, achieving compression and acceleration without significantly compromising accuracy. Our pruning method substantially reduces parameter count and computational cost while preserving accuracy, as demonstrated across seven test datasets and in comparison with more than five similar STR algorithms.
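The summary describes scoring each weight by the combined magnitude of the weight and its neighbors, then pruning the lowest-scoring entries. The sketch below illustrates that idea on a 2-D weight matrix; the uniform square neighborhood window, the sum-of-absolute-values score, and the `prune_by_local_influence` helper are assumptions for illustration, since the record does not give the paper's exact formulation.

```python
import numpy as np

def local_influence_scores(weights: np.ndarray, radius: int = 1) -> np.ndarray:
    """Score each weight by the summed |magnitude| of itself and its
    neighbors within a (2*radius+1) x (2*radius+1) window.
    NOTE: the window shape and score are illustrative assumptions."""
    h, w = weights.shape
    padded = np.pad(np.abs(weights), radius, mode="constant")
    scores = np.zeros((h, w), dtype=float)
    for di in range(-radius, radius + 1):
        for dj in range(-radius, radius + 1):
            scores += padded[radius + di : radius + di + h,
                             radius + dj : radius + dj + w]
    return scores

def prune_by_local_influence(weights: np.ndarray, sparsity: float = 0.5,
                             radius: int = 1):
    """Zero out roughly `sparsity` of the weights with the lowest
    local-influence score; returns the pruned weights and the kept-mask."""
    scores = local_influence_scores(weights, radius)
    k = int(sparsity * weights.size)
    threshold = np.partition(scores.ravel(), k)[k]  # k-th smallest score
    mask = scores >= threshold
    return weights * mask, mask
```

Under this scoring, a large isolated weight surrounded by near-zero neighbors can be pruned while a moderate weight inside a high-magnitude region is kept, which is the stated contrast with pure magnitude pruning.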