Siamese visual tracking combining granular level multi-scale features and global information

Bibliographic Details
Published in	Knowledge-based systems Vol. 252; p. 109435
Main Authors	Liang, Wei; Ding, Derui; Wei, Guoliang
Format	Journal Article
Language	English
Published	Elsevier B.V. 27.09.2022
Subjects	Multi-scale feature; Self attention; Siamese network; Transformer; Visual tracking
Summary:	Despite the great success achieved in visual tracking, most trackers still struggle with scenes in which the target undergoes large changes in scale or is surrounded by similar objects. First, existing methods cannot efficiently extract multi-scale features. Second, convolutional neural networks focus primarily on local characteristics and easily ignore global characteristics, which are essential for visual tracking. Third, the recently popular tracking methods based on Siamese-like networks match the images of the two branches through simple cross-correlation operations, which cannot effectively establish the connection between them. An improved Siamese tracking network, called GSiamMS, is proposed to address these challenges by integrating Res2Net blocks and transformer modules. Within this network, a feature extraction module based on Res2Net blocks obtains multi-scale information at the granular level without relying on multi-layer outputs. A cross-attention mechanism then learns the connection between template features and search features, while a self-attention mechanism focusing on global information establishes long-range dependencies between the object and the background. Finally, experiments on the visual tracking benchmarks TrackingNet, GOT-10k, LaSOT, NFS, UAV123, and TNL2K verify that the developed method, running at 38 fps, achieves superior performance compared with several state-of-the-art methods.
• An improved Siamese tracking network is constructed via Res2Net and transformers.
• Multi-scale information from granular levels is used via feature extraction modules.
• A cross-attention module is used to learn the connection of different features.
• A self-attention module is employed to establish long-range dependencies.
• Empirical studies on public datasets demonstrate the effectiveness of our models.
ISSN:	0950-7051 1872-7409
DOI:	10.1016/j.knosys.2022.109435
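
Illustrative note: the summary describes self-attention over the search features and cross-attention between template and search features. The sketch below is not the GSiamMS implementation; it is a minimal NumPy illustration of single-head scaled dot-product attention, assuming no learned projections, no positional encoding, and hypothetical feature-map sizes (a 4x4 template and an 8x8 search region with 32 channels, filled with random values). The names `attention`, `template`, and `search` are chosen here for illustration only.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def attention(queries, keys, values):
    """Single-head scaled dot-product attention (assumption: no learned projections)."""
    dim = queries.shape[-1]
    weights = softmax(queries @ keys.T / np.sqrt(dim), axis=-1)
    return weights @ values, weights


rng = np.random.default_rng(0)
# Hypothetical flattened feature maps: one row per spatial position.
template = rng.standard_normal((16, 32))  # 4x4 template, 32 channels
search = rng.standard_normal((64, 32))    # 8x8 search region, 32 channels

# Self-attention: every search position attends to every other search position,
# so each output row mixes information from the whole search region.
search_global, self_weights = attention(search, search, search)

# Cross-attention: search positions query the template features, giving each
# search position a learned-style weighting over all template positions.
fused, cross_weights = attention(search, template, template)

print(search_global.shape, fused.shape, cross_weights.shape)
```

Unlike a single cross-correlation map, the cross-attention weights form a full distribution over template positions for each search position, which is one way to read the summary's remark about establishing the connection between the two branches.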