Triple-level relationship enhanced transformer for image captioning
Published in | Multimedia Systems, Vol. 29, No. 4, pp. 1955-1966
Main Authors | , , ,
Format | Journal Article
Language | English
Published | Berlin/Heidelberg: Springer Berlin Heidelberg, 01.08.2023 (Springer Nature B.V.)
Summary: Region features and grid features are widely used in image captioning. Because they are usually extracted by different networks, fusing them for captioning requires establishing connections between them. However, these connections often rely on simple coordinates, which causes the generated captions to lack precise expression of visual relationships. Meanwhile, scene graph features contain object relationship information; through multi-layer computation, the extracted relationship information is higher-level and more complete, which can compensate for the shortcomings of region and grid features to a certain extent. Therefore, this paper proposes a Triple-Level Relationship Enhanced Transformer (TRET) that processes the three kinds of features in parallel. TRET obtains and combines object relationship features at different levels so that the features complement one another. Specifically, we obtain high-level object-relational information via Graph Based Attention and fuse low-level relational information with high-level object-relational information via Cross Relationship Enhanced Attention, so as to better align the visual and textual modalities. To validate our model, we conduct comprehensive experiments on the MS-COCO dataset. The results indicate that our method outperforms existing state-of-the-art methods and more effectively describes object relationships in the generated captions.
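The abstract outlines the architecture only at a high level: two low-level visual streams (region and grid features), a scene-graph stream processed by Graph Based Attention, and a Cross Relationship Enhanced Attention step that injects the high-level relationship information into the low-level streams. The paper's actual layer definitions are not given here, so the following PyTorch sketch is an illustration under assumptions: the module names `GraphBasedAttention`, `CrossRelationshipEnhancedAttention`, and `TripleLevelEncoderSketch`, the feature dimensions, and the concatenation-based fusion are all hypothetical stand-ins, not the authors' implementation.

```python
# Illustrative sketch only: the abstract does not specify layer definitions, so the
# module names, dimensions, and fusion scheme below are assumptions for demonstration.
import torch
import torch.nn as nn


class GraphBasedAttention(nn.Module):
    """Hypothetical stand-in: self-attention over scene-graph node embeddings,
    modeling the 'high-level object relationship' stream described in the abstract."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, graph_feats):
        out, _ = self.attn(graph_feats, graph_feats, graph_feats)
        return self.norm(graph_feats + out)


class CrossRelationshipEnhancedAttention(nn.Module):
    """Hypothetical stand-in: low-level visual features (region or grid) attend to
    the high-level relationship features produced by the graph stream."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, visual_feats, graph_feats):
        out, _ = self.attn(visual_feats, graph_feats, graph_feats)
        return self.norm(visual_feats + out)


class TripleLevelEncoderSketch(nn.Module):
    """Processes region, grid, and scene-graph features in parallel, then fuses them
    into a single memory that a caption decoder could attend to."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.region_enc = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.grid_enc = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.graph_attn = GraphBasedAttention(d_model, n_heads)
        self.region_fuse = CrossRelationshipEnhancedAttention(d_model, n_heads)
        self.grid_fuse = CrossRelationshipEnhancedAttention(d_model, n_heads)

    def forward(self, region_feats, grid_feats, graph_feats):
        r = self.region_enc(region_feats)      # low-level region stream
        g = self.grid_enc(grid_feats)          # low-level grid stream
        rel = self.graph_attn(graph_feats)     # high-level relationship stream
        r = self.region_fuse(r, rel)           # enrich regions with relationship info
        g = self.grid_fuse(g, rel)             # enrich grid cells with relationship info
        return torch.cat([r, g], dim=1)        # fused encoder memory


if __name__ == "__main__":
    enc = TripleLevelEncoderSketch()
    regions = torch.randn(2, 36, 512)   # e.g. 36 detected regions per image
    grids = torch.randn(2, 49, 512)     # e.g. 7x7 grid features per image
    graph = torch.randn(2, 20, 512)     # e.g. 20 scene-graph node embeddings
    memory = enc(regions, grids, graph)
    print(memory.shape)                 # torch.Size([2, 85, 512])
```

The key design point the abstract emphasizes is parallelism plus fusion: the three feature types are encoded independently and the relationship stream is injected into both visual streams, rather than flattening everything into one sequence up front.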
ISSN | 0942-4962; 1432-1882
DOI | 10.1007/s00530-023-01073-2