Triple-level relationship enhanced transformer for image captioning
Published in | Multimedia Systems, Vol. 29, No. 4, pp. 1955-1966
Main Authors | , , ,
Format | Journal Article
Language | English
Published | Berlin/Heidelberg: Springer Berlin Heidelberg, 01.08.2023 (Springer Nature B.V.)
Summary: Region features and grid features are widely used in image captioning. Because they are usually extracted by different networks, fusing them for captioning requires establishing connections between them. However, these connections often rely on simple coordinates, which causes the generated captions to lack precise expression of visual relationships. Meanwhile, scene graph features contain object relationship information; through multi-layer computation, the extracted relationship information is higher-level and more complete, which can compensate for the shortcomings of region and grid features to a certain extent. Therefore, this paper proposes a Triple-Level Relationship Enhanced Transformer (TRET) that processes the three kinds of features in parallel. TRET obtains and combines object relationship features at different levels so that the features complement one another. Specifically, we obtain high-level object-relational information via Graph Based Attention and fuse low-level relational information with high-level object-relational information via Cross Relationship Enhanced Attention, so as to better align the visual and textual modalities. To validate our model, we conduct comprehensive experiments on the MS-COCO dataset. The results indicate that our method outperforms existing state-of-the-art methods and more effectively describes object relationships in the generated captions.
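The abstract outlines the architecture only at a high level: two low-level visual streams (region and grid features), a scene-graph stream processed by Graph Based Attention, and a Cross Relationship Enhanced Attention step that injects the high-level relationship information into the low-level streams. The paper's actual layer definitions are not given here, so the following PyTorch sketch is an illustration under assumptions: the module names `GraphBasedAttention`, `CrossRelationshipEnhancedAttention`, and `TripleLevelEncoderSketch`, the feature dimensions, and the concatenation-based fusion are all hypothetical stand-ins, not the authors' implementation.

```python
# Illustrative sketch only: the abstract does not specify layer definitions, so the
# module names, dimensions, and fusion scheme below are assumptions for demonstration.
import torch
import torch.nn as nn


class GraphBasedAttention(nn.Module):
    """Hypothetical stand-in: self-attention over scene-graph node embeddings,
    modeling the 'high-level object relationship' stream described in the abstract."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, graph_feats):
        out, _ = self.attn(graph_feats, graph_feats, graph_feats)
        return self.norm(graph_feats + out)


class CrossRelationshipEnhancedAttention(nn.Module):
    """Hypothetical stand-in: low-level visual features (region or grid) attend to
    the high-level relationship features produced by the graph stream."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, visual_feats, graph_feats):
        out, _ = self.attn(visual_feats, graph_feats, graph_feats)
        return self.norm(visual_feats + out)


class TripleLevelEncoderSketch(nn.Module):
    """Processes region, grid, and scene-graph features in parallel, then fuses them
    into a single memory that a caption decoder could attend to."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.region_enc = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.grid_enc = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.graph_attn = GraphBasedAttention(d_model, n_heads)
        self.region_fuse = CrossRelationshipEnhancedAttention(d_model, n_heads)
        self.grid_fuse = CrossRelationshipEnhancedAttention(d_model, n_heads)

    def forward(self, region_feats, grid_feats, graph_feats):
        r = self.region_enc(region_feats)      # low-level region stream
        g = self.grid_enc(grid_feats)          # low-level grid stream
        rel = self.graph_attn(graph_feats)     # high-level relationship stream
        r = self.region_fuse(r, rel)           # enrich regions with relationship info
        g = self.grid_fuse(g, rel)             # enrich grid cells with relationship info
        return torch.cat([r, g], dim=1)        # fused encoder memory


if __name__ == "__main__":
    enc = TripleLevelEncoderSketch()
    regions = torch.randn(2, 36, 512)   # e.g. 36 detected regions per image
    grids = torch.randn(2, 49, 512)     # e.g. 7x7 grid features per image
    graph = torch.randn(2, 20, 512)     # e.g. 20 scene-graph node embeddings
    memory = enc(regions, grids, graph)
    print(memory.shape)                 # torch.Size([2, 85, 512])
```

The key design point the abstract emphasizes is parallelism plus fusion: the three feature types are encoded independently and the relationship stream is injected into both visual streams, rather than flattening everything into one sequence up front.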
ISSN | 0942-4962; 1432-1882
DOI | 10.1007/s00530-023-01073-2