Fine-Grained Features Alignment and Fusion for Text-Video Cross-Modal Retrieval


Bibliographic Details
Published in: ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3325 - 3329
Main Authors: Zhang, Shuili; Mu, Hongzhang; Li, Quangang; Xiao, Chenglong; Liu, Tingwen
Format: Conference Proceeding
Language: English
Published: IEEE, 14.04.2024
Summary: Text-video cross-modal retrieval is an increasingly prominent and challenging task that has garnered significant attention. Traditional models typically embed videos and texts into global vectors, aiming to capture the global features of these modalities. However, such models often fall short in capturing fine-grained semantic details, and relying solely on global features proves insufficient to address this challenge. Hence, there is a pressing need to bridge the gap between modalities by incorporating fine-grained features. In light of this, we propose a highly efficient model designed to capture the fine-grained features of videos and texts through question-answer semantic alignment, object alignment, and text-video feature fusion. For texts, our model incorporates entity information and part-of-speech information (adjectives, nouns, and verbs); for videos, the identification of objects plays a crucial role in facilitating text-video retrieval. Our model undergoes extensive training on the WebVid and CC3M datasets, yielding unequivocal evidence of its superior performance over baseline models. It excels particularly in zero-shot text-video cross-modal retrieval tasks while offering substantial reductions in required computational resources.
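To make the text-side pipeline concrete, the following is a minimal illustrative sketch, not the authors' implementation, of extracting the entity and part-of-speech information (adjectives, nouns, and verbs) that the summary describes for captions. spaCy and its en_core_web_sm model are assumed tooling choices here.

    # Illustrative sketch of text-side fine-grained feature extraction
    # (entities plus adjective/noun/verb tokens), as described in the
    # abstract. Assumption: spaCy with the small English model installed
    # via `python -m spacy download en_core_web_sm`.
    import spacy

    nlp = spacy.load("en_core_web_sm")

    def fine_grained_text_features(caption: str) -> dict:
        """Extract entity mentions and ADJ/NOUN/VERB tokens from a caption."""
        doc = nlp(caption)
        return {
            "entities": [(ent.text, ent.label_) for ent in doc.ents],
            "adjectives": [t.text for t in doc if t.pos_ == "ADJ"],
            "nouns": [t.text for t in doc if t.pos_ in ("NOUN", "PROPN")],
            "verbs": [t.text for t in doc if t.pos_ == "VERB"],
        }

    # Example: tokens such as "red" (ADJ), "dog"/"ball" (NOUN), and
    # "chases" (VERB) would be candidates for alignment against objects
    # detected in video frames.
    print(fine_grained_text_features("A young dog chases a red ball in the park"))

In such a scheme, the extracted nouns and entities would serve as the text-side counterparts for the object-alignment step, while the full caption embedding would feed the text-video feature fusion.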
ISSN: 2379-190X
DOI: 10.1109/ICASSP48485.2024.10446511