Entity Dependency Learning Network With Relation Prediction for Video Visual Relation Detection

Bibliographic Details
Published in: IEEE Transactions on Circuits and Systems for Video Technology, Vol. 34, No. 12, pp. 12425-12436
Main Authors: Zhang, Guoguang; Tang, Yepeng; Zhang, Chunjie; Zheng, Xiaolong; Zhao, Yao
Format: Journal Article
Language: English
Published: New York: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 01.12.2024
Summary: Video Visual Relation Detection (VidVRD) is a pivotal task in the field of video analysis. It involves detecting object trajectories in videos, predicting potential dynamic relations between these trajectories, and ultimately representing these relationships in the form of <subject, predicate, object> triplets. Correct prediction of relations is vital for VidVRD. Existing methods mostly adopt a simple fusion of the visual and language features of entity trajectories as the feature representation for relation predicates. However, these methods do not take into account the dependency information between the relation predicate and the subject and object within the triplet. To address this issue, we propose the entity dependency learning network (EDLN), which captures the dependency information between relation predicates and subjects, objects, and subject-object pairs, and adaptively integrates this dependency information into the feature representation of relation predicates. Additionally, to effectively model the features of the relations between various entity pairs, we introduce a fully convolutional encoding approach as a substitute for the self-attention mechanism in the Transformer during the context encoding phase for relation predicate features. Extensive experiments on two public datasets demonstrate the effectiveness of the proposed EDLN.
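The abstract describes two mechanisms at a high level: fusing subject/object dependency cues into the predicate feature, and replacing Transformer self-attention with fully convolutional context encoding. The PyTorch-style sketch below is a hypothetical illustration of those two ideas under stated assumptions (the module names, the gating scheme, and the kernel size are mine, not the authors' implementation).

```python
# Hypothetical sketch, not the authors' code: (1) adaptively gate subject,
# object, and subject-object pair features into the predicate representation;
# (2) encode context with a 1-D fully convolutional block instead of self-attention.
import torch
import torch.nn as nn

class DependencyFusion(nn.Module):
    """Mixes subject, object, and pair features into the predicate feature
    with learned, input-dependent gates (assumed fusion scheme)."""
    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.ModuleDict({k: nn.Linear(dim, dim) for k in ("subj", "obj", "pair")})
        self.gate = nn.Linear(4 * dim, 3)  # one gate per dependency source

    def forward(self, pred, subj, obj, pair):
        gates = torch.softmax(self.gate(torch.cat([pred, subj, obj, pair], dim=-1)), dim=-1)
        fused = pred
        for i, (k, v) in enumerate({"subj": subj, "obj": obj, "pair": pair}.items()):
            fused = fused + gates[..., i:i + 1] * self.proj[k](v)
        return fused

class ConvContextEncoder(nn.Module):
    """1-D fully convolutional encoder over a sequence of predicate features,
    standing in for a Transformer self-attention block."""
    def __init__(self, dim: int, kernel_size: int = 3):
        super().__init__()
        pad = kernel_size // 2
        self.conv = nn.Sequential(
            nn.Conv1d(dim, dim, kernel_size, padding=pad),
            nn.ReLU(),
            nn.Conv1d(dim, dim, kernel_size, padding=pad),
        )
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        # x: (batch, seq_len, dim); Conv1d expects (batch, dim, seq_len)
        y = self.conv(x.transpose(1, 2)).transpose(1, 2)
        return self.norm(x + y)  # residual connection

# Toy usage: 8 candidate relation instances with 256-d features.
dim = 256
pred, subj, obj, pair = (torch.randn(8, dim) for _ in range(4))
fused = DependencyFusion(dim)(pred, subj, obj, pair)   # (8, 256)
context = ConvContextEncoder(dim)(fused.unsqueeze(0))  # (1, 8, 256)
```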
ISSN: 1051-8215, 1558-2205
DOI: 10.1109/TCSVT.2024.3437437