Transformer-Based Two-Stream Network for Global and Local Motion Estimation


Bibliographic Details
Published in: 2023 IEEE 4th International Conference on Pattern Recognition and Machine Learning (PRML), pp. 328-334
Main Authors: Zheng, Yihao; Li, Zun; Wang, Zhuming; Wu, Lifang
Format: Conference Proceeding
Language: English
Published: IEEE, 04.08.2023
DOI: 10.1109/PRML59573.2023.10348241

Summary: Motion estimation in videos primarily concerns global and local motion, which originate from different subjects but are mixed together in video frames. In most scenarios, such as action recognition, the global and local motion must be estimated separately to obtain an accurate motion representation. Because ground-truth labels are lacking, estimating global and local motion simultaneously is challenging. In this work, we address these issues with an end-to-end two-stream network for global and local motion estimation. The network uses the mixed motion as supervision, employs a Transformer-based attention mechanism, and adopts a two-stage training strategy so that the two motion estimates mutually enhance each other during training. Additionally, we introduce a motion-based feature decoder for the global stream and a SIR mask that removes scene-irrelevant regions for the local stream. We verify the effectiveness of our method on the deep homography estimation dataset DHE, the action recognition dataset UCF-101, and the group activity recognition dataset NCAA. Results demonstrate improved performance over previous methods in regular scenes and on recognition tasks.
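The two-stream idea described in the abstract (a global stream attending over the whole frame, a local stream restricted by a scene-irrelevant-region mask, and the mixed motion serving as the supervision target) can be sketched in a minimal NumPy form. Everything here is illustrative: the function names, feature shapes, and the additive global-plus-local mixing assumption are hypothetical, not the paper's actual architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # scaled dot-product attention, the core Transformer operation
    d = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d)) @ v

def two_stream_estimate(features, sir_mask):
    # global stream: self-attention over all spatial tokens
    g = attention(features, features, features)
    # local stream: zero out scene-irrelevant tokens first (hypothetical
    # use of the SIR mask; the paper's masking may differ)
    local_feats = features * sir_mask[:, None]
    l = attention(local_feats, local_feats, local_feats)
    # assumed mixing model: the observed mixed motion is treated as the
    # sum of the two streams, which supplies the supervision signal
    mixed = g + l
    return g, l, mixed

rng = np.random.default_rng(0)
feats = rng.standard_normal((8, 16))          # 8 spatial tokens, 16-dim features
mask = (rng.random(8) > 0.3).astype(float)    # 1 = scene-relevant token
g, l, mixed = two_stream_estimate(feats, mask)
```

In training, `mixed` would be compared against the observed mixed motion, so both streams receive gradients without per-stream ground truth, which is one plausible reading of the mixed-motion supervision described above.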