Transformer-Based Two-Stream Network for Global and Local Motion Estimation


Bibliographic Details
Published in: 2023 IEEE 4th International Conference on Pattern Recognition and Machine Learning (PRML), pp. 328-334
Main Authors: Zheng, Yihao; Li, Zun; Wang, Zhuming; Wu, Lifang
Format: Conference Proceeding
Language: English
Published: IEEE, 04.08.2023
DOI: 10.1109/PRML59573.2023.10348241

Summary: Motion estimation in videos primarily concerns global and local motion, which originate from different subjects but are mixed together in video frames. In most scenarios, such as action recognition, the global and local motion must be estimated separately to obtain an accurate motion representation. Because ground-truth labels are lacking, estimating global and local motion simultaneously is challenging. In this work, we address these issues with an end-to-end two-stream network for global and local motion estimation. The network uses the mixed motion as supervision, employs a Transformer-based attention mechanism, and adopts a two-stage training strategy so that the two motion estimates mutually enhance each other during training. Additionally, we introduce a motion-based feature decoder for the global stream and a SIR mask that removes scene-irrelevant regions for the local stream. We verify the effectiveness of our method on the deep homography estimation dataset DHE, the action recognition dataset UCF-101, and the group activity recognition dataset NCAA. Results demonstrate improved performance over previous methods in regular scenes and on recognition tasks.
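The two-stream idea described in the abstract (a global stream attending over the whole frame, a local stream restricted by a scene-irrelevant-region mask, and the mixed motion serving as the supervision target) can be sketched in a minimal NumPy form. Everything here is illustrative: the function names, feature shapes, and the additive global-plus-local mixing assumption are hypothetical, not the paper's actual architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # scaled dot-product attention, the core Transformer operation
    d = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d)) @ v

def two_stream_estimate(features, sir_mask):
    # global stream: self-attention over all spatial tokens
    g = attention(features, features, features)
    # local stream: zero out scene-irrelevant tokens first (hypothetical
    # use of the SIR mask; the paper's masking may differ)
    local_feats = features * sir_mask[:, None]
    l = attention(local_feats, local_feats, local_feats)
    # assumed mixing model: the observed mixed motion is treated as the
    # sum of the two streams, which supplies the supervision signal
    mixed = g + l
    return g, l, mixed

rng = np.random.default_rng(0)
feats = rng.standard_normal((8, 16))          # 8 spatial tokens, 16-dim features
mask = (rng.random(8) > 0.3).astype(float)    # 1 = scene-relevant token
g, l, mixed = two_stream_estimate(feats, mask)
```

In training, `mixed` would be compared against the observed mixed motion, so both streams receive gradients without per-stream ground truth, which is one plausible reading of the mixed-motion supervision described above.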