Contextual Explainable Video Representation: Human Perception-based Understanding
Video understanding is a growing field and a subject of intense research, which includes many interesting tasks to understanding both spatial and temporal information, e.g., action detection, action recognition, video captioning, video retrieval. One of the most challenging problems in video underst...
Saved in:
Main Authors | , , , , , |
---|---|
Format | Journal Article |
Language | English |
Published |
12.12.2022
|
Subjects | |
Online Access | Get full text |
DOI | 10.48550/arxiv.2212.06206 |
Cover
Summary: | Video understanding is a growing field and a subject of intense research,
which includes many interesting tasks to understanding both spatial and
temporal information, e.g., action detection, action recognition, video
captioning, video retrieval. One of the most challenging problems in video
understanding is dealing with feature extraction, i.e. extract contextual
visual representation from given untrimmed video due to the long and
complicated temporal structure of unconstrained videos. Different from existing
approaches, which apply a pre-trained backbone network as a black-box to
extract visual representation, our approach aims to extract the most contextual
information with an explainable mechanism. As we observed, humans typically
perceive a video through the interactions between three main factors, i.e., the
actors, the relevant objects, and the surrounding environment. Therefore, it is
very crucial to design a contextual explainable video representation extraction
that can capture each of such factors and model the relationships between them.
In this paper, we discuss approaches, that incorporate the human perception
process into modeling actors, objects, and the environment. We choose video
paragraph captioning and temporal action detection to illustrate the
effectiveness of human perception based-contextual representation in video
understanding. Source code is publicly available at
https://github.com/UARK-AICV/Video_Representation. |
---|---|
DOI: | 10.48550/arxiv.2212.06206 |