Class-attention Video Transformer for Engagement Intensity Prediction

Bibliographic Details
Published in: arXiv.org
Main Authors: Ai, Xusheng; Sheng, Victor S; Li, Chunhua; Cui, Zhiming
Format: Paper
Language: English
Published: Ithaca: Cornell University Library, arXiv.org, 10.11.2022
Summary: In order to deal with variant-length long videos, prior works extract multi-modal features and fuse them to predict students' engagement intensity. In this paper, we present a new end-to-end method Class Attention in Video Transformer (CavT), which involves a single vector to process class embedding and to uniformly perform end-to-end learning on variant-length long videos and fixed-length short videos. Furthermore, to address the lack of sufficient samples, we propose a binary-order representatives sampling method (BorS) to add multiple video sequences of each video to augment the training set. BorS+CavT not only achieves the state-of-the-art MSE (0.0495) on the EmotiW-EP dataset, but also obtains the state-of-the-art MSE (0.0377) on the DAiSEE dataset. The code and models have been made publicly available at https://github.com/mountainai/cavt.
ISSN: 2331-8422
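
The summary describes class attention as a single learnable vector (a class token) that queries the frame embeddings of a video, which lets one head summarize sequences of any length. Below is a minimal sketch of that idea in the CaiT style of class attention; it is not the authors' released implementation (see the GitHub link above for that), and all names (ClassAttention, dim, num_heads) and shapes are illustrative assumptions.

import torch
import torch.nn as nn

class ClassAttention(nn.Module):
    """One class-attention layer: only the class token issues queries,
    so attention cost is linear in the number of frame tokens."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.q = nn.Linear(dim, dim)        # queries: class token only
        self.kv = nn.Linear(dim, dim * 2)   # keys/values: class + frame tokens
        self.proj = nn.Linear(dim, dim)

    def forward(self, cls_token: torch.Tensor, frames: torch.Tensor) -> torch.Tensor:
        # cls_token: (B, 1, C); frames: (B, T, C) with T frames, any T
        B, T, C = frames.shape
        H = self.num_heads
        tokens = torch.cat([cls_token, frames], dim=1)                     # (B, 1+T, C)
        q = self.q(cls_token).reshape(B, 1, H, C // H).transpose(1, 2)     # (B, H, 1, C/H)
        k, v = self.kv(tokens).reshape(B, 1 + T, 2, H, C // H).permute(2, 0, 3, 1, 4)
        attn = (q @ k.transpose(-2, -1)) * self.scale                      # (B, H, 1, 1+T)
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, 1, C)                  # updated class token
        return self.proj(out)

# Usage sketch: the updated class token would feed a regression head that
# predicts engagement intensity; 30 is an arbitrary example frame count.
layer = ClassAttention(dim=256)
cls = torch.zeros(2, 1, 256)
frames = torch.randn(2, 30, 256)
print(layer(cls, frames).shape)  # torch.Size([2, 1, 256])

Because only the class token attends, the same layer handles variant-length long videos and fixed-length short clips without padding-dependent pooling, which matches the uniform end-to-end learning the summary claims. The BorS augmentation (sampling multiple representative sequences per video) is a separate data-side step not sketched here.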