Class-attention Video Transformer for Engagement Intensity Prediction

Bibliographic Details
Published in: arXiv.org
Main Authors: Ai, Xusheng; Sheng, Victor S; Li, Chunhua; Cui, Zhiming
Format: Paper
Language: English
Published: Ithaca: Cornell University Library, arXiv.org, 10.11.2022
Summary: In order to deal with variant-length long videos, prior works extract multi-modal features and fuse them to predict students' engagement intensity. In this paper, we present a new end-to-end method Class Attention in Video Transformer (CavT), which involves a single vector to process class embedding and to uniformly perform end-to-end learning on variant-length long videos and fixed-length short videos. Furthermore, to address the lack of sufficient samples, we propose a binary-order representatives sampling method (BorS) to add multiple video sequences of each video to augment the training set. BorS+CavT not only achieves the state-of-the-art MSE (0.0495) on the EmotiW-EP dataset, but also obtains the state-of-the-art MSE (0.0377) on the DAiSEE dataset. The code and models have been made publicly available at https://github.com/mountainai/cavt.
ISSN: 2331-8422
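
The summary describes class attention as a single learnable vector (a class token) that queries the frame embeddings of a video, which lets one head summarize sequences of any length. Below is a minimal sketch of that idea in the CaiT style of class attention; it is not the authors' released implementation (see the GitHub link above for that), and all names (ClassAttention, dim, num_heads) and shapes are illustrative assumptions.

import torch
import torch.nn as nn

class ClassAttention(nn.Module):
    """One class-attention layer: only the class token issues queries,
    so attention cost is linear in the number of frame tokens."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.q = nn.Linear(dim, dim)        # queries: class token only
        self.kv = nn.Linear(dim, dim * 2)   # keys/values: class + frame tokens
        self.proj = nn.Linear(dim, dim)

    def forward(self, cls_token: torch.Tensor, frames: torch.Tensor) -> torch.Tensor:
        # cls_token: (B, 1, C); frames: (B, T, C) with T frames, any T
        B, T, C = frames.shape
        H = self.num_heads
        tokens = torch.cat([cls_token, frames], dim=1)                     # (B, 1+T, C)
        q = self.q(cls_token).reshape(B, 1, H, C // H).transpose(1, 2)     # (B, H, 1, C/H)
        k, v = self.kv(tokens).reshape(B, 1 + T, 2, H, C // H).permute(2, 0, 3, 1, 4)
        attn = (q @ k.transpose(-2, -1)) * self.scale                      # (B, H, 1, 1+T)
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, 1, C)                  # updated class token
        return self.proj(out)

# Usage sketch: the updated class token would feed a regression head that
# predicts engagement intensity; 30 is an arbitrary example frame count.
layer = ClassAttention(dim=256)
cls = torch.zeros(2, 1, 256)
frames = torch.randn(2, 30, 256)
print(layer(cls, frames).shape)  # torch.Size([2, 1, 256])

Because only the class token attends, the same layer handles variant-length long videos and fixed-length short clips without padding-dependent pooling, which matches the uniform end-to-end learning the summary claims. The BorS augmentation (sampling multiple representative sequences per video) is a separate data-side step not sketched here.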