MMAct: A Large-Scale Dataset for Cross Modal Human Action Understanding

Bibliographic Details
Published in: Proceedings / IEEE International Conference on Computer Vision, pp. 8657-8666
Main Authors: Kong, Quan; Wu, Ziming; Deng, Ziwei; Klinkigt, Martin; Tong, Bin; Murakami, Tomokazu
Format: Conference Proceeding
Language: English
Published: IEEE, 01.10.2019
Summary: Unlike vision modalities, body-worn sensors and passive sensing can avoid the failures of action understanding caused by vision-related challenges, e.g., occlusion and appearance variation. However, no standard large-scale dataset exists that integrates different types of modalities across vision and sensors. To address the disadvantages of vision-based modalities and push towards multi/cross-modal action understanding, this paper introduces a new large-scale dataset recorded from 20 distinct subjects with seven different types of modalities: RGB videos, keypoints, acceleration, gyroscope, orientation, Wi-Fi, and pressure signals. The dataset consists of more than 36k video clips for 37 action classes covering a wide range of daily-life activities, such as desktop-related and check-in-based ones, in four distinct scenarios. On the basis of this dataset, we propose a novel multimodal distillation model with an attention mechanism to realize adaptive knowledge transfer from sensor-based modalities to vision-based modalities. The proposed model significantly improves action-recognition performance compared to models trained with only RGB information. The experimental results confirm the effectiveness of our model under cross-subject, -view, -scene, and -session evaluation criteria. We believe that this new large-scale multimodal dataset will contribute to the community of multimodal-based action understanding.
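The record does not reproduce the paper's architecture, but the core idea of attention-weighted distillation from sensor-based teachers to an RGB student can be sketched as follows. This is a minimal illustration, not the authors' implementation: the class and parameter names, the learnable per-teacher attention, the temperature, and the equal loss weighting are all assumptions.

```python
# Minimal sketch (not the MMAct authors' code): attention-weighted
# knowledge distillation from sensor-teacher logits to an RGB student.
# Module names, dimensions, and the temperature are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_CLASSES = 37  # MMAct defines 37 action classes

class AttentiveDistillationLoss(nn.Module):
    """Weights each sensor teacher's softened predictions by a learned
    attention score, then distills the fused distribution into the
    vision student alongside the usual cross-entropy loss."""
    def __init__(self, num_teachers: int, temperature: float = 4.0):
        super().__init__()
        self.temperature = temperature
        # One learnable score per teacher modality (illustrative choice;
        # the paper's attention mechanism may differ).
        self.attn = nn.Parameter(torch.zeros(num_teachers))

    def forward(self, student_logits, teacher_logits_list, labels):
        T = self.temperature
        weights = torch.softmax(self.attn, dim=0)  # attention over teachers
        # Fuse the teachers' softened distributions with attention weights.
        soft_target = sum(
            w * F.softmax(t_logits / T, dim=-1)
            for w, t_logits in zip(weights, teacher_logits_list)
        )
        kd = F.kl_div(
            F.log_softmax(student_logits / T, dim=-1),
            soft_target,
            reduction="batchmean",
        ) * (T * T)
        ce = F.cross_entropy(student_logits, labels)
        return ce + kd  # equal weighting is an assumption

# Usage with random tensors standing in for model outputs:
student = torch.randn(8, NUM_CLASSES)                       # RGB branch
teachers = [torch.randn(8, NUM_CLASSES) for _ in range(4)]  # sensor branches
labels = torch.randint(0, NUM_CLASSES, (8,))
loss = AttentiveDistillationLoss(num_teachers=4)(student, teachers, labels)
```

The attention over teachers is what makes the transfer adaptive: modalities whose predictions are more reliable for a given deployment receive larger weights, rather than every sensor contributing equally to the soft target.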
ISSN: 2380-7504
DOI: 10.1109/ICCV.2019.00875