ST-Xception: A Depthwise Separable Convolution Network for Military Sign Language Recognition

Bibliographic Details
Published in: Conference proceedings - IEEE International Conference on Systems, Man, and Cybernetics, pp. 3200-3205
Main Authors: Zhang, Yuhao; Liao, Jun; Ran, Mengyuan; Li, Xin; Wang, Shanshan; Liu, Li
Format: Conference Proceeding
Language: English
Published: IEEE, 11.10.2020

Summary: Military sign language is an important form of tactical communication, especially in restricted situations where distance or a requirement for silence precludes oral means. Unfortunately, when soldiers cannot see each other, the communication mode of tactical gestures is no longer effective, which may hinder military operations. Vision-based approaches have been at the forefront of the field of hand gesture recognition. However, specific datasets and models for the task of military sign language recognition are still lacking. In this paper, we collected a new first-person dataset named MSL, which contains 16 classes of 3,840 tactical gesture samples in battle scenarios, with more than 110,000 video frames performed by 10 subjects. Moreover, we present a novel deep network, called the ST-Xception architecture, built on depthwise separable convolutions to recognize such military sign language. By expanding the convolution filters and pooling kernels into 3D, our network can characterize the inherent spatio-temporal relationship of a given tactical hand gesture. In particular, we further reduce computational cost and relieve overfitting by replacing the fully connected layers with adaptive average pooling. Experimental results show that our model outperforms existing models both on our in-house MSL dataset and on two other benchmark datasets.
ISSN: 2577-1655
DOI: 10.1109/SMC42975.2020.9283407
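
The record contains no code, but the summary names two concrete techniques: depthwise separable convolutions expanded to 3D, and an adaptive-average-pooling head in place of fully connected layers. The PyTorch sketch below illustrates only those two ideas under stated assumptions; the class names (SeparableConv3d, TinySTNet), channel widths, clip dimensions, and layer count are illustrative and are not the authors' published ST-Xception architecture. The 16-way output matches the dataset's 16 gesture classes.

```python
# Hypothetical sketch of the two techniques named in the summary:
# a depthwise separable convolution expanded to 3D (spatio-temporal),
# and a classification head that uses adaptive average pooling instead
# of fully connected layers. Sizes and names are illustrative only.
import torch
import torch.nn as nn

class SeparableConv3d(nn.Module):
    """3D depthwise separable convolution: a per-channel (depthwise)
    3D convolution followed by a 1x1x1 pointwise convolution."""
    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1, padding=1):
        super().__init__()
        # groups=in_ch makes the convolution depthwise (one filter per channel)
        self.depthwise = nn.Conv3d(in_ch, in_ch, kernel_size,
                                   stride=stride, padding=padding,
                                   groups=in_ch, bias=False)
        # 1x1x1 pointwise convolution mixes information across channels
        self.pointwise = nn.Conv3d(in_ch, out_ch, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm3d(out_ch)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.bn(self.pointwise(self.depthwise(x))))

class TinySTNet(nn.Module):
    """Minimal spatio-temporal network: stacked 3D separable convolutions,
    then global adaptive average pooling in place of fully connected
    layers, ending in a single linear classifier."""
    def __init__(self, num_classes=16):  # 16 gesture classes in MSL
        super().__init__()
        self.features = nn.Sequential(
            SeparableConv3d(3, 32),
            nn.MaxPool3d(kernel_size=2),     # pools over time and space
            SeparableConv3d(32, 64),
            nn.MaxPool3d(kernel_size=2),
        )
        self.pool = nn.AdaptiveAvgPool3d(1)  # collapses T x H x W to 1x1x1
        self.classifier = nn.Linear(64, num_classes)

    def forward(self, x):                    # x: (N, 3, T, H, W)
        x = self.pool(self.features(x)).flatten(1)
        return self.classifier(x)

# Example: a batch of two 16-frame 112x112 RGB clips
clips = torch.randn(2, 3, 16, 112, 112)
logits = TinySTNet()(clips)                  # shape: (2, 16)
```

The sketch reflects why both choices cut cost: the depthwise/pointwise factorization replaces one dense 3D convolution with a per-channel step plus a cheap channel-mixing step, and global adaptive pooling removes the large fully connected layers entirely (leaving only a small linear classifier), which is consistent with the summary's claim of reduced computation and relieved overfitting.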