Linked motion image‐based dynamic hand gesture recognition

Bibliographic Details
Published in: Computer Animation and Virtual Worlds, Vol. 34, No. 6
Main Authors: Jain, Rahul; Karsh, Ram Kumar; Barbhuiya, Abul Abbas
Format: Journal Article
Language: English
Published: Hoboken, USA: John Wiley & Sons, Inc. (Wiley Subscription Services, Inc.), 01.11.2023
ISSN: 1546-4261, 1546-427X
DOI: 10.1002/cav.2137

Summary: Researchers have paid significant attention to dynamic images for hand gesture recognition. Dynamic images are gesture representation patterns that simultaneously capture spatial, temporal, and structural information from a video. Existing techniques for generating dynamic images provide low discriminability for gestures that follow the same trajectory in opposite directions, such as “swiping hand right” versus “swiping hand left,” and they also struggle with visually similar gestures such as “snap fingers” versus “dual fingers heart.” To address these issues, we propose an algorithm that converts a depth video into a single dynamic image called a linked motion image (LMI). The LMI is fed to a classifier consisting of an ensemble of three modified pretrained convolutional neural networks (CNNs). We conduct experiments on the large-scale multimodal EgoGesture dataset and the MSR Gesture 3D dataset. On EgoGesture, the proposed method achieves an accuracy of 92.91%, better than the state-of-the-art methods; on MSR Gesture 3D, it achieves 100% accuracy, outperforming the state-of-the-art methods. The work also reports the recognition accuracy and precision of each gesture. The experiments demonstrate the work’s economic efficiency by running in Kaggle, a web-based data science environment, rather than on a high-end GPU system.

The proposed LMI encodes a depth gesture video in a single image while preserving most of the motion information; the idea is simple but effective. The classifier is an ensemble of three modified pretrained CNNs: VGG19, ResNet101, and Xception. Each pretrained CNN is used for feature extraction, followed by a global average pooling (GAP) layer, a fully connected layer, and a softmax function. The three CNNs are ensembled using average fusion.
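The abstract does not spell out how the LMI itself is constructed, but it contrasts the LMI with existing dynamic-image techniques. As background only, here is a minimal NumPy sketch of approximate rank pooling (Bilen et al.), a standard way such single-image encodings are generated; this is the kind of baseline the paper improves upon, not the authors’ LMI algorithm, and the function name and normalization choices are illustrative assumptions.

```python
import numpy as np

def dynamic_image(frames: np.ndarray) -> np.ndarray:
    """Collapse a depth video of T frames into one image via approximate
    rank pooling: a weighted temporal sum with weights alpha_t = 2t - T - 1.

    frames: array of shape (T, H, W), depth values normalized to [0, 1].
    NOTE: illustrative baseline only; the paper's LMI differs and is
    designed to retain direction cues (e.g., swipe left vs. swipe right)
    that this weighting can blur.
    """
    T = frames.shape[0]
    alphas = 2.0 * np.arange(1, T + 1) - T - 1  # later frames weigh more
    di = np.tensordot(alphas, frames, axes=1)   # weighted sum over time -> (H, W)
    # Rescale to [0, 255] so the result can feed a pretrained CNN.
    di = (di - di.min()) / (np.ptp(di) + 1e-8) * 255.0
    return di.astype(np.uint8)
```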
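The classifier, as described above, is an average-fusion ensemble of three modified pretrained CNNs, each followed by global average pooling, a fully connected layer, and softmax. A minimal tf.keras sketch of that structure follows; the 224×224 input size, the class count, and the omission of per-backbone input preprocessing are assumptions for brevity, not details taken from the paper.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model, applications

NUM_CLASSES = 83  # assumption: EgoGesture defines 83 gesture classes

# A single LMI image is the input shared by all three branches.
inputs = layers.Input(shape=(224, 224, 3), name="lmi")  # size assumed

def branch(backbone):
    """One modified pretrained CNN: backbone -> GAP -> FC -> softmax.
    (Each backbone's own preprocess_input is omitted for brevity.)"""
    x = backbone(inputs)
    x = layers.GlobalAveragePooling2D()(x)
    return layers.Dense(NUM_CLASSES, activation="softmax")(x)

preds = [
    branch(applications.VGG19(weights="imagenet", include_top=False)),
    branch(applications.ResNet101(weights="imagenet", include_top=False)),
    branch(applications.Xception(weights="imagenet", include_top=False)),
]

# Average fusion: the ensemble prediction is the mean of the three
# branch softmax outputs.
outputs = layers.Average()(preds)
model = Model(inputs, outputs)
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```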