Linked motion image‐based dynamic hand gesture recognition

Bibliographic Details
Published in: Computer Animation and Virtual Worlds, Vol. 34, No. 6
Main Authors: Jain, Rahul; Karsh, Ram Kumar; Barbhuiya, Abul Abbas
Format: Journal Article
Language: English
Published: Hoboken, USA: John Wiley & Sons, Inc. (Wiley Subscription Services, Inc.), 01.11.2023
ISSN: 1546-4261, 1546-427X
DOI: 10.1002/cav.2137

Summary: Researchers have paid significant attention to dynamic images for hand gesture recognition. Dynamic images are gesture representation patterns that simultaneously capture spatial, temporal, and structural information from a video. Existing techniques for generating dynamic images provide low discriminability for gestures that follow the same trajectory in opposite directions, such as “swiping hand right” versus “swiping hand left,” and they also struggle with visually similar gestures such as “snap fingers” versus “dual fingers heart.” To address these issues, we propose an algorithm that converts a depth video into a single dynamic image called a linked motion image (LMI). The LMI is fed to a classifier consisting of an ensemble of three modified pretrained convolutional neural networks (CNNs). We conduct experiments on the large-scale multimodal EgoGesture dataset and the MSR Gesture 3D dataset. On EgoGesture, the proposed method achieves an accuracy of 92.91%, better than the state-of-the-art methods; on MSR Gesture 3D, it achieves 100% accuracy, outperforming the state-of-the-art methods. The work also reports the recognition accuracy and precision of each gesture. The experiments demonstrate the work’s economic efficiency by running in Kaggle, a web-based data science environment, rather than on a high-end GPU system.

The proposed LMI encodes a depth gesture video in a single image while preserving most of the motion information; the idea is simple but effective. The classifier is an ensemble of three modified pretrained CNNs: VGG19, ResNet101, and Xception. Each pretrained CNN is used for feature extraction, followed by a global average pooling (GAP) layer, a fully connected layer, and a softmax function. The three CNNs are ensembled using average fusion.
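The abstract does not spell out how the LMI itself is constructed, but it contrasts the LMI with existing dynamic-image techniques. As background only, here is a minimal NumPy sketch of approximate rank pooling (Bilen et al.), a standard way such single-image encodings are generated; this is the kind of baseline the paper improves upon, not the authors’ LMI algorithm, and the function name and normalization choices are illustrative assumptions.

```python
import numpy as np

def dynamic_image(frames: np.ndarray) -> np.ndarray:
    """Collapse a depth video of T frames into one image via approximate
    rank pooling: a weighted temporal sum with weights alpha_t = 2t - T - 1.

    frames: array of shape (T, H, W), depth values normalized to [0, 1].
    NOTE: illustrative baseline only; the paper's LMI differs and is
    designed to retain direction cues (e.g., swipe left vs. swipe right)
    that this weighting can blur.
    """
    T = frames.shape[0]
    alphas = 2.0 * np.arange(1, T + 1) - T - 1  # later frames weigh more
    di = np.tensordot(alphas, frames, axes=1)   # weighted sum over time -> (H, W)
    # Rescale to [0, 255] so the result can feed a pretrained CNN.
    di = (di - di.min()) / (np.ptp(di) + 1e-8) * 255.0
    return di.astype(np.uint8)
```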
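The classifier, as described above, is an average-fusion ensemble of three modified pretrained CNNs, each followed by global average pooling, a fully connected layer, and softmax. A minimal tf.keras sketch of that structure follows; the 224×224 input size, the class count, and the omission of per-backbone input preprocessing are assumptions for brevity, not details taken from the paper.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model, applications

NUM_CLASSES = 83  # assumption: EgoGesture defines 83 gesture classes

# A single LMI image is the input shared by all three branches.
inputs = layers.Input(shape=(224, 224, 3), name="lmi")  # size assumed

def branch(backbone):
    """One modified pretrained CNN: backbone -> GAP -> FC -> softmax.
    (Each backbone's own preprocess_input is omitted for brevity.)"""
    x = backbone(inputs)
    x = layers.GlobalAveragePooling2D()(x)
    return layers.Dense(NUM_CLASSES, activation="softmax")(x)

preds = [
    branch(applications.VGG19(weights="imagenet", include_top=False)),
    branch(applications.ResNet101(weights="imagenet", include_top=False)),
    branch(applications.Xception(weights="imagenet", include_top=False)),
]

# Average fusion: the ensemble prediction is the mean of the three
# branch softmax outputs.
outputs = layers.Average()(preds)
model = Model(inputs, outputs)
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```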