Linked motion image‐based dynamic hand gesture recognition
| Published in | Computer animation and virtual worlds, Vol. 34, no. 6 |
| ---|--- |
| Main Authors | , , |
| Format | Journal Article |
| Language | English |
| Published | Hoboken, USA: John Wiley & Sons, Inc; Wiley Subscription Services, Inc, 01.11.2023 |
| Subjects | |
| Online Access | Get full text |
| ISSN | 1546-4261; 1546-427X |
| DOI | 10.1002/cav.2137 |
Summary: | Dynamic images for hand gesture recognition have received significant attention from researchers. Dynamic images are gesture representation patterns that simultaneously capture spatial, temporal, and structural information from a video. Existing techniques for generating dynamic images provide low discriminability for gestures that follow the same trajectory in opposite directions, such as “swiping hand right” versus “swiping hand left,” and for visually similar gestures such as “Snap fingers” versus “Dual fingers heart.” To address these issues, we propose an algorithm that converts a depth video into a single dynamic image known as a linked motion image (LMI). The LMI is fed to a classifier consisting of an ensemble of three modified pretrained convolutional neural networks. We conduct experiments on the large-scale multimodal EgoGesture dataset and the MSR Gesture 3D dataset. On the EgoGesture dataset, the proposed method achieves an accuracy of 92.91%, which is better than state-of-the-art methods. On the MSR Gesture 3D dataset, the proposed method achieves 100% accuracy, which also outperforms the state-of-the-art methods. This work additionally reports the recognition accuracy and precision for each individual gesture. The experiments demonstrate the work's economic efficiency by running on a web-based data science environment, Kaggle, rather than on dedicated high-end GPU systems.
A dynamic image called the linked motion image (LMI) has been proposed to encode a depth gesture video. The idea is simple but effective: the gesture video is encoded in a single image while preserving most of the motion information. This work uses an ensemble of three modified pretrained CNNs: VGG19, ResNet101, and Xception. The modified pretrained CNNs are used for feature extraction, each followed by a global average pooling (GAP) layer, a fully connected layer, and a softmax function. The three CNNs are ensembled using average fusion.