Towards Smartphone-based 3D Hand Pose Reconstruction Using Acoustic Signals

Bibliographic Details
Published in: ACM Transactions on Sensor Networks, Vol. 20, No. 5, pp. 1–32
Main Authors: Wang, Shiyang; Wang, Xingchen; Jiang, Wenjun; Miao, Chenglin; Cao, Qiming; Wang, Haoyu; Sun, Ke; Xue, Hongfei; Su, Lu
Format: Journal Article
Language: English
Published: New York, NY: ACM, 26.08.2024

More Information
Summary: Accurately reconstructing 3D hand poses is a pivotal element for numerous Human-Computer Interaction applications. In this work, we propose SonicHand, the first smartphone-based 3D hand pose reconstruction system using purely inaudible acoustic signals. SonicHand incorporates signal processing techniques and a deep learning framework to address a series of challenges. First, it encodes the topological information of the hand skeleton as prior knowledge and utilizes a deep learning model to reconstruct hand poses realistically and smoothly. Second, the system employs adversarial training to enhance its ability to generalize to new environments and new users. Third, we adopt a hand tracking method based on channel impulse response estimation, which enables the system to handle the scenario where the hand performs gestures while moving arbitrarily as a whole. We conduct extensive experiments on a smartphone testbed to demonstrate the effectiveness and robustness of our system along various dimensions. The experiments involve 10 subjects performing up to 12 different hand gestures in three distinctive environments. When the phone is held in one of the user's hands, the proposed system can track joints with an average error of 18.64 mm.
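The summary mentions hand tracking based on channel impulse response (CIR) estimation from inaudible probe signals. As a rough illustration only (not the paper's actual implementation), the sketch below estimates a CIR by cross-correlating a simulated recording against a known inaudible chirp probe; the lag of each correlation peak corresponds to one acoustic path's delay. The sampling rate, frequency band, probe duration, and echo geometry are all assumptions chosen for the demo.

```python
import numpy as np

FS = 48_000              # typical smartphone audio sampling rate (assumption)
F0, F1 = 18_000, 22_000  # inaudible probe band in Hz (assumption)
DUR = 0.01               # 10 ms probe (assumption)

def chirp(fs, f0, f1, dur):
    """Linear up-chirp from f0 to f1 over dur seconds."""
    t = np.arange(int(fs * dur)) / fs
    return np.cos(2 * np.pi * (f0 * t + (f1 - f0) * t**2 / (2 * dur)))

def estimate_cir(received, probe):
    """Cross-correlate the recording with the known probe.

    Index k of the returned array is the correlation at lag k samples,
    so a peak at k indicates an echo path with k samples of delay.
    """
    corr = np.correlate(received, probe, mode="full")
    return corr[len(probe) - 1:]  # keep non-negative lags only

probe = chirp(FS, F0, F1, DUR)

# Simulate one echo: a 0.5 m round-trip path at the speed of sound.
c = 343.0
delay = int(round(0.5 / c * FS))          # path delay in samples
rx = np.zeros(len(probe) + delay)
rx[delay:] += 0.6 * probe                 # attenuated, delayed copy

cir = estimate_cir(rx, probe)
est = int(np.argmax(np.abs(cir)))
print(est == delay)                       # the peak lag recovers the path delay
```

In a real system the received signal would contain the direct path, static reflections, and the moving-hand echo superimposed; tracking the whole hand, as described in the summary, amounts to following how the relevant CIR peaks shift over successive probe frames.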
ISSN: 1550-4859
1550-4867
DOI: 10.1145/3677122