A multi-tasking model of speaker-keyword classification for keeping human in the loop of drone-assisted inspection

Bibliographic Details
Published in: Engineering Applications of Artificial Intelligence, Vol. 117, p. 105597
Main Authors: Li, Yu; Parsan, Anisha; Wang, Bill; Dong, Penghao; Yao, Shanshan; Qin, Ruwen
Format: Journal Article
Language: English
Published: Elsevier Ltd, 01.01.2023
Summary: Audio commands are a preferred communication medium to keep inspectors in the loop of civil infrastructure inspection performed by a semi-autonomous drone. To understand job-specific commands from a group of heterogeneous and dynamic inspectors, a model must be developed cost-effectively for the group and easily adapted when the group changes. This paper is motivated to build a multi-tasking deep learning model that possesses a Share–Split–Collaborate architecture. This architecture allows the two classification tasks to share the feature extractor and then split the subject-specific and keyword-specific features intertwined in the extracted features through feature projection and collaborative training. A base model for a group of five authorized subjects is trained and tested on the inspection keyword dataset collected by this study. The model achieved a mean accuracy of 95.3% or higher in classifying the keywords of any authorized inspector, and a mean accuracy of 99.2% in speaker classification. Owing to the richer keyword representations the model learns from the pooled training data, adapting the base model to a new inspector requires only a small amount of training data from that inspector, such as five utterances per keyword. Using the speaker classification scores for inspector verification achieves a success rate of at least 93.9% in verifying authorized inspectors and 76.1% in detecting unauthorized ones. Further, the paper demonstrates the applicability of the proposed model to larger groups on a public dataset. This paper provides a solution to challenges facing AI-assisted human–robot interaction, including worker heterogeneity, worker dynamics, and job heterogeneity.

Highlights:
•The Share–Split–Collaborate multitask learning architecture is suitable for speaker-keyword classification.
•Subject-specific and phonetic-specific features intertwined in audio data can be disentangled.
•Rich keyword representations are learned from multi-subject spoken command data.
•A small amount of data from new speakers is sufficient for adding new classes to the speaker classifier.
•Speaker classification scores are also effective for speaker verification.
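To make the Share–Split–Collaborate idea in the summary concrete, the following is a minimal sketch of such a model in PyTorch: a shared feature extractor whose output is projected into separate speaker-specific and keyword-specific features, each feeding its own classification head, with both heads trained jointly on pooled data. All module names, layer sizes, the input format, and the loss weighting are illustrative assumptions, not details taken from the paper; the actual architecture and training procedure are in the full text.

```python
# Sketch of a Share-Split-Collaborate multitask model, as described in the abstract.
# Shared extractor -> two task-specific projections ("split") -> two heads,
# trained jointly ("collaborate"). Layer sizes and names are assumptions.
import torch
import torch.nn as nn


class ShareSplitCollaborateNet(nn.Module):
    def __init__(self, n_speakers=5, n_keywords=10, feat_dim=128):
        super().__init__()
        # Shared feature extractor over a spectrogram-like input.
        self.shared = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim), nn.ReLU(),
        )
        # Split: task-specific projections disentangle subject-specific
        # and keyword-specific features from the shared representation.
        self.speaker_proj = nn.Linear(feat_dim, feat_dim)
        self.keyword_proj = nn.Linear(feat_dim, feat_dim)
        # Task heads: speaker classification and keyword classification.
        self.speaker_head = nn.Linear(feat_dim, n_speakers)
        self.keyword_head = nn.Linear(feat_dim, n_keywords)

    def forward(self, x):
        # x: (batch, 1, freq_bins, time_frames) spectrogram batch.
        h = self.shared(x)
        speaker_logits = self.speaker_head(torch.relu(self.speaker_proj(h)))
        keyword_logits = self.keyword_head(torch.relu(self.keyword_proj(h)))
        return speaker_logits, keyword_logits


# Collaborate: both tasks are optimized jointly with a weighted sum of
# cross-entropy losses (the weighting alpha is an assumption here).
def multitask_loss(speaker_logits, keyword_logits, speaker_y, keyword_y, alpha=0.5):
    ce = nn.CrossEntropyLoss()
    return alpha * ce(speaker_logits, speaker_y) + (1 - alpha) * ce(keyword_logits, keyword_y)
```

Per the summary, adapting such a base model to a new inspector would amount to extending the speaker head by one class and fine-tuning on a handful of that inspector's utterances (e.g., five per keyword); the exact adaptation procedure is described in the paper itself.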
ISSN: 0952-1976, 1873-6769
DOI: 10.1016/j.engappai.2022.105597