Towards Recognition for Radio-echo Speech in Air Traffic Control: Dataset and a Contrastive Learning Approach

Bibliographic Details
Published in	IEEE/ACM transactions on audio, speech, and language processing Vol. 31; pp. 1 - 14
Main Authors	Lin, Yi; Wang, Qingyang; Yu, Xincheng; Zhang, Zichen; Guo, Dongyue; Zhou, Jizhe
Format	Journal Article
Language	English
Published	Piscataway: IEEE, 01.01.2023; The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Subjects	Air traffic control; Automatic Speech Recognition; Contrastive Learning; Datasets; Error analysis; Hidden Markov models; Learnable Loss Weights; Learning; Multiple objective analysis; Neural networks; Noise measurement; Optimization; Performance evaluation; Radio; Radio transmission; Radio-echo Speech; Recurrent neural networks; Speech; Speech enhancement; Speech recognition; Task analysis; Temporal and Frequency Attention; Texts; Voice recognition
Summary:	In the air traffic control (ATC) domain, automatic speech recognition (ASR) suffers from radio speech echo. Existing echo cancellation methods cannot address it because they are optimized for auditory quality and generalize poorly under volatile radio transmission. This work proposes a contrastive learning-based framework for ASR on radio-echo speech, built on multi-path convolutional networks and recurrent neural networks. 1) Based on an analysis of the ATC communication mechanism, a novel transmission method is designed to collect clean and noisy speech samples with the same texts via a bypass device in a real-world ATC environment. 2) To enhance model capacity, a temporal and frequency attention block is designed to guide the model toward informative frames and frequencies, so that it learns shared representations between clean and noisy speech signals with the same texts. 3) A contrastive loss is incorporated, and the approach is trained by multi-objective optimization in which the loss weights are learnable and determined dynamically to improve ASR performance. Using the proposed transmission method, a real-world dataset is collected and annotated to validate the approach. Experimental results show that the approach outperforms comparative baselines built on different technical frameworks, achieving a 6.76% character error rate on the test dataset. All proposed improvements are confirmed by dedicated experiments, in which contrastive learning with learnable multi-objective loss weights contributes the primary performance improvement.
ISSN:	2329-9290 2329-9304
DOI:	10.1109/TASLP.2023.3307219
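
Note on the reported metric: the summary quotes a character error rate (CER) without defining it. As a minimal sketch, assuming the conventional definition (character-level edit distance summed over all utterances, divided by the total number of reference characters), the metric could be computed as below. The function names and the sample transcripts are illustrative and do not come from the paper, and whether spaces are counted is also an assumption.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty hypothesis
        for j, h in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if r == h else 1)
            deletion = prev[j] + 1       # drop a reference character
            insertion = curr[j - 1] + 1  # extra hypothesis character
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


def character_error_rate(pairs) -> float:
    """CER over (reference, hypothesis) pairs, as a fraction in [0, inf)."""
    pairs = list(pairs)
    total_edits = sum(edit_distance(ref, hyp) for ref, hyp in pairs)
    total_chars = sum(len(ref) for ref, _ in pairs)
    if total_chars == 0:
        raise ValueError("reference transcripts contain no characters")
    return total_edits / total_chars


if __name__ == "__main__":
    # Hypothetical transcripts; spaces are counted as characters here (assumption).
    samples = [
        ("climb to 3000", "climb to 3100"),
        ("contact tower", "contact tower"),
    ]
    print(f"CER = {100 * character_error_rate(samples):.2f}%")
```

Under this definition, a 6.76% CER corresponds to roughly 6.76 character edits (substitutions, deletions, or insertions) per 100 reference characters.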