Towards Recognition for Radio-echo Speech in Air Traffic Control: Dataset and a Contrastive Learning Approach

Bibliographic Details
Published in IEEE/ACM transactions on audio, speech, and language processing Vol. 31; pp. 1-14
Main Authors Lin, Yi; Wang, Qingyang; Yu, Xincheng; Zhang, Zichen; Guo, Dongyue; Zhou, Jizhe
Format Journal Article
Language English
Published Piscataway IEEE 01.01.2023
The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Summary: In the air traffic control (ATC) domain, automatic speech recognition (ASR) suffers from radio speech echo, which existing echo cancellation methods cannot address because they are optimized for auditory quality and generalize poorly under volatile radio transmission. In this work, a contrastive learning-based framework, built on multi-path convolutional networks and recurrent neural networks, is proposed to handle radio-echo speech in the ASR task. 1) By analyzing the communication mechanism of ATC speech, a novel transmission method is designed to collect clean and noisy speech samples (with the same texts) via a bypass device in a real-world ATC environment. 2) To enhance model capacity, a temporal and frequency attention block is designed to guide the model to focus on informative frames and frequencies, with the aim of learning shared representations between clean and noisy speech signals with the same texts. 3) By incorporating a contrastive loss, the proposed approach is trained with multi-objective optimization, in which the loss weights are learned dynamically to enhance ASR performance. Using the proposed transmission method, a real-world dataset is collected and annotated to validate the approach. Experimental results demonstrate that the proposed approach outperforms comparative baselines built on different technical frameworks, achieving a 6.76% character error rate on the test dataset. All the proposed improvements are confirmed by dedicated experiments, which show that contrastive learning with learnable multi-objective loss weights accounts for the primary performance improvement.
ISSN:2329-9290
2329-9304
DOI:10.1109/TASLP.2023.3307219
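
Note on the reported metric: the summary gives a 6.76% character error rate (CER) on the test dataset. As a rough illustration of how such a figure is typically computed, the sketch below divides the total character-level edit distance by the total number of reference characters. This is a generic definition and not the authors' evaluation code; the function names `edit_distance` and `character_error_rate`, the corpus-level aggregation, and the choice to count spaces as characters are all assumptions made for the example.

```python
from typing import Sequence


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two character sequences."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference character
                curr[j - 1] + 1,     # insertion of a hypothesis character
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1]


def character_error_rate(refs: Sequence[str], hyps: Sequence[str]) -> float:
    """Corpus-level CER: total edit distance over total reference characters.

    Assumption: spaces count as characters and errors are pooled over the
    whole corpus; the paper may normalize or aggregate differently.
    """
    if len(refs) != len(hyps):
        raise ValueError("refs and hyps must have the same length")
    total_chars = sum(len(r) for r in refs)
    if total_chars == 0:
        raise ValueError("reference transcripts are empty")
    total_errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    return total_errors / total_chars


if __name__ == "__main__":
    # Hypothetical transcripts, invented for illustration only.
    refs = ["climb to flight level three"]
    hyps = ["climb to flight level tree"]
    print(f"CER = {character_error_rate(refs, hyps):.4f}")
```

Under this definition, a CER of 6.76% would correspond to roughly 7 character errors per 100 reference characters.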