En-HACN: Enhancing Hybrid Architecture With Fast Attention and Capsule Network for End-to-End Speech Recognition

Bibliographic Details
Published in: IEEE/ACM Transactions on Audio, Speech, and Language Processing, Vol. 31, pp. 1050-1062
Main Authors: Lyu, Boyang; Fan, Chunxiao; Ming, Yue; Zhao, Panzi; Hu, Nannan
Format: Journal Article
Language: English
Published: Piscataway: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 2023

Summary: Automatic speech recognition (ASR) is a fundamental technology in the field of artificial intelligence. End-to-end (E2E) ASR is favored for its state-of-the-art performance. However, E2E speech recognition still suffers from speech spatial information loss and text local information loss, which increases deletion and substitution errors during inference. To overcome this challenge, we propose a novel Enhancing Hybrid Architecture with Fast Attention and Capsule Network (termed En-HACN), which models the position relationships between different acoustic unit features to improve the discriminability of speech features while providing text local information during inference. First, a new CNN-Capsule Network (CNN-Caps) module is proposed to capture the spatial information in the spectrogram through the capsule output and dynamic routing mechanism. Then, we design a novel hybrid structure, the LocalGRU Augmented Decoder (LA-decoder), which generates text hidden representations to capture the local information of the target sequences. Finally, we introduce fast attention in place of self-attention in En-HACN, which improves the generalization ability and efficiency of the model on long utterances. Experiments on the Aishell-1 and Librispeech corpora demonstrate that En-HACN achieves state-of-the-art performance compared with existing works. In addition, experiments on the long-utterance dataset Aishell-1-long show that our model has high generalization ability and efficiency.
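The abstract names dynamic routing as the mechanism the CNN-Caps module uses to capture spatial information in the spectrogram, but gives no implementation detail. The sketch below is a minimal NumPy version of the standard dynamic-routing procedure from Sabour et al. (2017) that capsule networks typically build on, not the paper's actual module; the tensor shapes, function names, and toy input are illustrative assumptions.

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    # Shrink short vectors toward zero and long vectors toward unit length.
    sq_norm = np.sum(s ** 2, axis=axis, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)

def dynamic_routing(u_hat, num_iters=3):
    # u_hat: prediction vectors of shape (num_in, num_out, dim_out),
    # i.e. what each lower-level capsule i predicts for each higher-level capsule j.
    num_in, num_out, _ = u_hat.shape
    b = np.zeros((num_in, num_out))                            # routing logits, start uniform
    for _ in range(num_iters):
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)   # coupling coefficients (softmax over output capsules)
        s = np.einsum('ij,ijd->jd', c, u_hat)                  # weighted sum of predictions per output capsule
        v = squash(s)                                          # output capsule vectors, shape (num_out, dim_out)
        b = b + np.einsum('ijd,jd->ij', u_hat, v)              # raise logits where prediction and output agree
    return v

# Toy usage: route 6 input capsules to 4 output capsules of dimension 8.
u_hat = np.random.randn(6, 4, 8) * 0.1
print(dynamic_routing(u_hat).shape)  # (4, 8)
```

In an ASR front end of the kind described here, the prediction vectors would be derived from convolutional features of the spectrogram rather than from random values as in this toy example.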
ISSN: 2329-9290, 2329-9304
DOI: 10.1109/TASLP.2023.3245407