Weighted Transformer for Dialect Speech Recognition
Published in | 2022 IEEE International Conference on Knowledge Graph (ICKG), pp. 381-385
Format | Conference Proceeding
Language | English
Published | IEEE, 01.11.2022
Summary: End-to-end automatic speech recognition (ASR) with transformer models has recently made significant progress, surpassing recurrent neural networks (RNNs) and convolutional neural networks (CNNs), mainly because transformers can model larger context in a parameter-efficient way using self-attention and feed-forward layers. The self-attention mechanism in particular has attracted considerable research interest and has demonstrated promising results on a variety of ASR tasks. However, classical transformer-based approaches usually require a large number of parameters, large amounts of training data, and many training iterations to converge well. In addition, dialects are common in daily life, and one may not have much data to start with. More importantly, different dialects differ markedly at the phonological level even though they may be largely similar at the lexical level. Efficient training of highly accurate ASR models for various dialects is therefore very desirable but remains a challenging problem. In this work, we introduce a weighted transformer that makes better use of the multi-head attention mechanism and obtains more accurate results for four spoken English dialects. We pretrain our base models, including weighted transformers, on the 960-hour LibriSpeech dataset and adapt them to English dialect data from the Common Voice and LibriSpeech SLR83 speech datasets, respectively. Experimental results show that the added weights help distinguish different dialects and yield better representations, and that our proposed dialect-dependent ASR system is significantly more accurate than the classical transformer baseline. We also found that training speed improved by 15%-30% with the new model.
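The abstract does not give the exact formulation of the weighted attention, so the following is only a minimal NumPy sketch of one plausible reading: standard multi-head self-attention in which each head's output is scaled by a learned scalar weight before the heads are combined. All names, shapes, and the weighting scheme here are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def weighted_multi_head_attention(x, Wq, Wk, Wv, Wo, head_weights):
    """Multi-head self-attention with a learned scalar weight per head.

    x: (seq_len, d_model)
    Wq, Wk, Wv: (n_heads, d_model, d_head) projection matrices
    Wo: (n_heads * d_head, d_model) output projection
    head_weights: (n_heads,) per-head scaling factors (hypothetical)
    """
    n_heads, d_model, d_head = Wq.shape
    outs = []
    for h in range(n_heads):
        q = x @ Wq[h]                              # (seq_len, d_head)
        k = x @ Wk[h]
        v = x @ Wv[h]
        scores = q @ k.T / np.sqrt(d_head)         # scaled dot-product
        attn = softmax(scores, axis=-1)
        outs.append(head_weights[h] * (attn @ v))  # scale this head's output
    return np.concatenate(outs, axis=-1) @ Wo      # (seq_len, d_model)

rng = np.random.default_rng(0)
seq, d_model, n_heads, d_head = 5, 16, 4, 4
x = rng.standard_normal((seq, d_model))
Wq, Wk, Wv = (rng.standard_normal((n_heads, d_model, d_head)) * 0.1
              for _ in range(3))
Wo = rng.standard_normal((n_heads * d_head, d_model)) * 0.1
w = softmax(rng.standard_normal(n_heads))  # normalized per-head weights
y = weighted_multi_head_attention(x, Wq, Wk, Wv, Wo, w)
print(y.shape)  # (5, 16)
```

In this reading, the per-head weights would let fine-tuning on a given dialect re-emphasize the heads most useful for that dialect's phonology while leaving the shared parameters largely intact; whether the paper uses per-head scalars or a different weighting is not stated in the abstract.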
DOI | 10.1109/ICKG55886.2022.00055