Knowledge-driven Subword Grammar Modeling for Automatic Speech Recognition in Tamil and Kannada
In this paper, we present specially designed automatic speech recognition (ASR) systems for the highly agglutinative and inflective languages of Tamil and Kannada that can recognize unlimited vocabulary of words. We use subwords as the basic lexical units for recognition and construct subword gramma...
Saved in:
Main Authors | , , |
---|---|
Format | Journal Article |
Language | English |
Published |
27.07.2022
|
Subjects | |
Online Access | Get full text |
Cover
Loading…
Summary: | In this paper, we present specially designed automatic speech recognition
(ASR) systems for the highly agglutinative and inflective languages of Tamil
and Kannada that can recognize unlimited vocabulary of words. We use subwords
as the basic lexical units for recognition and construct subword grammar
weighted finite state transducer (SG-WFST) graphs for word segmentation that
captures most of the complex word formation rules of the languages. We have
identified the following category of words (i) verbs, (ii) nouns, (ii)
pronouns, and (iv) numbers. The prefix, infix and suffix lists of subwords are
created for each of these categories and are used to design the SG-WFST graphs.
We also present a heuristic segmentation algorithm that can even segment
exceptional words that do not follow the rules encapsulated in the SG-WFST
graph. Most of the data-driven subword dictionary creation algorithms are
computation driven, and hence do not guarantee morpheme-like units and so we
have used the linguistic knowledge of the languages and manually created the
subword dictionaries and the graphs. Finally, we train a deep neural network
acoustic model and combine it with the pronunciation lexicon of the subword
dictionary and the SG-WFST graph to build the subword-ASR systems. Since the
subword-ASR produces subword sequences as output for a given test speech, we
post-process its output to get the final word sequence, so that the actual
number of words that can be recognized is much higher. Upon experimenting the
subword-ASR system with the IISc-MILE Tamil and Kannada ASR corpora, we observe
an absolute word error rate reduction of 12.39% and 13.56% over the baseline
word-based ASR systems for Tamil and Kannada, respectively. |
---|---|
DOI: | 10.48550/arxiv.2207.13333 |