Batch-normalized joint training for DNN-based distant speech recognition
Main Authors |  |
Format | Journal Article |
Language | English |
Published | 24.03.2017 |
Summary: | Improving distant speech recognition is a crucial step towards flexible human-machine interfaces. Current technology, however, still lacks robustness, especially in adverse acoustic conditions. Despite the significant progress made in recent years on both speech enhancement and speech recognition, one potential limitation of state-of-the-art technology lies in composing modules that are not well matched because they are not trained jointly. To address this concern, a promising approach consists in concatenating a speech enhancement and a speech recognition deep neural network and jointly updating their parameters as if they formed a single, larger network. Unfortunately, joint training can be difficult because the output distribution of the speech enhancement system may change substantially during the optimization procedure, forcing the speech recognition module to cope with an input distribution that is non-stationary and unnormalized. To mitigate this issue, we propose a joint training approach based on a fully batch-normalized architecture. Experiments conducted using different datasets, tasks, and acoustic conditions revealed that the proposed framework significantly outperforms other competitive solutions, especially in challenging environments. |
DOI: | 10.48550/arxiv.1703.08471 |
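The core mechanism the abstract relies on can be illustrated with a minimal NumPy sketch, assuming a standard mini-batch normalization step (the paper's full architecture, with its concatenated enhancement and recognition networks trained by joint backpropagation, is not reproduced here; the array shapes and the simulated "enhancement output" batches below are purely illustrative):

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize each feature over the mini-batch, then scale and shift.

    Applied after every layer of a jointly trained pipeline, this keeps
    the recognition module's input distribution stable even while the
    enhancement module's output statistics drift during optimization.
    """
    mean = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
# Two hypothetical enhancement-output batches whose statistics differ,
# mimicking the non-stationarity that arises during joint training.
early = rng.normal(loc=0.0, scale=1.0, size=(64, 40))
late = rng.normal(loc=3.0, scale=5.0, size=(64, 40))

# After normalization, both batches have roughly zero mean and unit
# variance per feature, so the downstream network sees a stationary input.
for batch in (early, late):
    normed = batch_norm(batch)
    assert np.allclose(normed.mean(axis=0), 0.0, atol=1e-6)
    assert np.allclose(normed.var(axis=0), 1.0, atol=1e-3)
```

Because the per-batch statistics are recomputed at every forward pass, the normalization adapts as the upstream enhancement network changes, which is the property the proposed framework exploits.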