Acoustic To Articulatory Speech Inversion Using Multi-Resolution Spectro-Temporal Representations Of Speech Signals
Main Authors | , , , , |
---|---|
Format | Journal Article |
Language | English |
Published | 11.03.2022 |
Subjects | |
Summary: | Multi-resolution spectro-temporal features of a speech signal represent how
the brain perceives sounds by tuning cortical cells to different spectral and
temporal modulations. These features produce a higher-dimensional
representation of the speech signal. The purpose of this paper is to evaluate
how well this auditory-cortex representation of speech signals contributes to
estimating the articulatory features of those signals. Since obtaining
articulatory features from acoustic features of speech signals has been a
challenging topic of interest for different speech communities, we investigate
the possibility of using this multi-resolution representation of speech signals
as acoustic features. We used the University of Wisconsin X-ray Microbeam (XRMB)
database of clean speech signals to train a feed-forward deep neural network
(DNN) to estimate the articulatory trajectories of six tract variables. The
optimal set of multi-resolution spectro-temporal features for training the model
was chosen by selecting the scale and rate vector parameters that gave the
best-performing model. Experiments achieved a correlation of 0.675 with
ground-truth tract variables. We compared the performance of this speech
inversion system with prior experiments conducted using Mel Frequency Cepstral
Coefficients (MFCCs). |
---|---|
DOI: | 10.48550/arxiv.2203.05780 |
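
The summary reports a single correlation score between estimated and ground-truth tract variables. As a rough illustration of how such a score could be computed, the sketch below assumes the Pearson product-moment correlation, evaluated per tract variable and then averaged; the record does not state the exact metric or averaging scheme. The function names and the tract-variable keys are hypothetical and are not taken from the paper.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length trajectories."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 samples")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant trajectory")
    return cov / math.sqrt(var_x * var_y)


def mean_tv_correlation(predicted, reference):
    """Average the per-variable correlations.

    Both arguments map a tract-variable name (hypothetical keys) to a list
    of samples. Averaging across variables is an assumption of this sketch.
    """
    scores = {name: pearson(predicted[name], reference[name]) for name in reference}
    return sum(scores.values()) / len(scores), scores


# Toy data only; these are not values from the XRMB database.
reference = {"tv_a": [0.0, 1.0, 2.0, 3.0], "tv_b": [1.0, 0.5, 0.2, 0.1]}
predicted = {"tv_a": [0.1, 0.9, 2.2, 2.8], "tv_b": [0.9, 0.6, 0.1, 0.2]}
average, per_variable = mean_tv_correlation(predicted, reference)
```

Computing the score per variable before averaging keeps poorly estimated tract variables visible in `per_variable` instead of hiding them behind the aggregate.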