Learning Filterbanks from Raw Speech for Phone Recognition
We train a bank of complex filters that operates on the raw waveform and is fed into a convolutional neural network for end-to-end phone recognition. These time-domain filterbanks (TD-filterbanks) are initialized as an approximation of mel-filterbanks, and then fine-tuned jointly with the remaining...
Saved in:
Published in | 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) pp. 5509 - 5513 |
---|---|
Main Authors | , , , , , |
Format | Conference Proceeding |
Language | English |
Published |
IEEE
01.04.2018
|
Subjects | |
Online Access | Get full text |
Cover
Loading…
Summary: | We train a bank of complex filters that operates on the raw waveform and is fed into a convolutional neural network for end-to-end phone recognition. These time-domain filterbanks (TD-filterbanks) are initialized as an approximation of mel-filterbanks, and then fine-tuned jointly with the remaining convolutional architecture. We perform phone recognition experiments on TIMIT and show that for several architectures, models trained on TD- filterbanks consistently outperform their counterparts trained on comparable mel-filterbanks. We get our best performance by learning all front-end steps, from pre-emphasis up to averaging. Finally, we observe that the filters at convergence have an asymmetric impulse response, and that some of them remain almost analytic. |
---|---|
ISSN: | 2379-190X |
DOI: | 10.1109/ICASSP.2018.8462015 |