Learning Filterbanks from Raw Speech for Phone Recognition

Bibliographic Details
Published in	2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5509-5513
Main Authors	Zeghidour, Neil; Usunier, Nicolas; Kokkinos, Iasonas; Schatz, Thomas; Synnaeve, Gabriel; Dupoux, Emmanuel
Format	Conference Proceeding
Language	English
Published	IEEE 01.04.2018
Subjects	Computer architecture; Convolution; Scattering; Speech recognition; Time-domain analysis; Training
Summary:	We train a bank of complex filters that operates on the raw waveform and is fed into a convolutional neural network for end-to-end phone recognition. These time-domain filterbanks (TD-filterbanks) are initialized as an approximation of mel-filterbanks, and then fine-tuned jointly with the remaining convolutional architecture. We perform phone recognition experiments on TIMIT and show that for several architectures, models trained on TD-filterbanks consistently outperform their counterparts trained on comparable mel-filterbanks. We get our best performance by learning all front-end steps, from pre-emphasis up to averaging. Finally, we observe that the filters at convergence have an asymmetric impulse response, and that some of them remain almost analytic.
ISSN:	2379-190X
DOI:	10.1109/ICASSP.2018.8462015
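
Illustration:	The front end described in the summary (complex filters applied to the raw waveform, followed by a modulus and an averaging step) can be sketched roughly as below. This is a minimal illustration under stated assumptions and not the authors' implementation: the Gabor filter shape, the fixed envelope width `sigma`, the uniform (rather than windowed) averaging, the log compression, and all function names are choices made here for the example. Nothing is learned in this sketch; it only shows a plausible mel-like starting point for such filters.

```python
import cmath
import math


def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_center_freqs(n_filters, f_min, f_max):
    """Center frequencies (Hz) evenly spaced on the mel scale."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_filters + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_filters)]


def gabor_filter(center_hz, sigma, width, sample_rate):
    """Complex filter: Gaussian envelope times a complex sinusoid.

    Assumption: a single fixed sigma for every filter; a closer mel
    approximation would widen the bandwidth with center frequency.
    """
    half = width // 2
    omega = 2.0 * math.pi * center_hz / sample_rate
    return [
        math.exp(-(t * t) / (2.0 * sigma * sigma)) * cmath.exp(1j * omega * t)
        for t in range(-half, half + 1)
    ]


def td_filterbank_frame(frame, filters):
    """One feature vector: convolve, squared modulus, average, log."""
    feats = []
    for h in filters:
        n_out = len(frame) - len(h) + 1
        energy = 0.0
        for start in range(n_out):
            # Valid-mode convolution (filter taps applied in reverse order).
            acc = sum(frame[start + k] * h[len(h) - 1 - k] for k in range(len(h)))
            energy += abs(acc) ** 2
        # Uniform averaging over the frame, then log compression.
        feats.append(math.log(1.0 + energy / n_out))
    return feats


# Usage: a 25 ms frame of a 1 kHz tone sampled at 16 kHz.
sample_rate = 16000
centers = mel_center_freqs(n_filters=8, f_min=100.0, f_max=7000.0)
filters = [
    gabor_filter(f, sigma=20.0, width=101, sample_rate=sample_rate)
    for f in centers
]
frame = [math.cos(2.0 * math.pi * 1000.0 * n / sample_rate) for n in range(400)]
feats = td_filterbank_frame(frame, filters)
```

In this sketch the filter whose center frequency lies closest to the tone should carry the largest value in `feats`. In a trainable version, the filter taps (and, per the summary, the pre-emphasis and averaging steps) would be parameters updated jointly with the rest of the network.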