A New Time–Frequency Attention Tensor Network for Language Identification


Bibliographic Details
Published in Circuits, systems, and signal processing Vol. 39; no. 5; pp. 2744-2758
Main Authors Miao, Xiaoxiao; McLoughlin, Ian; Yan, Yonghong
Format Journal Article
Language English
Published New York Springer US 01.05.2020
Springer Nature B.V

Summary: In this paper, we aim to improve traditional DNN x-vector language identification performance by employing wide residual networks (WRN) as a powerful feature extractor, which we combine with a novel frequency attention network. In contrast to conventional time attention, our method learns discriminative weights for different frequency bands to generate weighted means and standard deviations for utterance-level classification. This mechanism enables the architecture to direct attention to important frequency bands rather than to important time frames, as traditional time attention methods do. We then introduce a cross-layer frequency attention tensor network (CLF-ATN), which exploits information from different layers to recapture frame-level language characteristics that have been dropped by aggressive frequency pooling in lower layers. This effectively restores fine-grained discriminative language details. Finally, we explore the joint fusion of frame-level and frequency-band attention in a time–frequency attention network. Experimental results show that, firstly, WRN can significantly outperform a traditional DNN x-vector implementation; secondly, the proposed frequency attention method is more effective than time attention; and thirdly, frequency–time score fusion can yield further improvement. In addition, extensive experiments on CLF-ATN demonstrate that it is able to improve discrimination by regaining dropped fine-grained frequency information, particularly for low-dimension frequency features.
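The pooling idea in the summary (attention weights that produce a weighted mean and standard deviation for utterance-level classification) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes a feature map of shape (channels, frequency bands, time frames), takes the attention scores as plain input arrays (in a trained network they would come from a learned scoring layer), and spreads each band or frame weight uniformly over the other axis. The function names `weighted_stats`, `frequency_attention_pool` and `time_attention_pool` are hypothetical.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()


def weighted_stats(fmap, weights):
    """Weighted mean and standard deviation per channel.

    fmap:    (C, F, T) feature map (assumed layout, not from the paper)
    weights: (F, T) non-negative weights that sum to 1
    returns: (2 * C,) vector of [means, standard deviations]
    """
    mean = np.einsum("cft,ft->c", fmap, weights)
    second_moment = np.einsum("cft,ft->c", fmap ** 2, weights)
    # Clamp the variance to avoid sqrt of tiny negative rounding errors.
    std = np.sqrt(np.maximum(second_moment - mean ** 2, 1e-8))
    return np.concatenate([mean, std])


def frequency_attention_pool(fmap, band_scores):
    """One weight per frequency band, shared across all time frames."""
    _, _, n_frames = fmap.shape
    w_f = softmax(band_scores)                           # (F,)
    weights = np.repeat(w_f[:, None], n_frames, axis=1) / n_frames
    return weighted_stats(fmap, weights)


def time_attention_pool(fmap, frame_scores):
    """One weight per time frame, shared across all frequency bands."""
    _, n_bands, _ = fmap.shape
    w_t = softmax(frame_scores)                          # (T,)
    weights = np.repeat(w_t[None, :], n_bands, axis=0) / n_bands
    return weighted_stats(fmap, weights)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fmap = rng.normal(size=(4, 5, 6))        # C=4, F=5, T=6 (toy sizes)
    band_scores = rng.normal(size=5)         # stand-in for learned scores
    embedding = frequency_attention_pool(fmap, band_scores)
    print(embedding.shape)
```

With uniform scores both pooling functions reduce to the ordinary mean and standard deviation over all frequency-time cells, so the attention weights only change which bands (or frames) dominate the statistics. The two pooled vectors differ only in the axis along which the softmax weights vary.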
ISSN: 0278-081X
1531-5878
DOI: 10.1007/s00034-019-01286-9