Dilated residual networks with multi-level attention for speaker verification

Bibliographic Details
Published in: Neurocomputing (Amsterdam), Vol. 412, pp. 177-186
Main Authors: Wu, Yanfeng; Guo, Chenkai; Gao, Hongcan; Xu, Jing; Bai, Guangdong
Format: Journal Article
Language: English
Published: Elsevier B.V., 28.10.2020
More Information
Summary: With the development of deep learning techniques, speaker verification (SV) systems based on deep neural networks (DNNs) achieve competitive performance compared with traditional i-vector-based approaches. Previous DNN-based SV methods usually employ time-delay neural networks, which limits the extension of the network toward effective representations. Besides, existing attention mechanisms used in DNN-based SV systems are only applied at a single level of the network architecture, leading to insufficient extraction of important features. To address the above issues, we propose an effective deep speaker embedding architecture for SV, which combines a residual connection of one-dimensional dilated convolutional layers, called dilated residual networks (DRNs), with a multi-level attention model. The DRNs can not only capture long time-frequency context information of features, but also exploit information from multiple layers of the DNN. In addition, the multi-level attention model, which consists of two-dimensional convolutional block attention modules employed at the frame level and vector-based attention utilized at the pooling layer, can emphasize important features at multiple levels of the DNN. Experiments conducted on the NIST SRE 2016 dataset show that the proposed architecture achieves a superior equal error rate (EER) of 7.094% and a better detection cost function (DCF16) of 0.552 compared with state-of-the-art methods. Furthermore, ablation experiments demonstrate the effectiveness of dilated convolutions and the multi-level attention on SV tasks.
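The abstract describes two of the building blocks in general terms: residual blocks of one-dimensional dilated convolutions applied at the frame level, and a vector-based attention applied at the pooling layer. The sketch below is a minimal PyTorch-style illustration of those two ideas only; it is not the authors' implementation. The class names (DilatedResBlock, AttentivePooling, SpeakerEmbeddingNet), all layer sizes, the choice of attentive statistics pooling, and the omission of the frame-level convolutional block attention modules and multi-layer aggregation are assumptions made for illustration.

```python
# Minimal sketch (assumed PyTorch, hypothetical names and sizes), illustrating:
#  - a residual block of 1-D dilated convolutions over the time axis, and
#  - a vector-based attention at the pooling layer (attentive statistics pooling).
# This is NOT the paper's implementation; CBAM-style frame-level attention and
# multi-layer feature aggregation described in the abstract are omitted here.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DilatedResBlock(nn.Module):
    """Residual block of 1-D dilated convolutions over the frame (time) axis."""

    def __init__(self, channels: int, dilation: int, kernel_size: int = 3):
        super().__init__()
        padding = dilation * (kernel_size - 1) // 2  # keep the number of frames unchanged
        self.conv1 = nn.Conv1d(channels, channels, kernel_size,
                               padding=padding, dilation=dilation)
        self.bn1 = nn.BatchNorm1d(channels)
        self.conv2 = nn.Conv1d(channels, channels, kernel_size,
                               padding=padding, dilation=dilation)
        self.bn2 = nn.BatchNorm1d(channels)

    def forward(self, x):                      # x: (batch, channels, frames)
        y = F.relu(self.bn1(self.conv1(x)))
        y = self.bn2(self.conv2(y))
        return F.relu(x + y)                   # residual connection


class AttentivePooling(nn.Module):
    """Vector-based attention over frames, producing a fixed-size utterance vector."""

    def __init__(self, channels: int, attention_dim: int = 128):
        super().__init__()
        self.proj = nn.Conv1d(channels, attention_dim, kernel_size=1)
        self.score = nn.Conv1d(attention_dim, channels, kernel_size=1)

    def forward(self, x):                      # x: (batch, channels, frames)
        w = torch.softmax(self.score(torch.tanh(self.proj(x))), dim=2)
        mean = torch.sum(w * x, dim=2)                                  # weighted mean
        var = (torch.sum(w * x ** 2, dim=2) - mean ** 2).clamp(min=1e-6)
        return torch.cat([mean, var.sqrt()], dim=1)                     # (batch, 2*channels)


class SpeakerEmbeddingNet(nn.Module):
    """Frame-level dilated residual blocks -> attentive pooling -> speaker embedding."""

    def __init__(self, feat_dim: int = 40, channels: int = 512, embed_dim: int = 256):
        super().__init__()
        self.front = nn.Conv1d(feat_dim, channels, kernel_size=5, padding=2)
        self.blocks = nn.Sequential(*[DilatedResBlock(channels, d) for d in (1, 2, 4)])
        self.pool = AttentivePooling(channels)
        self.embed = nn.Linear(2 * channels, embed_dim)

    def forward(self, feats):                  # feats: (batch, feat_dim, frames)
        x = F.relu(self.front(feats))
        x = self.blocks(x)
        return self.embed(self.pool(x))        # fixed-dimensional speaker embedding


if __name__ == "__main__":
    net = SpeakerEmbeddingNet()
    dummy = torch.randn(2, 40, 300)            # 2 utterances, 40-dim features, 300 frames
    print(net(dummy).shape)                    # torch.Size([2, 256])
```

Increasing the dilation factor from block to block (1, 2, 4 in this sketch) widens the temporal receptive field without adding parameters, which matches the abstract's motivation of capturing long time-frequency context; the exact dilation schedule used in the paper is not given in this record.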
ISSN: 0925-2312, 1872-8286
DOI: 10.1016/j.neucom.2020.06.079