Deep CNNs With Self-Attention for Speaker Identification

Bibliographic Details
Published in	IEEE Access, Vol. 7, pp. 85327-85337
Main Authors	An, Nguyen Nang; Thanh, Nguyen Quang; Liu, Yanbing
Format	Journal Article
Language	English
Published	Piscataway: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 2019
Subjects	Acoustics; Artificial neural networks; Communications technology; deep neural networks; embedding learning; Feature recognition; Hidden Markov models; Identification; Identification methods; Machine learning; Neural networks; self-attention; Speaker identification; Speech recognition; Task analysis
Summary:	Most current work on speaker identification is based on i-vector methods; however, there is a marked shift from the traditional i-vector to deep learning methods, especially convolutional neural networks (CNNs). Rather than designing features and a separate classification model, we address the problem by learning both the features and the recognition system with deep neural networks. Building on deep CNNs, this paper presents a novel text-independent speaker identification method for speaker separation. Specifically, it builds on two representative CNN families: the Visual Geometry Group (VGG) nets and residual neural networks (ResNets). Unlike prior deep neural network-based speaker identification methods, which usually rely on temporal maximum or average pooling across all time steps to map variable-length utterances to a fixed-dimension vector, this paper equips these two CNNs with a structured self-attention mechanism that learns a weighted average across all time steps. With a structured self-attention layer that uses multiple attention hops, the proposed deep CNN is not only capable of handling variable-length segments but also able to learn speaker characteristics from different aspects of the input sequence. Experimental results on the speaker identification benchmark database VoxCeleb demonstrate the superiority of the proposed method over traditional i-vector-based methods and other strong CNN baselines. In addition, the results suggest that it is possible to cluster unknown speakers using the activations of an upper layer of a pre-trained identification CNN as a speaker embedding vector.
ISSN:	2169-3536
DOI:	10.1109/ACCESS.2019.2917470
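
Illustrative sketch:	The summary describes replacing temporal max or average pooling with a structured self-attention layer that learns a weighted average over time steps, using several attention hops. The NumPy sketch below shows one common form of that idea, A = softmax(W2 tanh(W1 H^T)) followed by M = A H. It is an assumption that this form matches the paper's layer; this record does not give the equations. The function name, the weight names W1 and W2, and all sizes (d, d_a, r) are hypothetical, and the weights are random rather than trained.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def structured_self_attention_pool(H, W1, W2):
    """Pool a variable-length sequence into a fixed-size vector.

    H  : (T, d)   frame-level features, e.g. CNN outputs over T time steps
    W1 : (d_a, d) projection into a hypothetical attention space
    W2 : (r, d_a) one row per attention hop
    Returns the flattened embedding (r * d,) and the weights A (r, T).
    """
    scores = W2 @ np.tanh(W1 @ H.T)  # (r, T): one score per hop and time step
    A = softmax(scores, axis=1)      # each hop's weights sum to 1 over time
    M = A @ H                        # (r, d): one weighted average per hop
    return M.reshape(-1), A


# Hypothetical sizes: feature dim d, attention dim d_a, number of hops r.
rng = np.random.default_rng(0)
d, d_a, r = 8, 4, 3
W1 = rng.standard_normal((d_a, d))
W2 = rng.standard_normal((r, d_a))

# Two utterances of different lengths map to embeddings of the same size.
short_utt = rng.standard_normal((50, d))
long_utt = rng.standard_normal((120, d))
emb_short, A_short = structured_self_attention_pool(short_utt, W1, W2)
emb_long, A_long = structured_self_attention_pool(long_utt, W1, W2)
print(emb_short.shape, emb_long.shape)
```

Because the softmax runs over the time axis, the output size depends only on r and d, not on the utterance length. If W2 is all zeros, every hop assigns uniform weights and the layer reduces to plain temporal average pooling, which is the baseline the summary contrasts against.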