State-of-the-art speaker recognition with neural network embeddings in NIST SRE18 and Speakers in the Wild evaluations
Published in: Computer Speech & Language, Vol. 60, p. 101026
Main Authors: , , , , , , , , , , ,
Format: Journal Article
Language: English
Published: Elsevier Ltd, 01.03.2020
Summary:
• Neural network embeddings become the new state-of-the-art in speaker recognition evaluations, improving i-vector performance by a factor of two in some conditions.
• Comparing network architectures for x-vectors, factorized TDNN performed the best in a moderately large setup. However, E-TDNN can also be competitive with a larger training setup.
• Comparing pooling methods, the learnable dictionary encoder performed the best, indicating that we can take advantage of multi-modal frame-level hidden representations.
• Angular-margin-based training objectives performed better in in-domain conditions but not in domain-mismatched conditions.
• Calibration in a new domain can be achieved by MAP adaptation of the out-of-domain score distribution to the new domain using a very limited number of in-domain recordings.
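To make the pooling comparison concrete, here is a minimal NumPy sketch of learnable-dictionary-encoder (LDE) pooling: each frame is softly assigned to a set of learned dictionary components, and the per-component aggregated residuals are concatenated, so the utterance-level embedding can capture multi-modal frame distributions. The exact parameterization (isotropic per-component scales `s`, function name `lde_pooling`) is an illustrative assumption, not the paper's exact layer.

```python
import numpy as np

def lde_pooling(frames, mu, s):
    """LDE pooling sketch.

    frames: (T, D) frame-level hidden representations
    mu:     (C, D) learned dictionary component means
    s:      (C,)   positive per-component scale parameters
    Returns a (C*D,) utterance-level embedding.
    """
    r = frames[:, None, :] - mu[None]               # (T, C, D) residuals to each component
    d = -s * (r ** 2).sum(-1)                       # (T, C) negative scaled squared distances
    w = np.exp(d - d.max(axis=1, keepdims=True))    # numerically stable softmax over components
    w /= w.sum(axis=1, keepdims=True)               # (T, C) soft assignments
    e = (w[..., None] * r).sum(0) / (w.sum(0)[:, None] + 1e-8)  # (C, D) aggregated residuals
    return e.reshape(-1)                            # concatenate components
```

In a trained network, `mu` and `s` are learned jointly with the encoder; here they are plain arrays so the aggregation step can be inspected in isolation.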
We present a thorough analysis of the systems developed by the JHU-MIT consortium for the NIST speaker recognition evaluation 2018. In the previous NIST evaluation, in 2016, i-vectors were the speaker recognition state of the art. Now, however, neural network embeddings (a.k.a. x-vectors) have risen as the best-performing approach. We show that in some conditions the x-vector detection error reduces by a factor of two w.r.t. i-vectors. In this work, we experimented on the Speakers In The Wild evaluation (SITW), NIST SRE18 VAST (Video Annotation for Speech Technology), and SRE18 CMN2 (Call My Net 2, telephone Tunisian Arabic) to compare network architectures, pooling layers, training objectives, back-end adaptation methods, and calibration techniques. x-Vectors based on factorized and extended TDNN networks achieved unparalleled performance on SITW and CMN2 data. However, for VAST, performance was significantly worse than for SITW. We noted that the VAST audio quality was severely degraded compared to SITW, even though both consist of Internet videos. This degradation caused a strong domain mismatch between training and VAST data. Due to this mismatch, large networks performed only slightly better than smaller ones, and calibration on VAST became harder. Nevertheless, we managed to calibrate VAST by adapting the SITW score distribution to VAST using a small amount of in-domain development data.
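The calibration strategy described above can be illustrated with a minimal sketch: assume target and non-target score distributions are Gaussians estimated on out-of-domain data, MAP-adapt their means with a handful of in-domain scores, and compute a calibrated log-likelihood ratio from the adapted Gaussians. The function names and the relevance factor `r` are illustrative assumptions, not the paper's exact recipe.

```python
import numpy as np

def map_adapt_mean(ood_mean, in_domain_scores, r=16.0):
    """MAP adaptation of a score-distribution mean (sketch).

    With few in-domain scores (small n) the out-of-domain prior mean
    dominates; as n grows the in-domain sample mean takes over.
    """
    n = len(in_domain_scores)
    return (r * ood_mean + n * np.mean(in_domain_scores)) / (r + n)

def gaussian_llr(score, mu_tar, mu_non, sigma):
    """Calibrated LLR assuming equal-variance Gaussian score distributions:
    log N(s; mu_tar, sigma) - log N(s; mu_non, sigma)."""
    return ((score - mu_non) ** 2 - (score - mu_tar) ** 2) / (2 * sigma ** 2)
```

The appeal of this scheme is that only a few distribution parameters need adapting, so a very limited number of in-domain recordings can suffice.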
Regarding pooling methods, the learnable dictionary encoder performed best, suggesting that the representations learned by x-vector encoders are multi-modal. Maximum-margin losses outperformed cross-entropy on in-domain data but not on the mismatched VAST data. We also analyzed back-end adaptation methods on CMN2: semi-supervised PLDA adaptation and adaptive score normalization (AS-Norm) yielded significant improvements, though results were still worse than in English in-domain conditions such as SITW.
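As a rough illustration of adaptive score normalization, here is a minimal AS-Norm variant in NumPy: a trial score is z-normalized against the top-k most competitive cohort scores of the enrollment side and of the test side, and the two normalized scores are averaged. The cohort construction and the `top_k` value are assumptions for illustration, not necessarily the configuration used in the paper.

```python
import numpy as np

def as_norm(score, enroll_cohort_scores, test_cohort_scores, top_k=200):
    """Adaptive symmetric score normalization (AS-Norm sketch).

    enroll_cohort_scores: scores of the enrollment embedding vs. a cohort
    test_cohort_scores:   scores of the test embedding vs. the same cohort
    The adaptive part: only the top-k highest (most competitive) cohort
    scores on each side are used to estimate the normalization statistics.
    """
    e_top = np.sort(enroll_cohort_scores)[-top_k:]
    t_top = np.sort(test_cohort_scores)[-top_k:]
    z = (score - e_top.mean()) / (e_top.std() + 1e-8)  # enroll-side z-norm
    t = (score - t_top.mean()) / (t_top.std() + 1e-8)  # test-side t-norm
    return 0.5 * (z + t)
```

Because the statistics come from each trial's own most competitive impostors, the normalization adapts per trial, which helps under domain shift.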
We conclude that x-vectors have become the new state-of-the-art in speaker recognition. However, their advantage diminishes under strong domain mismatch. We need to investigate domain adaptation and domain-invariant training approaches to improve performance in all conditions. Speech enhancement techniques focused on improving speaker recognition performance could also be of great help.
ISSN: 0885-2308, 1095-8363
DOI: 10.1016/j.csl.2019.101026