Self Attention Networks in Speaker Recognition

Bibliographic Details
Published in: Applied Sciences, Vol. 13, No. 11, p. 6410
Main Authors: Safari, Pooyan; India, Miquel; Hernando, Javier
Format: Journal Article
Language: English
Published: Basel, MDPI AG, 24.05.2023
Summary: Recently, there has been a significant surge of interest in Self-Attention Networks (SANs) based on the Transformer architecture, which can be attributed to their amenability to parallelization and their impressive performance across various Natural Language Processing applications. At the same time, large-scale, multi-purpose language models trained through self-supervision are increasingly prevalent for tasks such as speech recognition. In this context, a model pre-trained on extensive speech data can be fine-tuned for particular downstream tasks such as speaker verification. These massive models typically rely on SANs as their foundational architecture. Therefore, studying the potential capabilities and training challenges of such models is of utmost importance for the next generation of speaker verification systems. In this direction, we propose a speaker embedding extractor based on SANs that obtains a discriminative speaker representation from variable-length speech utterances. With the advancements suggested in this work, we achieve up to a 41% relative performance improvement in terms of EER compared to the naive SAN proposed in our previous work. Moreover, we empirically show the training instability of such architectures in terms of rank collapse and further investigate potential solutions to alleviate this shortcoming.
ISSN: 2076-3417
DOI: 10.3390/app13116410
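
To make the idea in the summary concrete, below is a minimal, illustrative sketch of a self-attention-based speaker embedding extractor that maps variable-length utterances to fixed-size speaker representations. It is written in PyTorch; the encoder depth, layer sizes, attentive pooling mechanism, and embedding dimension are assumptions chosen for illustration and are not taken from the paper.

# Illustrative sketch only (assumed architecture, not the authors' exact model):
# a self-attention encoder over variable-length acoustic features, followed by
# attentive pooling that yields a fixed-size speaker embedding.
import torch
import torch.nn as nn

class SANSpeakerEmbedder(nn.Module):
    def __init__(self, feat_dim=80, d_model=256, n_heads=4, n_layers=4, emb_dim=192):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.attn_score = nn.Linear(d_model, 1)   # one attention weight per frame
        self.embedding = nn.Linear(d_model, emb_dim)

    def forward(self, feats, pad_mask=None):
        # feats: (batch, frames, feat_dim); pad_mask: (batch, frames), True = padded
        h = self.encoder(self.proj(feats), src_key_padding_mask=pad_mask)
        scores = self.attn_score(h)               # (batch, frames, 1)
        if pad_mask is not None:
            scores = scores.masked_fill(pad_mask.unsqueeze(-1), float("-inf"))
        weights = torch.softmax(scores, dim=1)
        pooled = (weights * h).sum(dim=1)         # weighted average over frames
        return self.embedding(pooled)             # fixed-size speaker embedding

# Usage: two zero-padded utterances of different lengths map to 192-dim embeddings.
model = SANSpeakerEmbedder()
feats = torch.randn(2, 300, 80)
pad_mask = torch.zeros(2, 300, dtype=torch.bool)
pad_mask[1, 200:] = True                          # second utterance has 200 frames
print(model(feats, pad_mask).shape)               # torch.Size([2, 192])

The padding mask keeps frames beyond each utterance's true length from contributing either to self-attention or to the pooled statistics, which is what allows the same model to handle non-fixed-length input.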