A Lightweight CNN-Conformer Model for Automatic Speaker Verification

Bibliographic Details
Published in: IEEE Signal Processing Letters, Vol. 31, pp. 1-5
Main Authors: Wang, Hao; Lin, Xiaobing; Zhang, Jiashu
Format: Journal Article
Language: English
Published: New York: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 01.01.2024
More Information
Summary: Recently, the Conformer has achieved tremendous success in the speaker verification task, demonstrating that Transformer-based models can reach remarkable performance in this domain without intricate pre-training procedures. However, its macaron-style feed-forward module introduces prohibitive computation and memory overhead. Speaker verification is often deployed in resource-constrained embedded environments such as smartphones, where only limited memory is available. In light of this, we propose two approaches to compress the Conformer-based system while maintaining its performance. First, we introduce a lightweight Convolutional Neural Network (CNN) front-end with channel-frequency attention to substitute for the shallow Conformer blocks, aiming to extract more informative speaker characteristics for subsequent processing. Second, we introduce a light Feed-forward Network (FFN) based on depth-wise separable convolution to reduce the size of the Conformer blocks. To demonstrate the effectiveness of our model, we evaluate it on three different test sets. By combining the two approaches, we achieve an Equal Error Rate (EER) of 0.61% on VoxCeleb-O, surpassing the previous state-of-the-art Transformer-based model MFA-Conformer, while reducing parameters by 60.6% and FLOPs by 36.8% compared with MFA-Conformer.
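
The abstract names a channel-frequency attention mechanism in the CNN front-end but does not define it. As a rough, non-authoritative sketch, the PyTorch module below applies squeeze-and-excitation-style gating separately along the channel and frequency axes of a (batch, channel, frequency, time) feature map; the class name, mean pooling, and reduction ratio are illustrative assumptions, not the authors' design.

    # Hypothetical channel-frequency attention (PyTorch sketch).
    # Assumption: SE-style gating on both the channel and frequency axes;
    # the paper's actual module may differ.
    import torch
    import torch.nn as nn

    class ChannelFreqAttention(nn.Module):
        def __init__(self, channels: int, freq_bins: int, reduction: int = 4):
            super().__init__()
            # Channel branch: squeeze over (freq, time), excite per channel.
            self.channel_fc = nn.Sequential(
                nn.Linear(channels, channels // reduction),
                nn.ReLU(inplace=True),
                nn.Linear(channels // reduction, channels),
                nn.Sigmoid(),
            )
            # Frequency branch: squeeze over (channel, time), excite per bin.
            self.freq_fc = nn.Sequential(
                nn.Linear(freq_bins, freq_bins // reduction),
                nn.ReLU(inplace=True),
                nn.Linear(freq_bins // reduction, freq_bins),
                nn.Sigmoid(),
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (batch, channels, freq, time)
            c_weight = self.channel_fc(x.mean(dim=(2, 3)))  # (B, C)
            f_weight = self.freq_fc(x.mean(dim=(1, 3)))     # (B, F)
            x = x * c_weight[:, :, None, None]  # rescale each channel
            x = x * f_weight[:, None, :, None]  # rescale each frequency bin
            return x

Such gating adds only two small bottleneck MLPs, so it fits the stated goal of a lightweight front-end that emphasizes informative channels and frequency bands at negligible parameter cost.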
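Similarly, the abstract says only that the light FFN is built from depth-wise separable convolution. One plausible reading, sketched below under stated assumptions, replaces the dense macaron FFN projections with a depth-wise 1-D convolution over time followed by point-wise convolutions; the kernel size, expansion ratio of 2, SiLU activation, and half-step residual are illustrative choices, not values from the paper.

    # Hypothetical light FFN via depth-wise separable 1-D convolution
    # (PyTorch sketch); hyperparameters are assumptions, not the paper's.
    import torch
    import torch.nn as nn

    class LightFFN(nn.Module):
        def __init__(self, d_model: int, expansion: int = 2, kernel_size: int = 3):
            super().__init__()
            self.norm = nn.LayerNorm(d_model)
            # Depth-wise conv: mixes nearby frames per channel, few parameters.
            self.depthwise = nn.Conv1d(
                d_model, d_model, kernel_size,
                padding=kernel_size // 2, groups=d_model,
            )
            # Point-wise convs stand in for the two dense FFN projections.
            self.pointwise_up = nn.Conv1d(d_model, d_model * expansion, 1)
            self.act = nn.SiLU()
            self.pointwise_down = nn.Conv1d(d_model * expansion, d_model, 1)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (batch, time, d_model); Conv1d expects (batch, d_model, time).
            y = self.norm(x).transpose(1, 2)
            y = self.depthwise(y)
            y = self.act(self.pointwise_up(y))
            y = self.pointwise_down(y).transpose(1, 2)
            return x + 0.5 * y  # half-step residual, as in macaron-style FFNs

With expansion 2, this layer holds roughly 4*d_model^2 weights versus about 8*d_model^2 for a standard 4x-expansion FFN, a saving consistent in spirit with the reported parameter reduction, though the paper's exact configuration is not given here.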
ISSN: 1070-9908
EISSN: 1558-2361
DOI: 10.1109/LSP.2023.3342714