Improving Audio Classification Method by Combining Self-Supervision with Knowledge Distillation


Bibliographic Details
Published in: Electronics (Basel), Vol. 13, no. 1, p. 52
Main Authors: Gong, Xuchao; Duan, Hongjie; Yang, Yaozhong; Tan, Lizhuang; Wang, Jian; Vasilakos, Athanasios V.
Format: Journal Article
Language: English
Published: Basel: MDPI AG, 01.01.2024
Summary: Current single-modality self-supervised audio classification mainly adopts strategies based on audio spectrogram reconstruction. This self-supervised approach is relatively narrow and cannot fully mine key semantic information in the time and frequency domains. To address this, the article proposes a self-supervised method combined with knowledge distillation to further improve performance on audio classification tasks. First, given the particular structure of the two-dimensional audio spectrogram, self-supervised tasks are constructed both along the individual time and frequency dimensions and in the joint time-frequency dimension; through information reconstruction, contrastive learning, and related methods, the model effectively learns spectrogram details and key discriminative information. Second, for feature-level self-supervision, two teacher-student learning strategies are constructed: one internal to the model and one based on knowledge distillation. Fitting the teacher model's feature representations further improves the generalization of the audio classifier. Comparative experiments on the AudioSet, ESC-50, and VGGSound datasets show that the proposed algorithm improves recognition accuracy by 0.5% to 1.3% over the best existing single-modality audio methods.
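The abstract describes two building blocks that can be sketched generically: masking a spectrogram along its time and frequency axes (so a model must reconstruct the hidden regions), and a soft-label knowledge-distillation loss that fits a student to a teacher's output distribution. The snippet below is a minimal illustration of these standard techniques, not the paper's actual implementation; the function names, mask fractions, and temperature value are illustrative assumptions.

```python
import numpy as np

def mask_spectrogram(spec, time_frac=0.2, freq_frac=0.2, rng=None):
    """Zero out one contiguous frequency band and one contiguous time band.

    spec: 2-D array of shape (n_freq, n_time). Returns a masked copy;
    a reconstruction objective would train a model to fill the zeros back in.
    """
    rng = np.random.default_rng(rng)
    spec = spec.copy()
    n_freq, n_time = spec.shape
    # Frequency-domain mask: hide a band of frequency bins across all frames.
    f_w = max(1, int(freq_frac * n_freq))
    f0 = int(rng.integers(0, n_freq - f_w + 1))
    spec[f0:f0 + f_w, :] = 0.0
    # Time-domain mask: hide a span of time frames across all frequency bins.
    t_w = max(1, int(time_frac * n_time))
    t0 = int(rng.integers(0, n_time - t_w + 1))
    spec[:, t0:t0 + t_w] = 0.0
    return spec

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between temperature-softened teacher and student outputs.

    The T**2 factor is the usual scaling that keeps gradient magnitudes
    comparable across temperatures. Identical logits give a loss of 0.
    """
    p_t = softmax(teacher_logits / temperature)
    p_s = softmax(student_logits / temperature)
    kl = np.sum(p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12)), axis=-1)
    return (temperature ** 2) * kl.mean()
```

In a full training loop, the masked spectrogram would feed a reconstruction or contrastive objective, while `distillation_loss` would be added to the classification loss so the student tracks the teacher's soft predictions.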
ISSN: 2079-9292
DOI: 10.3390/electronics13010052