Improving Audio Classification Method by Combining Self-Supervision with Knowledge Distillation


Bibliographic Details
Published in: Electronics (Basel), Vol. 13, no. 1, p. 52
Main Authors: Gong, Xuchao; Duan, Hongjie; Yang, Yaozhong; Tan, Lizhuang; Wang, Jian; Vasilakos, Athanasios V.
Format: Journal Article
Language: English
Published: Basel: MDPI AG, 01.01.2024
Summary: Current single-modality self-supervised audio classification mainly adopts strategies based on audio spectrogram reconstruction. This self-supervised approach is relatively narrow and cannot fully mine key semantic information in the time and frequency domains. To address this, the article proposes a self-supervised method combined with knowledge distillation to further improve performance on audio classification tasks. First, given the particular structure of the two-dimensional audio spectrogram, self-supervised tasks are constructed both along the individual time and frequency dimensions and in the joint time-frequency dimension; through information reconstruction, contrastive learning, and related methods, the model effectively learns spectrogram details and key discriminative information. Second, for feature-level self-supervision, two teacher-student learning strategies are constructed: one internal to the model and one based on knowledge distillation. Fitting the teacher model's feature representations further improves the generalization of the audio classifier. Comparative experiments on the AudioSet, ESC-50, and VGGSound datasets show that the proposed algorithm improves recognition accuracy by 0.5% to 1.3% over the best existing single-modality audio methods.
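The abstract describes two building blocks that can be sketched generically: masking a spectrogram along its time and frequency axes (so a model must reconstruct the hidden regions), and a soft-label knowledge-distillation loss that fits a student to a teacher's output distribution. The snippet below is a minimal illustration of these standard techniques, not the paper's actual implementation; the function names, mask fractions, and temperature value are illustrative assumptions.

```python
import numpy as np

def mask_spectrogram(spec, time_frac=0.2, freq_frac=0.2, rng=None):
    """Zero out one contiguous frequency band and one contiguous time band.

    spec: 2-D array of shape (n_freq, n_time). Returns a masked copy;
    a reconstruction objective would train a model to fill the zeros back in.
    """
    rng = np.random.default_rng(rng)
    spec = spec.copy()
    n_freq, n_time = spec.shape
    # Frequency-domain mask: hide a band of frequency bins across all frames.
    f_w = max(1, int(freq_frac * n_freq))
    f0 = int(rng.integers(0, n_freq - f_w + 1))
    spec[f0:f0 + f_w, :] = 0.0
    # Time-domain mask: hide a span of time frames across all frequency bins.
    t_w = max(1, int(time_frac * n_time))
    t0 = int(rng.integers(0, n_time - t_w + 1))
    spec[:, t0:t0 + t_w] = 0.0
    return spec

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between temperature-softened teacher and student outputs.

    The T**2 factor is the usual scaling that keeps gradient magnitudes
    comparable across temperatures. Identical logits give a loss of 0.
    """
    p_t = softmax(teacher_logits / temperature)
    p_s = softmax(student_logits / temperature)
    kl = np.sum(p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12)), axis=-1)
    return (temperature ** 2) * kl.mean()
```

In a full training loop, the masked spectrogram would feed a reconstruction or contrastive objective, while `distillation_loss` would be added to the classification loss so the student tracks the teacher's soft predictions.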
ISSN: 2079-9292
DOI: 10.3390/electronics13010052