Low-Complexity Audio Embedding Extractors

Bibliographic Details
Published in: 2023 31st European Signal Processing Conference (EUSIPCO), pp. 451-455
Main Authors: Schmid, Florian; Koutini, Khaled; Widmer, Gerhard
Format: Conference Proceeding
Language: English
Published: EURASIP, 04.09.2023

More Information
Summary: Solving tasks such as speaker recognition, music classification, or semantic audio event tagging with deep learning models typically requires computationally demanding networks. General-purpose audio embeddings (GPAEs) are dense representations of audio signals that allow lightweight, shallow classifiers to tackle various audio tasks. The idea is that a single complex feature extractor would extract dense GPAEs, while shallow MLPs can produce task-specific predictions. If the extracted dense representations are general enough to allow the simple downstream classifiers to generalize to a variety of tasks in the audio domain, a single costly forward pass suffices to solve multiple tasks in parallel. In this work, we try to reduce the cost of GPAE extractors to make them suitable for resource-constrained devices. We use efficient MobileNets trained on AudioSet using Knowledge Distillation from a Transformer ensemble as efficient GPAE extractors. We explore how to obtain high-quality GPAEs from the model, study how model complexity relates to the quality of extracted GPAEs, and conclude that low-complexity models can generate competitive GPAEs, paving the way for analyzing audio streams on edge devices with respect to multiple audio classification and recognition tasks.
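
The architecture described in the summary, one costly extractor shared by several cheap task-specific heads, can be sketched in a few lines of PyTorch. All names below (ShallowHead, the placeholder extractor, the task labels and class counts) are illustrative assumptions, not the authors' code; the paper's actual extractors are MobileNets trained on AudioSet via knowledge distillation.

import torch
import torch.nn as nn

# Minimal sketch of the GPAE pattern: run one (potentially expensive)
# embedding extractor once, then reuse the resulting embedding across
# several shallow MLP heads. Names and dimensions are placeholders.

class ShallowHead(nn.Module):
    """A lightweight task-specific MLP on top of a fixed embedding."""
    def __init__(self, embed_dim: int, num_classes: int, hidden: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return self.mlp(z)

embed_dim = 960  # assumed embedding width; stands in for the real model's

# Stand-in for the costly extractor (e.g., an AudioSet-pretrained MobileNet).
extractor = nn.Sequential(
    nn.Flatten(),
    nn.LazyLinear(embed_dim),
)
extractor.eval()  # frozen: only the shallow heads are trained downstream

# One shallow head per downstream task, all sharing the same embedding.
heads = nn.ModuleDict({
    "speaker_id":    ShallowHead(embed_dim, num_classes=100),
    "music_genre":   ShallowHead(embed_dim, num_classes=10),
    "event_tagging": ShallowHead(embed_dim, num_classes=527),
})

# A single costly forward pass produces the GPAE ...
spectrogram = torch.randn(8, 1, 128, 1000)  # dummy batch of mel spectrograms
with torch.no_grad():
    gpae = extractor(spectrogram)

# ... which then serves multiple tasks in parallel via the cheap heads.
predictions = {task: head(gpae) for task, head in heads.items()}

Because the extractor is frozen, embeddings can also be computed once per audio clip and cached, so each additional downstream task costs only a shallow MLP pass, which is the property that makes low-complexity extractors attractive on edge devices.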
ISSN: 2076-1465
DOI: 10.23919/EUSIPCO58844.2023.10289815