Environmental sound classification using a regularized deep convolutional neural network with data augmentation
Published in: Applied Acoustics, Vol. 167, p. 107389
Format: Journal Article
Language: English
Published: Elsevier Ltd, 01.10.2020
Summary: The adoption of environmental sound classification (ESC) has increased rapidly in recent years owing to its broad range of applications in daily life. ESC, also known as sound event recognition (SER), involves recognizing audio streams associated with various environmental sounds. Several common factors, such as the non-uniform distance between the acoustic source and the microphone, differences in recording conditions, the presence of numerous sound sources in a recording, and overlapping sound events, make ESC a complex problem. This study employs deep convolutional neural networks (DCNN) with regularization and data augmentation, together with basic audio features that have proved effective on ESC tasks. The performance of the DCNN with a max-pooling function (Model-1) and without it (Model-2) is examined. Three audio feature extraction techniques, Mel spectrogram (Mel), Mel-frequency cepstral coefficients (MFCC), and Log-Mel, are evaluated on the ESC-10, ESC-50, and UrbanSound8K (US8K) datasets. Furthermore, to reduce the risk of overfitting on limited data, the study applies offline data augmentation to these datasets in combination with L2 regularization. The performance evaluation shows that the best accuracy is attained by the proposed DCNN without max-pooling (Model-2) using Log-Mel features on the augmented datasets. For ESC-10, ESC-50, and US8K, the highest accuracies achieved are 94.94%, 89.28%, and 95.37%, respectively. The experimental results show that the proposed approach can achieve the best performance on environmental sound classification problems.
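
As an illustration of the kind of pipeline the abstract describes, the minimal sketch below extracts Log-Mel features with librosa, applies two simple offline augmentations (pitch shifting and time stretching), and builds a small L2-regularized CNN without max-pooling in Keras. The sampling rate, clip length, layer sizes, augmentation choices, and regularization strength are assumptions made for illustration, not the authors' exact configuration.

```python
# Minimal sketch: Log-Mel features, offline augmentation, and an
# L2-regularized CNN without max-pooling. Hyperparameters are illustrative.
import numpy as np
import librosa
from tensorflow.keras import layers, regularizers, models

SR = 22050          # assumed sampling rate
N_MELS = 128        # assumed number of Mel bands
N_CLASSES = 10      # e.g. ESC-10

def log_mel(path, duration=5.0):
    """Load a clip and return a fixed-size Log-Mel spectrogram."""
    y, _ = librosa.load(path, sr=SR, duration=duration)
    y = librosa.util.fix_length(y, size=int(SR * duration))
    mel = librosa.feature.melspectrogram(y=y, sr=SR, n_mels=N_MELS)
    return librosa.power_to_db(mel, ref=np.max)

def augment(y):
    """Offline augmentation: pitch shift and time stretch (illustrative choices)."""
    shifted = librosa.effects.pitch_shift(y, sr=SR, n_steps=2)
    stretched = librosa.effects.time_stretch(y, rate=0.9)
    return [shifted, stretched]

def build_dcnn(input_shape, l2=1e-3):
    """CNN without max-pooling; strided convolutions reduce resolution instead."""
    reg = regularizers.l2(l2)
    model = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(32, 3, strides=2, activation="relu", kernel_regularizer=reg),
        layers.Conv2D(64, 3, strides=2, activation="relu", kernel_regularizer=reg),
        layers.Conv2D(128, 3, strides=2, activation="relu", kernel_regularizer=reg),
        layers.GlobalAveragePooling2D(),
        layers.Dropout(0.5),
        layers.Dense(N_CLASSES, activation="softmax", kernel_regularizer=reg),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

In this sketch, a Log-Mel array of shape (N_MELS, time_frames) would be given a channel axis, e.g. X[..., np.newaxis], before being passed to build_dcnn((N_MELS, time_frames, 1)).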
ISSN: 0003-682X, 1872-910X
DOI: 10.1016/j.apacoust.2020.107389