CHAPTER: Exploiting Convolutional Neural Network Adapters for Self-supervised Speech Models
Main Authors | , , |
---|---|
Format | Journal Article |
Language | English |
Published | 01.12.2022 |
Subjects | |
Summary: | Self-supervised learning (SSL) is a powerful technique for learning
representations from unlabeled data. Transformer-based models such as HuBERT,
which consist of a feature extractor and transformer layers, are leading the
field in the speech domain. SSL models are fine-tuned on a wide range of
downstream tasks, which involves re-training most of the model for each task.
Previous studies have applied adapters, small lightweight modules commonly used
in Natural Language Processing (NLP), to adapt pre-trained models to new tasks.
However, such efficient tuning techniques provide adaptation only at the
transformer layers and fail to adapt the feature extractor. In this paper, we
propose CHAPTER, an efficient tuning method designed specifically for SSL
speech models, which applies CNN adapters at the feature extractor. With this
method, we fine-tune fewer than 5% of the parameters per task compared with
full fine-tuning, while achieving better and more stable performance. We
empirically find that adding CNN adapters to the feature extractor helps
adaptation on emotion and speaker tasks. For instance, the accuracy of speaker
identification (SID) improves from 87.71 to 91.56, and the accuracy of emotion
recognition (ER) improves by 5%. |
DOI: | 10.48550/arxiv.2212.01282 |
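
The record above is only an abstract, but the core idea it describes — inserting small convolutional adapters into the otherwise frozen CNN feature extractor of an SSL model such as HuBERT and training only those adapters (plus the downstream head) — can be sketched in code. The following is a minimal illustration assuming PyTorch; the residual bottleneck design, the `CNNAdapter` and `freeze_and_attach_adapters` names, and the default sizes are hypothetical stand-ins, not the authors' exact architecture.

```python
import torch
import torch.nn as nn


class CNNAdapter(nn.Module):
    """Small residual 1-D convolutional adapter placed after a layer of the
    CNN feature extractor (hypothetical design, not the paper's exact module)."""

    def __init__(self, channels: int, bottleneck: int = 32, kernel_size: int = 3):
        super().__init__()
        self.down = nn.Conv1d(channels, bottleneck, kernel_size, padding=kernel_size // 2)
        self.act = nn.GELU()
        self.up = nn.Conv1d(bottleneck, channels, kernel_size=1)
        # Zero-init the up-projection so the adapter starts as an identity mapping.
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time), as produced by the feature-extractor convs.
        return x + self.up(self.act(self.down(x)))


def freeze_and_attach_adapters(ssl_model: nn.Module, conv_channels: list) -> nn.ModuleList:
    """Freeze every pre-trained parameter and create one adapter per conv layer.

    `ssl_model` stands in for a HuBERT-style model; `conv_channels` lists the
    output channels of its feature-extractor layers. Wiring the adapters into
    the forward pass (e.g. via forward hooks) is model-specific and omitted here.
    """
    for p in ssl_model.parameters():
        p.requires_grad = False  # only adapters and the downstream head are trained
    return nn.ModuleList(CNNAdapter(c) for c in conv_channels)


if __name__ == "__main__":
    # Toy check: one adapter over a (batch=2, channels=512, time=100) feature map.
    adapter = CNNAdapter(channels=512)
    feats = torch.randn(2, 512, 100)
    assert torch.allclose(adapter(feats), feats)  # identity at initialization
```

Zero-initializing the up-projection makes each adapter behave as an identity mapping at the start of training, so adaptation begins from the pre-trained model's behavior and only the small adapter parameters are updated per task.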