Learning incremental audio–visual representation for continual multimodal understanding
Published in: Knowledge-Based Systems, Vol. 304, p. 112513
Format: Journal Article
Language: English
Published: Elsevier B.V., 25.11.2024
Summary: Deep learning methods have demonstrated remarkable success in processing static datasets for various video tasks. However, when confronted with continuous data streams, these approaches often encounter the challenge of catastrophic forgetting, which leads to a significant decline in overall performance when learning new classes incrementally. Moreover, existing methods tend to overlook the correlation between audio and visual modalities in video incremental learning, despite their joint significance in scene comprehension. Continuously learning from new classes while maintaining knowledge of old videos under limited storage and computing resources is becoming imperative in the field of multimodal learning. In this paper, we introduce CavRL, a pioneering benchmark for audio–visual representation learning under class-incremental scenarios. To mitigate catastrophic forgetting, we propose a rehearsal-based training approach that leverages a small exemplar set from previous classes. Our approach constrains the memory buffer within strict storage limits, optimizing exemplar selection by learning correlative audio–visual representations. Additionally, we employ a distillation method to mitigate forgetting in a self-supervised manner. Evaluations on two prevalent multimodal tasks, audio–visual event classification and audio–visual speaker recognition, demonstrate that CavRL outperforms existing state-of-the-art incremental learning methods across various settings. We anticipate that CavRL will significantly advance research in continual multimodal learning.
•Catastrophic Forgetting Challenge in Multimodal Learning. Deep learning grapples with catastrophic forgetting on continuous multimodal data.
•Audio–Visual Correlation. The relationship between audio and visual modalities is taken into account in continual learning.
•Self-Supervised Distillation to Alleviate Forgetting. CavRL implements a self-supervised distillation method to reduce forgetting.
•CavRL Benchmark. CavRL is a new benchmark for audio–visual learning in class-incremental scenarios.
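The rehearsal-based idea described in the summary can be sketched minimally: a fixed-capacity exemplar buffer replays a few samples from old classes alongside new data, and a distillation term keeps the new model's outputs close to the old model's. This is an illustrative sketch only, not CavRL's actual method: the paper selects exemplars via learned audio–visual correlations, whereas the stand-in below uses reservoir sampling, and `RehearsalBuffer` and `distillation_loss` are hypothetical names.

```python
import math
import random


class RehearsalBuffer:
    """Fixed-capacity exemplar memory for class-incremental learning.

    Reservoir sampling is used here as a simple stand-in for CavRL's
    correlation-aware exemplar selection; it keeps a uniform random
    subset of all samples seen so far within the storage limit.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self.exemplars = []  # (sample, label) pairs from past classes
        self.seen = 0

    def add(self, sample, label):
        self.seen += 1
        if len(self.exemplars) < self.capacity:
            self.exemplars.append((sample, label))
        else:
            # Replace a random slot with probability capacity / seen.
            j = random.randrange(self.seen)
            if j < self.capacity:
                self.exemplars[j] = (sample, label)

    def replay_batch(self, k):
        """Sample up to k stored exemplars to mix into the current batch."""
        return random.sample(self.exemplars, min(k, len(self.exemplars)))


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between temperature-softened old (teacher) and
    new (student) model outputs; zero when the outputs match."""

    def softmax(xs, t):
        m = max(xs)
        exps = [math.exp((x - m) / t) for x in xs]
        s = sum(exps)
        return [e / s for e in exps]

    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```

In training, each incremental step would mix `replay_batch` exemplars into new-class batches and add `distillation_loss` (weighted) to the classification objective, so old-class behavior is preserved without storing the full old dataset.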
ISSN: 0950-7051
DOI: 10.1016/j.knosys.2024.112513