Learning incremental audio–visual representation for continual multimodal understanding


Bibliographic Details
Published in: Knowledge-Based Systems, Vol. 304, p. 112513
Main Authors: Zhu, Boqing; Wang, Changjian; Xu, Kele; Feng, Dawei; Zhou, Zemin; Zhu, Xiaoqian
Format: Journal Article
Language: English
Published: Elsevier B.V., 25.11.2024

More Information
Summary: Deep learning methods have demonstrated remarkable success in processing static datasets for various video tasks. However, when confronted with continuous data streams, these approaches often encounter the challenge of catastrophic forgetting, a phenomenon that leads to a significant decline in overall performance when learning new classes incrementally. Moreover, existing methods tend to overlook the correlation between the audio and visual modalities in video incremental learning, despite their joint significance in scene comprehension. Continuously learning new classes while retaining knowledge of old videos under limited storage and computing resources is becoming imperative in the field of multimodal learning. In this paper, we introduce CavRL, a pioneering benchmark for audio–visual representation learning under class-incremental scenarios. To mitigate catastrophic forgetting, we propose a rehearsal-based training approach that leverages a small exemplar set from previous classes. Our approach constrains the memory buffer within strict storage limits, optimizing exemplar selection by learning correlative audio–visual representations. Additionally, we employ a distillation method to mitigate forgetting in a self-supervised manner. Evaluations on two prevalent multimodal tasks, audio–visual event classification and audio–visual speaker recognition, demonstrate that CavRL outperforms existing state-of-the-art incremental learning methods across various settings. We anticipate that CavRL will significantly advance research in continual multimodal learning.
Highlights:
•Catastrophic Forgetting Challenge in Multimodal Learning. Deep learning grapples with catastrophic forgetting on continuous multimodal data.
•Audio–Visual Correlation. Considers the relationship between the audio and visual modalities in continual learning.
•Self-Supervised Distillation to Alleviate Forgetting. CavRL implements a self-supervised distillation method to reduce forgetting.
•CavRL Benchmark. CavRL is a new benchmark for audio–visual learning in class-incremental scenarios.
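The abstract describes two core mechanisms: rehearsal with a storage-limited exemplar buffer, and a distillation term that penalizes drift from the previous model's representations. The sketch below is a minimal illustration of both ideas, not the paper's exact algorithm: `select_exemplars` keeps, per class, the samples whose joint (concatenated, L2-normalized) audio–visual embeddings lie closest to the class mean, a herding-style heuristic; `distillation_loss` is a generic cosine feature-distillation penalty between a frozen "old" model's features and the current model's features. All function and variable names here are assumptions for illustration.

```python
import numpy as np

def _l2norm(x):
    # Row-wise L2 normalization with a small epsilon for stability.
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-8)

def select_exemplars(audio_emb, visual_emb, labels, budget_per_class):
    """Herding-style exemplar selection on joint audio-visual embeddings.

    For each class, keep the `budget_per_class` samples whose concatenated
    normalized audio+visual embedding is closest to the class mean.
    Returns a dict: class label -> list of selected sample indices.
    """
    joint = np.concatenate([_l2norm(audio_emb), _l2norm(visual_emb)], axis=1)
    memory = {}
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        class_mean = joint[idx].mean(axis=0)
        dists = np.linalg.norm(joint[idx] - class_mean, axis=1)
        keep = idx[np.argsort(dists)[:budget_per_class]]
        memory[int(c)] = keep.tolist()
    return memory

def distillation_loss(student_feat, teacher_feat):
    """Label-free feature distillation: mean cosine distance between the
    current model's features and the frozen previous model's features."""
    s, t = _l2norm(student_feat), _l2norm(teacher_feat)
    return float(np.mean(1.0 - np.sum(s * t, axis=1)))
```

In a rehearsal loop, the selected exemplar indices would be replayed alongside new-class batches, with `distillation_loss` added to the task loss; since it compares representations rather than labels, it can be computed on unlabeled current-task data, matching the "self-supervised" framing in the abstract.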
ISSN: 0950-7051
DOI: 10.1016/j.knosys.2024.112513