Contrastive Conditional Latent Diffusion for Audio-visual Segmentation

Bibliographic Details
Main Authors Mao, Yuxin; Zhang, Jing; Xiang, Mochu; Lv, Yunqiu; Zhong, Yiran; Dai, Yuchao
Format Journal Article (arXiv preprint)
Language English
Published 31.07.2023
DOI 10.48550/arxiv.2307.16579
Subjects Computer Science - Computer Vision and Pattern Recognition; Computer Science - Multimedia; Computer Science - Sound
Online Access https://arxiv.org/abs/2307.16579
Copyright http://arxiv.org/licenses/nonexclusive-distrib/1.0

Abstract We propose a latent diffusion model with contrastive learning for audio-visual segmentation (AVS) to extensively explore the contribution of audio. We interpret AVS as a conditional generation task, where audio is defined as the conditional variable for segmenting the sound producer(s). Under this interpretation, it is especially necessary to model the correlation between the audio and the final segmentation map to ensure the audio's contribution. We introduce a latent diffusion model to our framework to achieve semantic-correlated representation learning. Specifically, our diffusion model learns the conditional generation process of the ground-truth segmentation map, leading to ground-truth-aware inference when we perform the denoising process at the test stage. As a conditional diffusion model, we argue it is essential to ensure that the conditional variable contributes to the model output. We then introduce contrastive learning to our framework to learn audio-visual correspondence, which is proven to be consistent with maximizing the mutual information between the model prediction and the audio data. In this way, our latent diffusion model via contrastive learning explicitly maximizes the contribution of audio for AVS. Experimental results on the benchmark dataset verify the effectiveness of our solution. Code and results are available via our project page: https://github.com/OpenNLPLab/DiffusionAVS.
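The abstract describes two coupled training objectives: a conditional denoising-diffusion loss over the ground-truth segmentation map, with audio as the conditioning variable, and an InfoNCE-style contrastive loss whose maximization is a standard lower bound on the mutual information between the prediction and the audio. The sketch below illustrates what one combined training step could look like. It is a minimal PyTorch-style sketch under assumptions of mine: the module names (`audio_enc`, `mask_ae`, `denoiser`), the flattened latent shape, the temperature 0.07, and the loss weight `lambda_con` are all illustrative, not the authors' implementation (see the linked repository for that).

```python
# Hedged sketch of a conditional latent diffusion + contrastive training step
# for AVS. Module interfaces and hyperparameters are illustrative assumptions,
# not the code from https://github.com/OpenNLPLab/DiffusionAVS.
import torch
import torch.nn.functional as F


def cosine_alpha_bar(t, T, s=0.008):
    """Cosine schedule (Nichol & Dhariwal) for the cumulative product alpha_bar."""
    f = torch.cos(((t.float() / T) + s) / (1 + s) * torch.pi / 2) ** 2
    f0 = torch.cos(torch.tensor(s / (1 + s) * torch.pi / 2)) ** 2
    return (f / f0).clamp(1e-5, 1.0)


def training_step(audio_enc, mask_ae, denoiser, audio, image, gt_mask,
                  T=1000, lambda_con=0.1):
    B = gt_mask.shape[0]

    # 1) Encode the ground-truth mask into a latent (latent diffusion operates
    #    in this space); assume a flattened latent of shape (B, D).
    z0 = mask_ae.encode(gt_mask, image)

    # 2) Standard DDPM forward process: noise the latent at a random timestep.
    t = torch.randint(0, T, (B,), device=z0.device)
    noise = torch.randn_like(z0)
    abar = cosine_alpha_bar(t, T).view(B, 1)
    zt = abar.sqrt() * z0 + (1 - abar).sqrt() * noise

    # 3) Predict the noise, conditioned on the audio (and image) features,
    #    so the generation process is audio-conditional.
    a = audio_enc(audio)                       # (B, D) audio condition
    noise_pred = denoiser(zt, t, cond=a, image=image)
    loss_diff = F.mse_loss(noise_pred, noise)

    # 4) InfoNCE contrastive loss: align each audio embedding with its own
    #    mask latent against the rest of the batch. Maximizing this objective
    #    lower-bounds the audio-prediction mutual information.
    logits = F.normalize(a, dim=-1) @ F.normalize(z0, dim=-1).T / 0.07
    labels = torch.arange(B, device=logits.device)
    loss_con = F.cross_entropy(logits, labels)

    return loss_diff + lambda_con * loss_con
```

At test time, per the abstract, no ground-truth mask is available: the segmentation latent would instead be produced by running the reverse denoising process from random noise, conditioned on the clip's audio, and then decoded to a segmentation map.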