Contrastive Conditional Latent Diffusion for Audio-visual Segmentation

Bibliographic Details
Main Authors Mao, Yuxin; Zhang, Jing; Xiang, Mochu; Lv, Yunqiu; Zhong, Yiran; Dai, Yuchao
Format Journal Article (arXiv preprint)
Language English
Published 31.07.2023
DOI 10.48550/arxiv.2307.16579
Subjects Computer Science - Computer Vision and Pattern Recognition; Computer Science - Multimedia; Computer Science - Sound
Online Access https://arxiv.org/abs/2307.16579
Copyright http://arxiv.org/licenses/nonexclusive-distrib/1.0

Abstract We propose a latent diffusion model with contrastive learning for audio-visual segmentation (AVS) to extensively explore the contribution of audio. We interpret AVS as a conditional generation task, where audio is defined as the conditional variable for segmenting the sound producer(s). Under this interpretation, it is especially necessary to model the correlation between the audio and the final segmentation map to ensure the audio's contribution. We introduce a latent diffusion model to our framework to achieve semantic-correlated representation learning. Specifically, our diffusion model learns the conditional generation process of the ground-truth segmentation map, leading to ground-truth-aware inference when we perform the denoising process at the test stage. As a conditional diffusion model, we argue it is essential to ensure that the conditional variable contributes to the model output. We then introduce contrastive learning to our framework to learn audio-visual correspondence, which is proven to be consistent with maximizing the mutual information between the model prediction and the audio data. In this way, our latent diffusion model via contrastive learning explicitly maximizes the contribution of audio for AVS. Experimental results on the benchmark dataset verify the effectiveness of our solution. Code and results are available via our project page: https://github.com/OpenNLPLab/DiffusionAVS.
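The abstract describes two coupled training objectives: a conditional denoising-diffusion loss over the ground-truth segmentation map, with audio as the conditioning variable, and an InfoNCE-style contrastive loss whose maximization is a standard lower bound on the mutual information between the prediction and the audio. The sketch below illustrates what one combined training step could look like. It is a minimal PyTorch-style sketch under assumptions of mine: the module names (`audio_enc`, `mask_ae`, `denoiser`), the flattened latent shape, the temperature 0.07, and the loss weight `lambda_con` are all illustrative, not the authors' implementation (see the linked repository for that).

```python
# Hedged sketch of a conditional latent diffusion + contrastive training step
# for AVS. Module interfaces and hyperparameters are illustrative assumptions,
# not the code from https://github.com/OpenNLPLab/DiffusionAVS.
import torch
import torch.nn.functional as F


def cosine_alpha_bar(t, T, s=0.008):
    """Cosine schedule (Nichol & Dhariwal) for the cumulative product alpha_bar."""
    f = torch.cos(((t.float() / T) + s) / (1 + s) * torch.pi / 2) ** 2
    f0 = torch.cos(torch.tensor(s / (1 + s) * torch.pi / 2)) ** 2
    return (f / f0).clamp(1e-5, 1.0)


def training_step(audio_enc, mask_ae, denoiser, audio, image, gt_mask,
                  T=1000, lambda_con=0.1):
    B = gt_mask.shape[0]

    # 1) Encode the ground-truth mask into a latent (latent diffusion operates
    #    in this space); assume a flattened latent of shape (B, D).
    z0 = mask_ae.encode(gt_mask, image)

    # 2) Standard DDPM forward process: noise the latent at a random timestep.
    t = torch.randint(0, T, (B,), device=z0.device)
    noise = torch.randn_like(z0)
    abar = cosine_alpha_bar(t, T).view(B, 1)
    zt = abar.sqrt() * z0 + (1 - abar).sqrt() * noise

    # 3) Predict the noise, conditioned on the audio (and image) features,
    #    so the generation process is audio-conditional.
    a = audio_enc(audio)                       # (B, D) audio condition
    noise_pred = denoiser(zt, t, cond=a, image=image)
    loss_diff = F.mse_loss(noise_pred, noise)

    # 4) InfoNCE contrastive loss: align each audio embedding with its own
    #    mask latent against the rest of the batch. Maximizing this objective
    #    lower-bounds the audio-prediction mutual information.
    logits = F.normalize(a, dim=-1) @ F.normalize(z0, dim=-1).T / 0.07
    labels = torch.arange(B, device=logits.device)
    loss_con = F.cross_entropy(logits, labels)

    return loss_diff + lambda_con * loss_con
```

At test time, per the abstract, no ground-truth mask is available: the segmentation latent would instead be produced by running the reverse denoising process from random noise, conditioned on the clip's audio, and then decoded to a segmentation map.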