XMeCap: Meme Caption Generation with Sub-Image Adaptability

Bibliographic Details
Main Authors: Chen, Yuyan; Yan, Songzhou; Zhu, Zhihong; Li, Zhixu; Xiao, Yanghua
Format: Journal Article (preprint)
Language: English
Published: 24.07.2024
Subjects: Computer Science - Artificial Intelligence; Computer Science - Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2407.17152
DOI: 10.48550/arxiv.2407.17152
Copyright: http://creativecommons.org/licenses/by/4.0

Abstract: Humor, deeply rooted in societal meanings and cultural details, poses a unique challenge for machines. While advances have been made in natural language processing, real-world humor often thrives in a multi-modal context, encapsulated distinctively by memes. This paper places particular emphasis on the impact of multiple images on meme captioning. We then introduce the XMeCap framework, a novel approach that adopts supervised fine-tuning and reinforcement learning based on an innovative reward model, which factors in both global and local similarities between visuals and text. Our results, benchmarked against contemporary models, show a marked improvement in caption generation for both single-image and multi-image memes, as well as across different meme categories. XMeCap achieves an average evaluation score of 75.85 for single-image memes and 66.32 for multi-image memes, outperforming the best baseline by 3.71% and 4.82%, respectively. This research not only establishes a new frontier in meme-related studies but also underscores the potential of machines in understanding and generating humor in a multi-modal setting.
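
To make the abstract's description of the reward model concrete, the sketch below shows one way a reward could blend global and local image-caption similarity. It is not the authors' implementation: the embedding inputs (stand-ins for any CLIP-style image/text encoders), the sub-image split, and the mixing weight alpha are assumptions made purely for illustration.

import numpy as np

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))

def caption_reward(global_image_emb, sub_image_embs, caption_emb, alpha=0.5):
    # Global term: similarity between the whole meme image and the caption.
    global_sim = cosine(global_image_emb, caption_emb)
    # Local term: average similarity between each sub-image region and the caption.
    if sub_image_embs:
        local_sim = float(np.mean([cosine(e, caption_emb) for e in sub_image_embs]))
    else:
        local_sim = global_sim
    # Blend the two; alpha is an assumed hyperparameter, not a value from the paper.
    return alpha * global_sim + (1.0 - alpha) * local_sim

# Example usage with random stand-in embeddings:
rng = np.random.default_rng(0)
score = caption_reward(rng.normal(size=512),
                       [rng.normal(size=512) for _ in range(2)],
                       rng.normal(size=512))

In the framework as described in the abstract, a score of this kind would drive the reinforcement-learning stage on top of supervised fine-tuning.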