XMeCap: Meme Caption Generation with Sub-Image Adaptability

Bibliographic Details
Main Authors: Chen, Yuyan; Yan, Songzhou; Zhu, Zhihong; Li, Zhixu; Xiao, Yanghua
Format: Journal Article (preprint)
Language: English
Published: 24.07.2024
Subjects: Computer Science - Artificial Intelligence; Computer Science - Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2407.17152
DOI: 10.48550/arxiv.2407.17152
Copyright: http://creativecommons.org/licenses/by/4.0

Abstract: Humor, deeply rooted in societal meanings and cultural details, poses a unique challenge for machines. While advances have been made in natural language processing, real-world humor often thrives in a multi-modal context, encapsulated distinctively by memes. This paper places particular emphasis on the impact of multiple images on meme captioning. We then introduce the XMeCap framework, a novel approach that adopts supervised fine-tuning and reinforcement learning based on an innovative reward model, which factors in both global and local similarities between visuals and text. Our results, benchmarked against contemporary models, show a marked improvement in caption generation for both single-image and multi-image memes, as well as across different meme categories. XMeCap achieves an average evaluation score of 75.85 for single-image memes and 66.32 for multi-image memes, outperforming the best baseline by 3.71% and 4.82%, respectively. This research not only establishes a new frontier in meme-related studies but also underscores the potential of machines in understanding and generating humor in a multi-modal setting.
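
To make the abstract's description of the reward model concrete, the sketch below shows one way a reward could blend global and local image-caption similarity. It is not the authors' implementation: the embedding inputs (stand-ins for any CLIP-style image/text encoders), the sub-image split, and the mixing weight alpha are assumptions made purely for illustration.

import numpy as np

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))

def caption_reward(global_image_emb, sub_image_embs, caption_emb, alpha=0.5):
    # Global term: similarity between the whole meme image and the caption.
    global_sim = cosine(global_image_emb, caption_emb)
    # Local term: average similarity between each sub-image region and the caption.
    if sub_image_embs:
        local_sim = float(np.mean([cosine(e, caption_emb) for e in sub_image_embs]))
    else:
        local_sim = global_sim
    # Blend the two; alpha is an assumed hyperparameter, not a value from the paper.
    return alpha * global_sim + (1.0 - alpha) * local_sim

# Example usage with random stand-in embeddings:
rng = np.random.default_rng(0)
score = caption_reward(rng.normal(size=512),
                       [rng.normal(size=512) for _ in range(2)],
                       rng.normal(size=512))

In the framework as described in the abstract, a score of this kind would drive the reinforcement-learning stage on top of supervised fine-tuning.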