Towards Perceiving Small Visual Details in Zero-shot Visual Question Answering with Multimodal LLMs

Bibliographic Details

Main Authors: Zhang, Jiarui; Khayatkhoei, Mahyar; Chhikara, Prateek; Ilievski, Filip
Format: Journal Article (arXiv preprint)
Language: English
Published: 2023-10-24
Subjects: Computer Science - Computation and Language; Computer Science - Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2310.16033
DOI: 10.48550/arXiv.2310.16033
License: CC BY 4.0 (http://creativecommons.org/licenses/by/4.0)

Abstract
Multimodal Large Language Models (MLLMs) have recently achieved promising zero-shot accuracy on visual question answering (VQA) -- a fundamental task affecting various downstream applications and domains. Given the great potential for the broad use of these models, it is important to investigate their limitations in dealing with different image and question properties. In this work, we investigate whether MLLMs can perceive small details as well as large details in images. In particular, we show that their zero-shot accuracy in answering visual questions is very sensitive to the size of the visual subject of the question, declining up to 46% with size. Furthermore, we show that this effect is causal by observing that human visual cropping can significantly mitigate their sensitivity to size. Inspired by the usefulness of human cropping, we then propose five automatic visual cropping methods -- leveraging either external localization models or the decision process of the given MLLM itself -- as inference time mechanisms to improve the zero-shot performance of MLLMs. We study their effectiveness on four popular VQA datasets, and a subset of the VQAv2 dataset tailored towards fine visual details. Our findings suggest that MLLMs should be used with caution in detail-sensitive VQA applications, and that visual cropping is a promising direction to improve their zero-shot performance. To facilitate further investigation of MLLMs' behaviors, our code and data are publicly released.
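
The cropping idea described in the abstract can be illustrated with a short sketch. The snippet below is a hypothetical illustration, not the authors' released code: it assumes an external localization model that returns a bounding box for the question's visual subject (the `localize` callable) and a wrapper around the MLLM under test (the `mllm_answer` callable), both stand-ins for whatever concrete models a reader plugs in.

```python
# Minimal sketch of inference-time visual cropping for VQA
# (hypothetical; not the paper's released implementation).
from typing import Callable, Tuple

from PIL import Image

Box = Tuple[float, float, float, float]  # (left, top, right, bottom) in pixels


def crop_around_subject(image: Image.Image, box: Box,
                        pad_ratio: float = 0.2) -> Image.Image:
    """Crop to the subject's bounding box, enlarged by a relative padding margin."""
    left, top, right, bottom = box
    pad_w = (right - left) * pad_ratio
    pad_h = (bottom - top) * pad_ratio
    return image.crop((
        max(0.0, left - pad_w),
        max(0.0, top - pad_h),
        min(float(image.width), right + pad_w),
        min(float(image.height), bottom + pad_h),
    ))


def answer_with_cropping(
    image: Image.Image,
    question: str,
    localize: Callable[[Image.Image, str], Box],     # assumed external localizer
    mllm_answer: Callable[[Image.Image, str], str],  # assumed MLLM wrapper
) -> str:
    """Zoom in on the question's visual subject before querying the MLLM."""
    box = localize(image, question)
    cropped = crop_around_subject(image, box)
    # Answering from the cropped view keeps a small visual subject at a
    # usable resolution instead of letting it shrink to a few pixels after
    # the MLLM's input downsampling.
    return mllm_answer(cropped, question)
```

The paper proposes five such cropping mechanisms, including variants that derive the crop from the MLLM's own decision process rather than an external localizer; the sketch above shows only the simplest external-localizer form.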