A Sketch Is Worth a Thousand Words: Image Retrieval with Text and Sketch

Bibliographic Details
Main Authors Sangkloy, Patsorn; Jitkrittum, Wittawat; Yang, Diyi; Hays, James
Format Journal Article
Language English
Published 05.08.2022
Abstract We address the problem of retrieving images with both a sketch and a text query. We present TASK-former (Text And SKetch transformer), an end-to-end trainable model for image retrieval using a text description and a sketch as input. We argue that both input modalities complement each other in a manner that cannot be achieved easily by either one alone. TASK-former follows the late-fusion dual-encoder approach, similar to CLIP, which allows efficient and scalable retrieval since the retrieval set can be indexed independently of the queries. We empirically demonstrate that using an input sketch (even a poorly drawn one) in addition to text considerably increases retrieval recall compared to traditional text-based image retrieval. To evaluate our approach, we collect 5,000 hand-drawn sketches for images in the test set of the COCO dataset. The collected sketches are available at https://janesjanes.github.io/tsbir/.
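The late-fusion dual-encoder idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the encoders are omitted and replaced by plain vectors, the function names (`build_index`, `fuse_query`, `retrieve`, `recall_at_k`) are hypothetical, and the fusion rule (element-wise sum of the text and sketch embeddings) is an assumption, since the abstract does not say how the two query modalities are combined. What the sketch does show is why the retrieval set can be indexed independently of the queries: image embeddings are computed once, and each query costs only one matrix-vector product.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products equal cosine similarity."""
    x = np.asarray(x, dtype=np.float64)
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def build_index(image_embeddings):
    """Precompute unit-length image embeddings once, before any query arrives."""
    return l2_normalize(image_embeddings)


def fuse_query(text_embedding, sketch_embedding):
    """Combine the two query embeddings into a single query vector.

    ASSUMPTION: element-wise sum is used purely for illustration; the
    abstract does not specify the fusion rule.
    """
    return l2_normalize(np.asarray(text_embedding, dtype=np.float64)
                        + np.asarray(sketch_embedding, dtype=np.float64))


def retrieve(index, query, k=5):
    """Return the indices and cosine scores of the k best-matching images."""
    scores = index @ query
    top = np.argsort(-scores)[:k]
    return top, scores[top]


def recall_at_k(index, queries, targets, k):
    """Fraction of queries whose target image appears among the top k results."""
    hits = 0
    for query, target in zip(queries, targets):
        top, _ = retrieve(index, query, k)
        hits += int(target in top)
    return hits / len(targets)
```

In this sketch, `build_index` would run once over the whole image collection, while `fuse_query` and `retrieve` run per query; `recall_at_k` corresponds to the kind of retrieval-recall measurement the abstract refers to, under the same assumptions.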
Copyright http://creativecommons.org/licenses/by/4.0
DOI 10.48550/arxiv.2208.03354
IsDoiOpenAccess true
IsOpenAccess true
IsPeerReviewed false
IsScholarly false
OpenAccessLink https://arxiv.org/abs/2208.03354
PublicationDate 2022-08-05
SecondaryResourceType preprint
SubjectTerms Computer Science - Computer Vision and Pattern Recognition
Computer Science - Learning