A Sketch Is Worth a Thousand Words: Image Retrieval with Text and Sketch

We address the problem of retrieving images with both a sketch and a text query. We present TASK-former (Text And SKetch transformer), an end-to-end trainable model for image retrieval using a text description and a sketch as input. We argue that both input modalities complement each other in a mann...

Bibliographic Details
Main Authors	Sangkloy, Patsorn; Jitkrittum, Wittawat; Yang, Diyi; Hays, James
Format	Journal Article
Language	English
Published	05.08.2022
Subjects	Computer Science - Computer Vision and Pattern Recognition; Computer Science - Learning
Abstract	We address the problem of retrieving images with both a sketch and a text query. We present TASK-former (Text And SKetch transformer), an end-to-end trainable model for image retrieval using a text description and a sketch as input. We argue that both input modalities complement each other in a manner that cannot be achieved easily by either one alone. TASK-former follows the late-fusion dual-encoder approach, similar to CLIP, which allows efficient and scalable retrieval since the retrieval set can be indexed independently of the queries. We empirically demonstrate that using an input sketch (even a poorly drawn one) in addition to text considerably increases retrieval recall compared to traditional text-based image retrieval. To evaluate our approach, we collect 5,000 hand-drawn sketches for images in the test set of the COCO dataset. The collected sketches are available at https://janesjanes.github.io/tsbir/.
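
The abstract's central design point, that a dual-encoder lets the retrieval set be indexed independently of the queries, can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not taken from the paper: the 3-dimensional vectors stand in for encoder outputs, the element-wise sum in `fuse_query` is only one possible way to combine text and sketch embeddings, and the function names are hypothetical.

```python
import math


def normalize(v):
    """Scale a non-zero vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def build_index(image_embs):
    """Precompute unit-length image embeddings once, without seeing any query."""
    return [normalize(e) for e in image_embs]


def fuse_query(text_emb, sketch_emb):
    """ASSUMPTION: combine the two query embeddings by element-wise sum.

    The abstract does not specify how TASK-former fuses its inputs;
    this is a placeholder for whatever the query encoder produces.
    """
    return normalize([t + s for t, s in zip(text_emb, sketch_emb)])


def retrieve(query, index, k):
    """Return the positions of the k index entries most similar to the query."""
    scores = [sum(q * x for q, x in zip(query, img)) for img in index]
    ranked = sorted(range(len(index)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


def recall_at_k(queries, targets, index, k):
    """Fraction of queries whose target image appears in the top-k results."""
    hits = sum(1 for q, t in zip(queries, targets) if t in retrieve(q, index, k))
    return hits / len(queries)


# Toy stand-ins for encoder outputs (hypothetical values).
index = build_index([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]])
text_emb = [1.0, 0.0, 0.0]
sketch_emb = [0.0, 1.0, 0.0]

text_only = normalize(text_emb)
text_and_sketch = fuse_query(text_emb, sketch_emb)

top_text_only = retrieve(text_only, index, 1)
top_fused = retrieve(text_and_sketch, index, 1)
recall_fused = recall_at_k([text_and_sketch], [2], index, 1)
```

In this toy setup the image index is built before any query exists, and adding the sketch embedding changes which image ranks first. That mirrors the claim in the abstract only qualitatively; it says nothing about the recall figures reported in the paper.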
Author	Yang, Diyi; Sangkloy, Patsorn; Hays, James; Jitkrittum, Wittawat
Author_xml	– sequence: 1 givenname: Patsorn surname: Sangkloy fullname: Sangkloy, Patsorn – sequence: 2 givenname: Wittawat surname: Jitkrittum fullname: Jitkrittum, Wittawat – sequence: 3 givenname: Diyi surname: Yang fullname: Yang, Diyi – sequence: 4 givenname: James surname: Hays fullname: Hays, James
BackLink	https://doi.org/10.48550/arXiv.2208.03354$$DView paper in arXiv
ContentType	Journal Article
Copyright	http://creativecommons.org/licenses/by/4.0
Copyright_xml	– notice: http://creativecommons.org/licenses/by/4.0
DBID	AKY GOX
DOI	10.48550/arxiv.2208.03354
DatabaseName	arXiv Computer Science arXiv.org
DatabaseTitleList
Database_xml	– sequence: 1 dbid: GOX name: arXiv.org url: http://arxiv.org/find sourceTypes: Open Access Repository
DeliveryMethod	fulltext_linktorsrc
ExternalDocumentID	2208_03354
GroupedDBID	AKY GOX
ID	FETCH-LOGICAL-a674-1a89f5d665646a67c8f818aaee7d3c6d3d0eaf180df6df9873ebd6aca443c0823
IEDL.DBID	GOX
IngestDate	Mon Jan 08 05:40:35 EST 2024
IsDoiOpenAccess	true
IsOpenAccess	true
IsPeerReviewed	false
IsScholarly	false
Language	English
LinkModel	DirectLink
MergedId	FETCHMERGED-LOGICAL-a674-1a89f5d665646a67c8f818aaee7d3c6d3d0eaf180df6df9873ebd6aca443c0823
OpenAccessLink	https://arxiv.org/abs/2208.03354
ParticipantIDs	arxiv_primary_2208_03354
PublicationCentury	2000
PublicationDate	2022-08-05
PublicationDateYYYYMMDD	2022-08-05
PublicationDate_xml	– month: 08 year: 2022 text: 2022-08-05 day: 05
PublicationDecade	2020
PublicationYear	2022
Score	1.8514884
SecondaryResourceType	preprint
SourceID	arxiv
SourceType	Open Access Repository
SubjectTerms	Computer Science - Computer Vision and Pattern Recognition; Computer Science - Learning
Title	A Sketch Is Worth a Thousand Words: Image Retrieval with Text and Sketch
URI	https://arxiv.org/abs/2208.03354
hasFullText	1
inHoldings	1
isFullTextHit
isPrint
link.rule.ids	228,230,783,888
linkProvider	Cornell University