Multilingual Text-Based Image Search Using Multimodal Embeddings

Bibliographic Details
Published in: 2022 IEEE 6th Conference on Information and Communication Technology (CICT), pp. 1 - 5
Main Authors: Pereira, Kristen; Parikh, Aman; Kumar, Pranav; Hole, Varsha
Format: Conference Proceeding
Language: English
Published: IEEE, 18.11.2022
DOI: 10.1109/CICT56698.2022.9997911

Summary: The explosion of data and information on the Internet calls for efficient and reliable information retrieval methods. While textual information retrieval systems have improved significantly, content-based image retrieval using text inputs requires further study and optimization. This research paper proposes a system that uses the CLIP (Contrastive Language-Image Pre-Training) model, which projects images and text into a shared multimodal embedding space so that the semantic meaning of a text query can be compared against the embeddings of the images in a dataset. The output is a set of images that closely match the text input, ranked by cosine similarity computed through matrix operations. The models have also been optimized for production use with ONNX Runtime to speed up inference. The application is full-stack and easily accessible, with a ReactJS frontend hosted on Netlify and a Flask-based Python backend hosted on AWS.
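As a rough illustration of the retrieval mechanism described in the summary, the sketch below encodes a dataset of images and a free-text query with CLIP, unit-normalizes both, and ranks images by cosine similarity, which on normalized vectors reduces to a single matrix product. It assumes the Hugging Face transformers implementation and the openai/clip-vit-base-patch32 checkpoint; the paper does not name a specific checkpoint, and a multilingual CLIP variant would be needed for non-English queries.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Checkpoint choice is an assumption for illustration, not from the paper.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_images(paths):
    # Project images into the multimodal embedding space and unit-normalize.
    images = [Image.open(p).convert("RGB") for p in paths]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

def search(query, image_feats, paths, top_k=5):
    # Embed the text query the same way, then score by cosine similarity,
    # which is a single matrix product on unit-normalized vectors.
    inputs = processor(text=[query], return_tensors="pt", padding=True)
    with torch.no_grad():
        text_feat = model.get_text_features(**inputs)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    scores = (image_feats @ text_feat.T).squeeze(1)
    best = scores.topk(min(top_k, len(paths))).indices.tolist()
    return [paths[i] for i in best]

For the ONNX Runtime optimization mentioned in the summary, serving an exported text encoder could look roughly like the following; the file name clip_text.onnx and its input names are hypothetical, chosen to match a typical export rather than details from the paper.

import onnxruntime as ort

# Hypothetical ONNX export of the CLIP text encoder.
session = ort.InferenceSession("clip_text.onnx", providers=["CPUExecutionProvider"])
inputs = processor(text=["a dog running on a beach"], return_tensors="np", padding=True)
# Assumes the graph's input names match the processor's output keys.
text_feat = session.run(None, dict(inputs))[0]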