Multilingual Text-Based Image Search Using Multimodal Embeddings

Bibliographic Details
Published in: 2022 IEEE 6th Conference on Information and Communication Technology (CICT), pp. 1 - 5
Main Authors: Pereira, Kristen; Parikh, Aman; Kumar, Pranav; Hole, Varsha
Format: Conference Proceeding
Language: English
Published: IEEE, 18.11.2022
DOI: 10.1109/CICT56698.2022.9997911

Summary: The explosion of data and information on the Internet calls for efficient and reliable information retrieval methods. While textual information retrieval systems have improved significantly, content-based image retrieval using text inputs requires further study and optimization. This research paper proposes a system that uses the CLIP (Contrastive Language-Image Pre-Training) model, which projects images and text into a shared multimodal embedding space so that the semantic meaning of a text query can be compared against the embeddings of the images in a dataset. The output is a set of images that closely match the text input, ranked by cosine similarity computed through matrix operations. The models have also been optimized for production use with ONNX Runtime to speed up inference. The application is full-stack and easily accessible, with a ReactJS frontend hosted on Netlify and a Flask-based Python backend hosted on AWS.
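As a rough illustration of the retrieval mechanism described in the summary, the sketch below encodes a dataset of images and a free-text query with CLIP, unit-normalizes both, and ranks images by cosine similarity, which on normalized vectors reduces to a single matrix product. It assumes the Hugging Face transformers implementation and the openai/clip-vit-base-patch32 checkpoint; the paper does not name a specific checkpoint, and a multilingual CLIP variant would be needed for non-English queries.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Checkpoint choice is an assumption for illustration, not from the paper.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_images(paths):
    # Project images into the multimodal embedding space and unit-normalize.
    images = [Image.open(p).convert("RGB") for p in paths]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

def search(query, image_feats, paths, top_k=5):
    # Embed the text query the same way, then score by cosine similarity,
    # which is a single matrix product on unit-normalized vectors.
    inputs = processor(text=[query], return_tensors="pt", padding=True)
    with torch.no_grad():
        text_feat = model.get_text_features(**inputs)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    scores = (image_feats @ text_feat.T).squeeze(1)
    best = scores.topk(min(top_k, len(paths))).indices.tolist()
    return [paths[i] for i in best]

For the ONNX Runtime optimization mentioned in the summary, serving an exported text encoder could look roughly like the following; the file name clip_text.onnx and its input names are hypothetical, chosen to match a typical export rather than details from the paper.

import onnxruntime as ort

# Hypothetical ONNX export of the CLIP text encoder.
session = ort.InferenceSession("clip_text.onnx", providers=["CPUExecutionProvider"])
inputs = processor(text=["a dog running on a beach"], return_tensors="np", padding=True)
# Assumes the graph's input names match the processor's output keys.
text_feat = session.run(None, dict(inputs))[0]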