CLIP-Llama: A New Approach for Scene Text Recognition with a Pre-Trained Vision-Language Model and a Pre-Trained Language Model

Bibliographic Details
Published in: Sensors (Basel, Switzerland), Vol. 24, No. 22, p. 7371
Main Authors: Zhao, Xiaoqing; Xu, Miaomiao; Silamu, Wushour; Li, Yanbing
Format: Journal Article
Language: English
Published: MDPI AG, Switzerland, 19 November 2024

Abstract
This study focuses on Scene Text Recognition (STR), which plays a crucial role in various applications of artificial intelligence such as image retrieval, office automation, and intelligent transportation systems. Currently, pre-trained vision-language models have become the foundation for various downstream tasks. CLIP exhibits robustness in recognizing both regular (horizontal) and irregular (rotated, curved, blurred, or occluded) text in natural images. As research in scene text recognition requires substantial linguistic knowledge, we introduce the pre-trained vision-language model CLIP and the pre-trained language model Llama. Our approach builds upon CLIP’s image and text encoders, featuring two encoder–decoder branches: one visual branch and one cross-modal branch. The visual branch provides initial predictions based on image features, while the cross-modal branch refines these predictions by addressing the differences between image features and textual semantics. We incorporate the large language model Llama2-7B in the cross-modal branch to assist in correcting erroneous predictions generated by the decoder. To fully leverage the potential of both branches, we employ a dual prediction and refinement decoding scheme during inference, resulting in improved accuracy. Experimental results demonstrate that CLIP-Llama achieves state-of-the-art performance on 11 STR benchmark tests, showcasing its robust capabilities. We firmly believe that CLIP-Llama lays a solid and straightforward foundation for future research in scene text recognition based on vision-language models.
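
The two-branch "predict then refine" decoding described in the abstract can be pictured with a small sketch. The following is a minimal, hypothetical PyTorch illustration only: VisualBranch, CrossModalBranch, and the tiny placeholder modules inside them (image_encoder, text_encoder, llm_refiner, fuse) are assumed stand-ins for CLIP's encoders and the Llama2-7B refiner, not the authors' implementation, so the example runs anywhere without the real pre-trained weights.

import torch
import torch.nn as nn

VOCAB, MAX_LEN, DIM = 100, 25, 512  # assumed charset size, max text length, feature dim


class VisualBranch(nn.Module):
    """Stand-in for the visual encoder-decoder: image features -> initial character logits."""
    def __init__(self):
        super().__init__()
        self.image_encoder = nn.Linear(DIM, DIM)   # placeholder for CLIP's image encoder
        self.decoder = nn.Linear(DIM, VOCAB)       # placeholder decoder head

    def forward(self, image_feats):                # (B, MAX_LEN, DIM) sequence of image features
        return self.decoder(self.image_encoder(image_feats))  # (B, MAX_LEN, VOCAB) draft logits


class CrossModalBranch(nn.Module):
    """Stand-in for the cross-modal refiner: encodes the drafted text, fuses it with
    the image features, and re-scores every character position."""
    def __init__(self):
        super().__init__()
        self.text_encoder = nn.Embedding(VOCAB, DIM)            # placeholder for CLIP's text encoder
        self.llm_refiner = nn.GRU(DIM, DIM, batch_first=True)   # placeholder for the Llama2-7B refiner
        self.fuse = nn.Linear(2 * DIM, VOCAB)

    def forward(self, image_feats, draft_tokens):
        text_feats, _ = self.llm_refiner(self.text_encoder(draft_tokens))
        fused = torch.cat([image_feats, text_feats], dim=-1)
        return self.fuse(fused)                                  # refined logits (B, MAX_LEN, VOCAB)


@torch.no_grad()
def predict_and_refine(visual, cross_modal, image_feats):
    """Dual prediction-and-refinement decoding: the visual branch drafts a string,
    the cross-modal branch corrects it using linguistic context."""
    draft_tokens = visual(image_feats).argmax(dim=-1)
    refined_logits = cross_modal(image_feats, draft_tokens)
    return refined_logits.argmax(dim=-1)


if __name__ == "__main__":
    feats = torch.randn(2, MAX_LEN, DIM)                         # dummy "image features" for 2 crops
    tokens = predict_and_refine(VisualBranch(), CrossModalBranch(), feats)
    print(tokens.shape)                                          # torch.Size([2, 25])
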
Audience: Academic
Author Affiliation: College of Computer Science and Technology, Xinjiang University, No. 777 Huarui Street, Urumqi 830017, China
Copyright: 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
DOI: 10.3390/s24227371
Discipline: Engineering
EISSN: 1424-8220
PMCID: PMC11598184
ISSN: 1424-8220
Keywords: vision-language model; pre-trained language model; scene text recognition
License: Creative Commons Attribution (CC BY), https://creativecommons.org/licenses/by/4.0
ORCID: 0000-0001-5368-6921 (Li, Yanbing)
PMID: 39599146
Subject Terms: Artificial intelligence; Computer vision; Deep learning; Image retrieval; Language; Llamas; Natural language processing; pre-trained language model; scene text recognition; Semantics; vision-language model