CLIP-Llama: A New Approach for Scene Text Recognition with a Pre-Trained Vision-Language Model and a Pre-Trained Language Model
| Published in | Sensors (Basel, Switzerland), Vol. 24, No. 22, p. 7371 |
|---|---|
| Main authors | Zhao, Xiaoqing; Xu, Miaomiao; Silamu, Wushour; Li, Yanbing |
| Format | Journal Article |
| Language | English |
| Published | Basel, Switzerland: MDPI AG, 19 November 2024 |
Abstract

This study focuses on Scene Text Recognition (STR), which plays a crucial role in various applications of artificial intelligence such as image retrieval, office automation, and intelligent transportation systems. Currently, pre-trained vision-language models have become the foundation for various downstream tasks. CLIP exhibits robustness in recognizing both regular (horizontal) and irregular (rotated, curved, blurred, or occluded) text in natural images. Because scene text recognition requires substantial linguistic knowledge, we introduce the pre-trained vision-language model CLIP and the pre-trained language model Llama. Our approach builds upon CLIP's image and text encoders, featuring two encoder–decoder branches: a visual branch and a cross-modal branch. The visual branch provides initial predictions based on image features, while the cross-modal branch refines these predictions by addressing the differences between image features and textual semantics. We incorporate the large language model Llama2-7B in the cross-modal branch to assist in correcting erroneous predictions generated by the decoder. To fully leverage the potential of both branches, we employ a dual prediction and refinement decoding scheme during inference, resulting in improved accuracy. Experimental results demonstrate that CLIP-Llama achieves state-of-the-art performance on 11 STR benchmarks, showcasing its robust capabilities. We believe that CLIP-Llama lays a solid and straightforward foundation for future research in scene text recognition based on vision-language models.
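The abstract's architecture lends itself to a short illustration. Below is a minimal sketch of the dual-branch, prediction-then-refinement flow it describes; every name and dimension here (DualBranchSTR, the tiny stand-in encoder, VOCAB/MAX_LEN/DIM) is a hypothetical placeholder rather than the authors' code, and the real system would substitute the pre-trained CLIP encoders and Llama2-7B for the stand-in layers.

```python
# Minimal sketch of the dual-branch prediction-and-refinement scheme the
# abstract describes. All modules are illustrative stand-ins, not the
# authors' implementation: small random layers take the place of the
# pre-trained CLIP encoders and the Llama2-7B-assisted refiner.
import torch
import torch.nn as nn

VOCAB, MAX_LEN, DIM = 100, 25, 512  # hypothetical charset size, text length, width

class DualBranchSTR(nn.Module):
    def __init__(self):
        super().__init__()
        # Stand-in for CLIP's image encoder (a frozen ViT in the real system).
        self.image_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 128, DIM))
        # Visual branch: initial character predictions from image features alone.
        self.visual_decoder = nn.Linear(DIM, MAX_LEN * VOCAB)
        # Cross-modal branch: embeds the first-pass text and re-attends to the
        # image to correct it (where Llama2-7B would assist with semantics).
        self.char_embed = nn.Embedding(VOCAB, DIM)
        self.refiner = nn.TransformerDecoderLayer(d_model=DIM, nhead=8, batch_first=True)
        self.refine_head = nn.Linear(DIM, VOCAB)

    def forward(self, images):
        feats = self.image_encoder(images)                        # (B, DIM)
        init_logits = self.visual_decoder(feats).view(-1, MAX_LEN, VOCAB)
        init_pred = init_logits.argmax(-1)                        # first-pass reading
        # (argmax is non-differentiable; training would use teacher forcing)
        text = self.char_embed(init_pred)                         # (B, MAX_LEN, DIM)
        fused = self.refiner(text, feats.unsqueeze(1))            # attend to image memory
        refined_logits = self.refine_head(fused)
        return init_logits, refined_logits

model = DualBranchSTR()
logits0, logits1 = model(torch.randn(2, 3, 32, 128))              # dummy 32x128 crops
print(logits0.shape, logits1.shape)                               # both (2, 25, 100)
```

The "dual prediction and refinement decoding scheme" the abstract mentions would correspond, in a sketch like this, to using both `init_logits` and `refined_logits` at inference rather than trusting the first pass alone.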
| Audience | Academic |
|---|---|
| Author affiliation | College of Computer Science and Technology, Xinjiang University, No. 777 Huarui Street, Urumqi 830017, China |
| Copyright | © 2024 by the authors. Licensee MDPI, Basel, Switzerland. Open access article distributed under the terms of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/). |
| DOI | 10.3390/s24227371 |
| Discipline | Engineering |
| EISSN | 1424-8220 |
| ISSN | 1424-8220 |
| Open access | Yes |
| Peer reviewed | Yes |
| Issue | 22 |
| Keywords | scene text recognition; vision-language model; pre-trained language model |
| ORCID | 0000-0001-5368-6921 (Li, Yanbing) |
| Open access link | http://journals.scholarsportal.info/openUrl.xqy?doi=10.3390/s24227371 |
| PMID | 39599146 |
| Subject terms | Artificial intelligence; Computer vision; Deep learning; Image retrieval; Language; Llamas; Natural language processing; pre-trained language model; scene text recognition; Semantics; vision-language model |
| Online access | https://www.ncbi.nlm.nih.gov/pubmed/39599146; https://www.proquest.com/docview/3133390013; https://www.proquest.com/docview/3133459932; https://pubmed.ncbi.nlm.nih.gov/PMC11598184; https://doaj.org/article/b4773dc7eeff4998972ff63cb4a116f4 |
| Volume | 24 |