CLIP-Llama: A New Approach for Scene Text Recognition with a Pre-Trained Vision-Language Model and a Pre-Trained Language Model

Bibliographic Details
Published in: Sensors (Basel, Switzerland), Vol. 24, No. 22, p. 7371
Main Authors: Zhao, Xiaoqing; Xu, Miaomiao; Silamu, Wushour; Li, Yanbing
Format: Journal Article
Language: English
Published: MDPI AG, Switzerland, 19 November 2024

Abstract
This study focuses on Scene Text Recognition (STR), which plays a crucial role in various applications of artificial intelligence such as image retrieval, office automation, and intelligent transportation systems. Currently, pre-trained vision-language models have become the foundation for various downstream tasks. CLIP exhibits robustness in recognizing both regular (horizontal) and irregular (rotated, curved, blurred, or occluded) text in natural images. As research in scene text recognition requires substantial linguistic knowledge, we introduce the pre-trained vision-language model CLIP and the pre-trained language model Llama. Our approach builds upon CLIP’s image and text encoders, featuring two encoder–decoder branches: one visual branch and one cross-modal branch. The visual branch provides initial predictions based on image features, while the cross-modal branch refines these predictions by addressing the differences between image features and textual semantics. We incorporate the large language model Llama2-7B in the cross-modal branch to assist in correcting erroneous predictions generated by the decoder. To fully leverage the potential of both branches, we employ a dual prediction and refinement decoding scheme during inference, resulting in improved accuracy. Experimental results demonstrate that CLIP-Llama achieves state-of-the-art performance on 11 STR benchmark tests, showcasing its robust capabilities. We firmly believe that CLIP-Llama lays a solid and straightforward foundation for future research in scene text recognition based on vision-language models.
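
The two-branch "predict then refine" decoding described in the abstract can be pictured with a small sketch. The following is a minimal, hypothetical PyTorch illustration only: VisualBranch, CrossModalBranch, and the tiny placeholder modules inside them (image_encoder, text_encoder, llm_refiner, fuse) are assumed stand-ins for CLIP's encoders and the Llama2-7B refiner, not the authors' implementation, so the example runs anywhere without the real pre-trained weights.

import torch
import torch.nn as nn

VOCAB, MAX_LEN, DIM = 100, 25, 512  # assumed charset size, max text length, feature dim


class VisualBranch(nn.Module):
    """Stand-in for the visual encoder-decoder: image features -> initial character logits."""
    def __init__(self):
        super().__init__()
        self.image_encoder = nn.Linear(DIM, DIM)   # placeholder for CLIP's image encoder
        self.decoder = nn.Linear(DIM, VOCAB)       # placeholder decoder head

    def forward(self, image_feats):                # (B, MAX_LEN, DIM) sequence of image features
        return self.decoder(self.image_encoder(image_feats))  # (B, MAX_LEN, VOCAB) draft logits


class CrossModalBranch(nn.Module):
    """Stand-in for the cross-modal refiner: encodes the drafted text, fuses it with
    the image features, and re-scores every character position."""
    def __init__(self):
        super().__init__()
        self.text_encoder = nn.Embedding(VOCAB, DIM)            # placeholder for CLIP's text encoder
        self.llm_refiner = nn.GRU(DIM, DIM, batch_first=True)   # placeholder for the Llama2-7B refiner
        self.fuse = nn.Linear(2 * DIM, VOCAB)

    def forward(self, image_feats, draft_tokens):
        text_feats, _ = self.llm_refiner(self.text_encoder(draft_tokens))
        fused = torch.cat([image_feats, text_feats], dim=-1)
        return self.fuse(fused)                                  # refined logits (B, MAX_LEN, VOCAB)


@torch.no_grad()
def predict_and_refine(visual, cross_modal, image_feats):
    """Dual prediction-and-refinement decoding: the visual branch drafts a string,
    the cross-modal branch corrects it using linguistic context."""
    draft_tokens = visual(image_feats).argmax(dim=-1)
    refined_logits = cross_modal(image_feats, draft_tokens)
    return refined_logits.argmax(dim=-1)


if __name__ == "__main__":
    feats = torch.randn(2, MAX_LEN, DIM)                         # dummy "image features" for 2 crops
    tokens = predict_and_refine(VisualBranch(), CrossModalBranch(), feats)
    print(tokens.shape)                                          # torch.Size([2, 25])
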
Audience: Academic
Author Affiliation: College of Computer Science and Technology, Xinjiang University, No. 777 Huarui Street, Urumqi 830017, China
Copyright: 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
DOI: 10.3390/s24227371
Discipline: Engineering
EISSN: 1424-8220
PMCID: PMC11598184
ISSN: 1424-8220
Keywords: vision-language model; pre-trained language model; scene text recognition
License: Creative Commons Attribution (CC BY), https://creativecommons.org/licenses/by/4.0
ORCID: 0000-0001-5368-6921 (Li, Yanbing)
PMID: 39599146
Subject Terms: Artificial intelligence; Computer vision; Deep learning; Image retrieval; Language; Llamas; Natural language processing; pre-trained language model; scene text recognition; Semantics; vision-language model