Combining string and phonetic similarity matching to identify misspelt names of drugs in medical records written in Portuguese

Bibliographic Details
Published in Journal of Biomedical Semantics Vol. 10; no. S1; article 17 (7 pages)
Main Authors Tissot, Hegler; Dobson, Richard
Format Journal Article
Language English
Published England BioMed Central Ltd 12.11.2019
BioMed Central
BMC
Abstract There is an increasing amount of unstructured medical data that can be analysed for different purposes. However, information extraction from free-text data may be particularly inefficient in the presence of spelling errors. Existing approaches use string similarity methods, coupled with a supporting dictionary, to search for valid words within a text. However, they are not rich enough to encode both typing and phonetic misspellings. Experimental results showed that a joint string and language-dependent phonetic similarity measure is more accurate than traditional string distance metrics when identifying misspelt names of drugs in a set of medical records written in Portuguese. We present a hybrid approach to efficiently perform similarity matching that overcomes the loss of information inherent in using either exact-match search or string-based similarity search methods.
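The idea of scoring a candidate word by both its spelling and its sound can be illustrated with a minimal sketch. The record does not describe the authors' actual phonetic algorithm or scoring formula, so everything here is an assumption made for illustration: the toy `phonetic_key` rules (loosely inspired by Portuguese pronunciation), the equal weighting in `combined_similarity`, the `0.8` threshold, and the hypothetical `LEXICON` of drug names.

```python
import unicodedata


def levenshtein(a: str, b: str) -> int:
    """Edit distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def string_similarity(a: str, b: str) -> float:
    """Normalise edit distance into a similarity in [0, 1]."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


def phonetic_key(word: str) -> str:
    """Toy phonetic normalisation (hypothetical rules, not the paper's)."""
    text = unicodedata.normalize("NFC", word.lower()).replace("ç", "s")
    # Strip remaining accents: decompose, then drop combining marks.
    text = unicodedata.normalize("NFD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    for old, new in (("ph", "f"), ("ch", "x"), ("ss", "s"),
                     ("y", "i"), ("w", "v"), ("z", "s")):
        text = text.replace(old, new)
    if text.startswith("h"):
        text = text[1:]            # assume a leading "h" is silent
    if text.endswith("l"):
        text = text[:-1] + "u"     # assume a final "l" is voiced like "u"
    collapsed = []
    for ch in text:                # collapse runs of repeated letters
        if not collapsed or collapsed[-1] != ch:
            collapsed.append(ch)
    return "".join(collapsed)


def combined_similarity(a: str, b: str, weight: float = 0.5) -> float:
    """Weighted mix of spelling similarity and phonetic-key similarity."""
    spelling = string_similarity(a.lower(), b.lower())
    sound = string_similarity(phonetic_key(a), phonetic_key(b))
    return weight * spelling + (1.0 - weight) * sound


def best_match(token, lexicon, threshold=0.8):
    """Return (entry, score) for the best entry, or None if below threshold."""
    entry = max(lexicon, key=lambda e: combined_similarity(token, e))
    score = combined_similarity(token, entry)
    return (entry, score) if score >= threshold else None


# Hypothetical supporting dictionary of valid drug names.
LEXICON = ["paracetamol", "dipirona", "omeprazol"]

if __name__ == "__main__":
    print(best_match("paracetamou", LEXICON))
    print(best_match("omeprasol", LEXICON))
```

In this sketch a phonetically plausible misspelling such as "paracetamou" maps to the same phonetic key as "paracetamol", so its combined score is higher than its plain edit-distance score. A production system would need a linguistically grounded phonetic encoder and an index structure rather than a linear scan over the dictionary.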
ArticleNumber 17
Audience Academic
BackLink https://www.ncbi.nlm.nih.gov/pubmed/31711534$$D View this record in MEDLINE/PubMed
CitedBy_id crossref_primary_10_4018_JOEUC_302893
crossref_primary_10_1016_j_procs_2021_05_069
crossref_primary_10_1016_j_zefq_2020_01_006
crossref_primary_10_3390_app11083659
crossref_primary_10_3390_axioms11100547
crossref_primary_10_1007_s41666_021_00096_6
crossref_primary_10_1109_JBHI_2020_2977925
Cites_doi 10.1016/j.ijmedinf.2010.09.005
10.1186/s12911-016-0255-x
10.1038/nrg3208
10.1109/CIST.2011.6148583
10.1145/375360.375365
10.1109/ICASSP.2010.5495652
10.5210/fm.v12i12.2043
10.1109/TKDE.2011.253
10.1002/j.1538-7305.1950.tb00463.x
ContentType Journal Article
Copyright COPYRIGHT 2019 BioMed Central Ltd.
The Author(s) 2019
DOI 10.1186/s13326-019-0216-2
Discipline Languages & Literatures
EISSN 2041-1480
EndPage 7
ExternalDocumentID oai_doaj_org_article_b34ca03c367a47b78a7b0681af14cda8
PMC6849162
A607361964
31711534
10_1186_s13326_019_0216_2
Genre Research Support, Non-U.S. Gov't
Journal Article
GeographicLocations Portugal
GrantInformation_xml – fundername: Chief Scientist Office
– fundername: Medical Research Council
  grantid: MC_PC_17214
– fundername: Wellcome Trust
– fundername: British Heart Foundation
ISSN 2041-1480
IsDoiOpenAccess true
IsOpenAccess true
IsPeerReviewed true
IsScholarly true
Issue S1
Keywords Phonetic similarity
Misspelt names of drugs
Similarity search
Language English
License Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
OpenAccessLink http://journals.scholarsportal.info/openUrl.xqy?doi=10.1186/s13326-019-0216-2
PMID 31711534
PQID 2314037117
PQPubID 23479
PageCount 7
PublicationDate 2019-11-12
PublicationPlace England
PublicationPlace_xml – name: England
– name: London
PublicationTitle Journal of biomedical semantics
PublicationTitleAlternate J Biomed Semantics
PublicationYear 2019
Publisher BioMed Central Ltd
BioMed Central
BMC
StartPage 17
SubjectTerms Algorithms
Information Storage and Retrieval
Language
Medical Records
Misspelt names of drugs
Natural Language Processing
Pharmaceutical Preparations
Phonetic similarity
Phonetics
Portugal
Similarity search
Title Combining string and phonetic similarity matching to identify misspelt names of drugs in medical records written in Portuguese
URI https://www.ncbi.nlm.nih.gov/pubmed/31711534
https://www.proquest.com/docview/2314037117
https://pubmed.ncbi.nlm.nih.gov/PMC6849162
https://doaj.org/article/b34ca03c367a47b78a7b0681af14cda8
Volume 10