Combining string and phonetic similarity matching to identify misspelt names of drugs in medical records written in Portuguese

Bibliographic Details
Published in	Journal of biomedical semantics Vol. 10; no. S1; article 17 (7 pages)
Main Authors	Tissot, Hegler; Dobson, Richard
Format	Journal Article
Language	English
Published	England: BioMed Central Ltd (BMC), 12.11.2019
Subjects	Algorithms; Information Storage and Retrieval; Language; Medical Records; Misspelt names of drugs; Natural Language Processing; Pharmaceutical Preparations; Phonetic similarity; Phonetics; Portugal; Similarity search
Abstract	There is an increasing amount of unstructured medical data that can be analysed for different purposes. However, information extraction from free-text data may be particularly inefficient in the presence of spelling errors. Existing approaches use string similarity methods, coupled with a supporting dictionary, to search for valid words within a text. However, they are not rich enough to encode both typing and phonetic misspellings. Experimental results showed that a joint string and language-dependent phonetic similarity is more accurate than traditional string distance metrics when identifying misspelt names of drugs in a set of medical records written in Portuguese. We present a hybrid approach to efficiently perform similarity matching that overcomes the loss of information inherent in using either exact match search or string-based similarity search methods.
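Illustrative sketch (not taken from the article): this record does not describe the authors' algorithm, so the Python below only shows the general idea of blending a string similarity with a language-dependent phonetic similarity. The rewrite rules in `PHONETIC_RULES`, the 0.5 weight, the 0.8 threshold, and all function names are hypothetical assumptions chosen for illustration.

```python
import unicodedata

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def similarity(a: str, b: str) -> float:
    """Edit distance normalised to [0, 1]; 1.0 means identical."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

# Hypothetical, deliberately tiny rule set loosely inspired by Portuguese
# spelling; a real system would need a far richer, validated rule set.
PHONETIC_RULES = [("ph", "f"), ("ch", "x"), ("ss", "s"), ("ç", "s"),
                  ("y", "i"), ("k", "c"), ("z", "s"), ("h", "")]

def phonetic_key(word: str) -> str:
    """Map a word to a crude phonetic key (assumption, see rules above)."""
    w = unicodedata.normalize("NFC", word.lower())
    for src, dst in PHONETIC_RULES:
        w = w.replace(src, dst)
    # Strip remaining accents, then collapse doubled letters.
    w = "".join(ch for ch in unicodedata.normalize("NFKD", w)
                if not unicodedata.combining(ch))
    out = []
    for ch in w:
        if not out or out[-1] != ch:
            out.append(ch)
    return "".join(out)

def hybrid_similarity(a: str, b: str, weight: float = 0.5) -> float:
    """Weighted blend of string similarity and phonetic-key similarity."""
    s = similarity(a.lower(), b.lower())
    p = similarity(phonetic_key(a), phonetic_key(b))
    return weight * s + (1.0 - weight) * p

def best_match(token, dictionary, threshold=0.8):
    """Return (entry, score) for the closest dictionary entry, or None."""
    if not dictionary:
        return None
    best = max(dictionary, key=lambda d: hybrid_similarity(token, d))
    score = hybrid_similarity(token, best)
    return (best, score) if score >= threshold else None
```

With a small hypothetical dictionary such as `["omeprazol", "dipirona", "captopril"]`, `best_match("omeprasol", ...)` selects `omeprazol`, because the two spellings share a phonetic key even though they differ by one character.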
ArticleNumber	17
Audience	Academic
Copyright	COPYRIGHT 2019 BioMed Central Ltd. The Author(s) 2019
DOI	10.1186/s13326-019-0216-2
Discipline	Languages & Literatures
EISSN	2041-1480
Genre	Research Support, Non-U.S. Gov't Journal Article
GeographicLocations	Portugal
GrantInformation	Chief Scientist Office; Medical Research Council (grant ID MC_PC_17214); Wellcome Trust; British Heart Foundation
ISSN	2041-1480
IsDoiOpenAccess	true
IsOpenAccess	true
IsPeerReviewed	true
IsScholarly	true
Keywords	Phonetic similarity; Misspelt names of drugs; Similarity search
License	Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
OpenAccessLink	http://journals.scholarsportal.info/openUrl.xqy?doi=10.1186/s13326-019-0216-2
PMID	31711534
PageCount	7
PublicationDate	2019-11-12
PublicationPlace	London, England
PublicationTitleAlternate	J Biomed Semantics
URI	https://www.ncbi.nlm.nih.gov/pubmed/31711534 https://www.proquest.com/docview/2314037117 https://pubmed.ncbi.nlm.nih.gov/PMC6849162 https://doaj.org/article/b34ca03c367a47b78a7b0681af14cda8