Combining string and phonetic similarity matching to identify misspelt names of drugs in medical records written in Portuguese
Published in | Journal of biomedical semantics Vol. 10; no. S1; article 17 (7 pp.) |
---|---|
Main Authors | Tissot, Hegler; Dobson, Richard |
Format | Journal Article |
Language | English |
Published | England; BioMed Central Ltd; 12.11.2019; BioMed Central; BMC |
Subjects | |
Abstract | Background: There is an increasing amount of unstructured medical data that can be analysed for different purposes. However, information extraction from free-text data may be particularly inefficient in the presence of spelling errors. Existing approaches use string similarity methods, coupled with a supporting dictionary, to search for valid words within a text. However, these methods are not rich enough to encode both typing and phonetic misspellings.
Results: Experimental results showed that a joint string and language-dependent phonetic similarity measure is more accurate than traditional string distance metrics when identifying misspelt names of drugs in a set of medical records written in Portuguese.
Conclusion: We present a hybrid approach that efficiently performs similarity matching and overcomes the loss of information inherent in using either exact-match search or string-based similarity search methods. |
---|---|
ArticleNumber | 17 |
Audience | Academic |
Author | Dobson, Richard Tissot, Hegler |
Author_xml | – sequence: 1 givenname: Hegler surname: Tissot fullname: Tissot, Hegler – sequence: 2 givenname: Richard surname: Dobson fullname: Dobson, Richard |
BackLink | https://www.ncbi.nlm.nih.gov/pubmed/31711534$$D View this record in MEDLINE/PubMed |
CitedBy_id | crossref_primary_10_4018_JOEUC_302893 crossref_primary_10_1016_j_procs_2021_05_069 crossref_primary_10_1016_j_zefq_2020_01_006 crossref_primary_10_3390_app11083659 crossref_primary_10_3390_axioms11100547 crossref_primary_10_1007_s41666_021_00096_6 crossref_primary_10_1109_JBHI_2020_2977925 |
Cites_doi | 10.1016/j.ijmedinf.2010.09.005 10.1186/s12911-016-0255-x 10.1038/nrg3208 10.1109/CIST.2011.6148583 10.1145/375360.375365 10.1109/ICASSP.2010.5495652 10.5210/fm.v12i12.2043 10.1109/TKDE.2011.253 10.1002/j.1538-7305.1950.tb00463.x |
ContentType | Journal Article |
Copyright | COPYRIGHT 2019 BioMed Central Ltd. The Author(s) 2019 |
Copyright_xml | – notice: COPYRIGHT 2019 BioMed Central Ltd. – notice: The Author(s) 2019 |
DBID | AAYXX CITATION CGR CUY CVF ECM EIF NPM 7X8 5PM DOA |
DOI | 10.1186/s13326-019-0216-2 |
DatabaseName | CrossRef Medline MEDLINE MEDLINE (Ovid) MEDLINE MEDLINE PubMed MEDLINE - Academic PubMed Central (Full Participant titles) DOAJ Directory of Open Access Journals |
DatabaseTitle | CrossRef MEDLINE Medline Complete MEDLINE with Full Text PubMed MEDLINE (Ovid) MEDLINE - Academic |
DatabaseTitleList | MEDLINE - Academic MEDLINE |
Database_xml | – sequence: 1 dbid: DOA name: DOAJ Directory of Open Access Journals url: https://www.doaj.org/ sourceTypes: Open Website – sequence: 2 dbid: NPM name: PubMed url: https://proxy.k.utb.cz/login?url=http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMed sourceTypes: Index Database – sequence: 3 dbid: EIF name: MEDLINE url: https://proxy.k.utb.cz/login?url=https://www.webofscience.com/wos/medline/basic-search sourceTypes: Index Database |
DeliveryMethod | fulltext_linktorsrc |
Discipline | Languages & Literatures |
EISSN | 2041-1480 |
EndPage | 7 |
ExternalDocumentID | oai_doaj_org_article_b34ca03c367a47b78a7b0681af14cda8 PMC6849162 A607361964 31711534 10_1186_s13326_019_0216_2 |
Genre | Research Support, Non-U.S. Gov't Journal Article |
GeographicLocations | Portugal |
GeographicLocations_xml | – name: Portugal |
GrantInformation_xml | – fundername: Chief Scientist Office – fundername: Medical Research Council grantid: MC_PC_17214 – fundername: Wellcome Trust – fundername: British Heart Foundation |
GroupedDBID | 0R~ 53G 5VS 7X7 88E 8FE 8FG 8FH 8FI 8FJ AAFWJ AAJSJ AASML AAYXX ABDBF ABJCF ABUWG ACGFO ACGFS ACIWK ACPRK ACUHS ADBBV ADRAZ ADUKV AEGXH AENEX AFKRA AFPKN AHBYD AHYZX AIAGR ALIPV ALMA_UNASSIGNED_HOLDINGS AMKLP AMTXH AOIJS BAPOH BAWUL BBNVY BCNDV BENPR BFQNJ BGLVJ BHPHI BMC BPHCQ BVXVI C6C CCPQU CITATION DIK E3Z EBD EBLON EBS EJD ESX F5P FYUFA GROUPED_DOAJ GX1 HCIFZ HMCUK HYE IAO IEA IHR INH INR ITC KQ8 L6V LK8 M1P M48 M7P M7S ML~ M~E O5R O5S OK1 PGMZT PHGZM PHGZT PIMPY PQQKQ PROAC PSQYO PTHSS RBZ RNS ROL RPM RSV SMT SOJ TUS UKHRP CGR CUY CVF ECM EIF NPM PJZUB PPXIY PQGLB PMFND 7X8 5PM PUEGO |
ID | FETCH-LOGICAL-c532t-ac37f49a016f75a085f30f723b7c670fc54e27a5b164010ece668d703cda40923 |
IEDL.DBID | M48 |
ISSN | 2041-1480 |
IngestDate | Wed Aug 27 01:31:43 EDT 2025 Thu Aug 21 18:29:05 EDT 2025 Mon Jul 21 10:21:52 EDT 2025 Tue Jun 17 20:52:46 EDT 2025 Tue Jun 10 20:37:47 EDT 2025 Mon Jul 21 05:43:14 EDT 2025 Tue Jul 01 03:54:47 EDT 2025 Thu Apr 24 23:09:07 EDT 2025 |
IsDoiOpenAccess | true |
IsOpenAccess | true |
IsPeerReviewed | true |
IsScholarly | true |
Issue | S1 |
Keywords | Phonetic similarity Misspelt names of drugs Similarity search |
Language | English |
License | Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. |
LinkModel | DirectLink |
MergedId | FETCHMERGED-LOGICAL-c532t-ac37f49a016f75a085f30f723b7c670fc54e27a5b164010ece668d703cda40923 |
Notes | ObjectType-Article-1 SourceType-Scholarly Journals-1 ObjectType-Feature-2 content type line 23 |
OpenAccessLink | http://journals.scholarsportal.info/openUrl.xqy?doi=10.1186/s13326-019-0216-2 |
PMID | 31711534 |
PQID | 2314037117 |
PQPubID | 23479 |
PageCount | 7 |
ParticipantIDs | doaj_primary_oai_doaj_org_article_b34ca03c367a47b78a7b0681af14cda8 pubmedcentral_primary_oai_pubmedcentral_nih_gov_6849162 proquest_miscellaneous_2314037117 gale_infotracmisc_A607361964 gale_infotracacademiconefile_A607361964 pubmed_primary_31711534 crossref_primary_10_1186_s13326_019_0216_2 crossref_citationtrail_10_1186_s13326_019_0216_2 |
ProviderPackageCode | CITATION AAYXX |
PublicationCentury | 2000 |
PublicationDate | 2019-11-12 |
PublicationDateYYYYMMDD | 2019-11-12 |
PublicationDate_xml | – month: 11 year: 2019 text: 2019-11-12 day: 12 |
PublicationDecade | 2010 |
PublicationPlace | England |
PublicationPlace_xml | – name: England – name: London |
PublicationTitle | Journal of biomedical semantics |
PublicationTitleAlternate | J Biomed Semantics |
PublicationYear | 2019 |
Publisher | BioMed Central Ltd BioMed Central BMC |
Publisher_xml | – name: BioMed Central Ltd – name: BioMed Central – name: BMC |
SSID | ssj0000331083 |
Score | 2.2087874 |
SourceID | doaj pubmedcentral proquest gale pubmed crossref |
SourceType | Open Website Open Access Repository Aggregation Database Index Database Enrichment Source |
StartPage | 17 |
SubjectTerms | Algorithms Information Storage and Retrieval Language Medical Records Misspelt names of drugs Natural Language Processing Pharmaceutical Preparations Phonetic similarity Phonetics Portugal Similarity search |
Title | Combining string and phonetic similarity matching to identify misspelt names of drugs in medical records written in Portuguese |
URI | https://www.ncbi.nlm.nih.gov/pubmed/31711534 https://www.proquest.com/docview/2314037117 https://pubmed.ncbi.nlm.nih.gov/PMC6849162 https://doaj.org/article/b34ca03c367a47b78a7b0681af14cda8 |
Volume | 10 |
hasFullText | 1 |
inHoldings | 1 |
isFullTextHit | |
isPrint | |
linkProvider | Scholars Portal |
openUrl | ctx_ver=Z39.88-2004&ctx_enc=info%3Aofi%2Fenc%3AUTF-8&rfr_id=info%3Asid%2Fsummon.serialssolutions.com&rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Ajournal&rft.genre=article&rft.atitle=Combining+string+and+phonetic+similarity+matching+to+identify+misspelt+names+of+drugs+in+medical+records+written+in+Portuguese&rft.jtitle=Journal+of+biomedical+semantics&rft.au=Tissot%2C+Hegler&rft.au=Dobson%2C+Richard&rft.date=2019-11-12&rft.pub=BioMed+Central+Ltd&rft.issn=2041-1480&rft.eissn=2041-1480&rft.volume=10&rft.issue=Suppl+1&rft_id=info:doi/10.1186%2Fs13326-019-0216-2&rft.externalDocID=A607361964 |
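
The abstract's central idea, scoring a candidate word against a dictionary using both string similarity and phonetic similarity, can be illustrated with a minimal sketch. This is not the authors' method: the record does not describe their algorithm, so the rule table `PHONETIC_RULES`, the equal `weight` of 0.5, and the function names `phonetic_key`, `combined_similarity`, and `best_match` are all assumptions made purely for illustration. The string part uses the standard Levenshtein edit distance; the phonetic part is a deliberately tiny, hypothetical set of spelling substitutions, not a real Portuguese phonetic encoding.

```python
import unicodedata


def levenshtein(a: str, b: str) -> int:
    """Standard edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def string_similarity(a: str, b: str) -> float:
    """Edit distance normalised to a similarity in [0, 1]."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


# Hypothetical toy rules (an assumption, NOT the paper's phonetic model):
# each pair maps a spelling to a rough sound-alike replacement.
PHONETIC_RULES = [("ph", "f"), ("ch", "x"), ("ss", "s"), ("ç", "s"),
                  ("z", "s"), ("y", "i"), ("h", "")]


def phonetic_key(word: str) -> str:
    """Reduce a word to a crude sound-alike key using the toy rules."""
    text = unicodedata.normalize("NFC", word.lower())
    for old, new in PHONETIC_RULES:
        text = text.replace(old, new)
    # Drop remaining accents (combining marks) after decomposition.
    text = "".join(c for c in unicodedata.normalize("NFD", text)
                   if unicodedata.category(c) != "Mn")
    # Collapse runs of the same letter.
    collapsed = []
    for c in text:
        if not collapsed or collapsed[-1] != c:
            collapsed.append(c)
    return "".join(collapsed)


def combined_similarity(a: str, b: str, weight: float = 0.5) -> float:
    """Weighted mix of string similarity and phonetic-key similarity."""
    s = string_similarity(a.lower(), b.lower())
    p = string_similarity(phonetic_key(a), phonetic_key(b))
    return weight * s + (1.0 - weight) * p


def best_match(word: str, dictionary: list[str]) -> tuple[str, float]:
    """Return the dictionary entry with the highest combined score."""
    return max(((entry, combined_similarity(word, entry)) for entry in dictionary),
               key=lambda pair: pair[1])


if __name__ == "__main__":
    drugs = ["dipirona", "digoxina", "dopamina"]
    print(best_match("dypirona", drugs))
```

In this sketch the misspelling "dypirona" shares a phonetic key with "dipirona", so the combined score is higher than the plain string similarity alone. A realistic system would replace the toy rule table with a proper language-dependent phonetic encoding and tune the weight on labelled data.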