Comparative Analysis of Optimal Vectorization Methods in Natural Language Processing for Medical Record Document Classification

Bibliographic Details
Published in Journal of Biomedical Engineering Research, Vol. 43, no. 2, pp. 109-115
Main Author Yoo, Sung Lim (유성림)
Format Journal Article
Language Korean
Published The Korean Society of Medical & Biological Engineering (대한의용생체공학회), 01.04.2022
ISSN 1229-0807
EISSN 2288-9396

Abstract Purpose: Classifying medical records with vectorization techniques is an important task in natural language processing. This study investigated which vectorization technique is best suited to classifying electronic medical records. Materials and methods: 403 electronic medical documents were extracted retrospectively and classified by cosine similarity, computed with Scikit-learn (a Python module for machine learning) in Jupyter Notebook. Document vectors were produced with three vectorization techniques (TF-IDF, latent semantic analysis [LSA], and Word2Vec), and the classification precision of each technique was evaluated. The Kruskal-Wallis test was used to determine whether precision differed significantly among the three techniques. Results: The 403 medical documents covered 41 different diseases, with an average of 9.83 documents per diagnosis (standard deviation = 3.46). Classification precision was 0.78 for TF-IDF, 0.87 for LSA, and 0.79 for Word2Vec. The difference among the three techniques was statistically significant. Conclusions: The results suggest that removing irrelevant information (LSA) is a more efficient vectorization technique than modifying the weights of vectorization models (TF-IDF, Word2Vec) for medical document classification.
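The classification step described in the abstract (vectorize each document, then assign the diagnosis of the most cosine-similar labeled document) can be illustrated with a minimal sketch. This is not the author's code: the study used Scikit-learn, whereas the sketch below is a plain-Python stand-in covering only the TF-IDF variant. The toy corpus, the nearest-document decision rule, and all function names (`tokenize`, `fit_idf`, `tfidf_vector`, `cosine_similarity`, `classify`) are assumptions made for illustration.

```python
import math
from collections import Counter

# Hypothetical toy corpus of (text, diagnosis) pairs; NOT data from the study.
LABELED_DOCS = [
    ("chest pain radiating to left arm with elevated troponin", "myocardial infarction"),
    ("crushing chest pain and st elevation on ecg", "myocardial infarction"),
    ("productive cough fever and lobar consolidation on chest xray", "pneumonia"),
    ("fever cough and crackles with infiltrate on xray", "pneumonia"),
]


def tokenize(text):
    """Lowercase whitespace tokenizer (an assumption; real records need more care)."""
    return text.lower().split()


def fit_idf(token_lists):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    n_docs = len(token_lists)
    doc_freq = Counter(term for tokens in token_lists for term in set(tokens))
    return {term: math.log((1 + n_docs) / (1 + df)) + 1.0
            for term, df in doc_freq.items()}


def tfidf_vector(tokens, idf):
    """Sparse TF-IDF vector as a dict; terms unseen during fitting are dropped."""
    counts = Counter(tokens)
    return {term: count * idf[term] for term, count in counts.items() if term in idf}


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(query, labeled_docs):
    """Return (label, score) of the labeled document most similar to the query."""
    token_lists = [tokenize(text) for text, _ in labeled_docs]
    idf = fit_idf(token_lists)
    doc_vectors = [tfidf_vector(tokens, idf) for tokens in token_lists]
    query_vector = tfidf_vector(tokenize(query), idf)
    scores = [cosine_similarity(query_vector, vec) for vec in doc_vectors]
    best = max(range(len(scores)), key=scores.__getitem__)
    return labeled_docs[best][1], scores[best]


if __name__ == "__main__":
    label, score = classify("patient with fever and cough", LABELED_DOCS)
    print(label, round(score, 3))
```

In a Scikit-learn workflow, the same roles would presumably be played by `TfidfVectorizer` and `sklearn.metrics.pairwise.cosine_similarity`, with `TruncatedSVD` applied to the TF-IDF matrix for the LSA variant. The abstract does not state which classes or parameters were used.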
Discipline Medicine
Engineering
DocumentTitleAlternate Comparative Analysis of Vectorization Techniques in Electronic Medical Records Classification
Keywords Medical records classification
Natural language processing
Vectorization techniques
Latent semantic analysis
Machine learning
SubjectTerms Biomedical engineering
URI http://click.ndsl.kr/servlet/LinkingDetailView?cn=JAKO202217157860612&dbt=JAKO&org_code=O481&site_code=SS1481&service_code=01
https://www.kci.go.kr/kciportal/ci/sereArticleSearch/ciSereArtiView.kci?sereArticleSearchBean.artiId=ART002837174