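A Comparative Analysis of Optimal Vectorization Methods in Natural Language Processing for Classifying Medical Record Documents (의무 기록 문서 분류를 위한 자연어 처리에서 최적의 벡터화 방법에 대한 비교 분석)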
Published in | Journal of biomedical engineering research Vol. 43; no. 2; pp. 109 - 115 |
---|---|
Main Authors | 유성림; Yoo, Sung Lim |
Format | Journal Article |
Language | Korean |
Published | 대한의용생체공학회 (Korean Society of Medical & Biological Engineering), 01.04.2022 |
Subjects | |
ISSN | 1229-0807 2288-9396 |
Abstract | Purpose: Classifying medical records with vectorization techniques is an important task in natural language processing. This study investigated which vectorization techniques are suitable for classifying electronic medical records. Materials and methods: 403 electronic medical documents were extracted retrospectively and classified using cosine similarity calculated with Scikit-learn (a Python module for machine learning) in Jupyter Notebook. Vectors for the medical documents were produced by three vectorization techniques (TF-IDF, latent semantic analysis (LSA), and Word2Vec), and the classification precision of each technique was evaluated. The Kruskal-Wallis test was used to determine whether the three techniques differed significantly. Results: The 403 medical documents covered 41 different diseases, and the mean number of documents per diagnosis was 9.83 (standard deviation = 3.46). The classification precisions were 0.78 (TF-IDF), 0.87 (LSA), and 0.79 (Word2Vec), and the difference among the three techniques was statistically significant. Conclusions: The results suggest that removing irrelevant information (LSA) is a more efficient vectorization technique than modifying the weights of vectorization models (TF-IDF, Word2Vec) for medical document classification. |
---|---|
AbstractList | KCI Citation Count: 0 |
Author | 유성림 Yoo, Sung Lim |
ContentType | Journal Article |
DBID | JDI ACYCR |
DEWEY | 610.28 |
DatabaseName | [Open Access] KoreaScience; Korean Citation Index |
Discipline | Medicine; Engineering |
DocumentTitleAlternate | Comparative Analysis of Vectorization Techniques in Electronic Medical Records Classification |
EISSN | 2288-9396 |
EndPage | 115 |
ExternalDocumentID | oai_kci_go_kr_ARTI_9967362 JAKO202217157860612 |
ISSN | 1229-0807 |
IsOpenAccess | true |
IsPeerReviewed | false |
IsScholarly | false |
Issue | 2 |
Keywords | Medical records classification; Natural language processing; Vectorization techniques; Latent semantic analysis; Machine learning |
Language | Korean |
Notes | KISTI1.1003/JNL.JAKO202217157860612 |
OpenAccessLink | http://click.ndsl.kr/servlet/LinkingDetailView?cn=JAKO202217157860612&dbt=JAKO&org_code=O481&site_code=SS1481&service_code=01 |
PageCount | 7 |
PublicationCentury | 2000 |
PublicationDate | 2022-04 |
PublicationDateYYYYMMDD | 2022-04-01 |
PublicationDecade | 2020 |
PublicationTitle | Journal of biomedical engineering research |
PublicationTitleAlternate | Journal of biomedical engineering research : the official journal of the Korean Society of Medical & Biological Engineering |
PublicationYear | 2022 |
Publisher | 대한의용생체공학회 (Korean Society of Medical & Biological Engineering) |
SourceID | nrf kisti |
SourceType | Open Website; Open Access Repository |
StartPage | 109 |
SubjectTerms | Biomedical engineering (의공학) |
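Title | A Comparative Analysis of Optimal Vectorization Methods in Natural Language Processing for Classifying Medical Record Documents (의무 기록 문서 분류를 위한 자연어 처리에서 최적의 벡터화 방법에 대한 비교 분석) |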
URI | http://click.ndsl.kr/servlet/LinkingDetailView?cn=JAKO202217157860612&dbt=JAKO&org_code=O481&site_code=SS1481&service_code=01 https://www.kci.go.kr/kciportal/ci/sereArticleSearch/ciSereArtiView.kci?sereArticleSearchBean.artiId=ART002837174 |
Volume | 43 |
ispartofPNX | 의공학회지, 2022, 43(2), pp. 109-115 |
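
The abstract describes classifying documents by the cosine similarity of their vectors. The following is a minimal sketch of that idea for the TF-IDF case only, written in pure Python rather than with the Scikit-learn calls the study used. The toy corpus, the leave-one-out nearest-neighbour rule, the smoothed IDF formula, and every function name are assumptions made for illustration; the record does not state how the study assigned labels or computed precision, and LSA and Word2Vec are not shown.

```python
import math
from collections import Counter

# Hypothetical toy corpus of (text, diagnosis) pairs. Not data from the study.
DOCS = [
    ("chest pain shortness of breath troponin elevated", "myocardial infarction"),
    ("crushing chest pain troponin elevated st elevation", "myocardial infarction"),
    ("productive cough fever infiltrate on chest xray", "pneumonia"),
    ("fever cough crackles infiltrate antibiotics started", "pneumonia"),
    ("polyuria polydipsia elevated glucose insulin started", "diabetes"),
    ("elevated glucose hba1c high metformin insulin", "diabetes"),
]


def tfidf_vectors(texts):
    """Return one sparse TF-IDF vector (dict of term -> weight) per text."""
    tokenized = [text.split() for text in texts]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Assumed smoothed IDF: ln((1 + n) / (1 + df)) + 1, always positive.
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def leave_one_out_predictions(docs):
    """Give each document the label of its most similar *other* document."""
    vectors = tfidf_vectors([text for text, _ in docs])
    labels = [label for _, label in docs]
    predictions = []
    for i, vec in enumerate(vectors):
        others = (j for j in range(len(docs)) if j != i)
        best = max(others, key=lambda j: cosine_similarity(vec, vectors[j]))
        predictions.append(labels[best])
    return predictions


def fraction_correct(docs):
    """Share of documents whose predicted label matches the true label."""
    predictions = leave_one_out_predictions(docs)
    hits = sum(pred == label for pred, (_, label) in zip(predictions, docs))
    return hits / len(docs)


if __name__ == "__main__":
    for (text, label), pred in zip(DOCS, leave_one_out_predictions(DOCS)):
        print(f"true={label!r} predicted={pred!r}")
    print("fraction correct:", fraction_correct(DOCS))
```

Swapping `tfidf_vectors` for a function that returns LSA or Word2Vec document vectors would leave the similarity and labelling steps unchanged, which is one plausible way to compare the three techniques on equal footing.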