A Comparative Analysis of Optimal Vectorization Methods in Natural Language Processing for Classifying Medical Record Documents

Bibliographic Details
Published in Journal of biomedical engineering research Vol. 43; no. 2; pp. 109-115
Main Authors 유성림, Yoo, Sung Lim
Format Journal Article
Language Korean
Published 대한의용생체공학회 (The Korean Society of Medical and Biological Engineering) 01.04.2022
ISSN 1229-0807, 2288-9396

Summary: Purpose: Classifying medical records with vectorization techniques is an important task in natural language processing. The purpose of this study was to identify suitable vectorization techniques for classifying electronic medical records. Materials and methods: 403 electronic medical documents were extracted retrospectively and classified using cosine similarity calculated with Scikit-learn (a Python module for machine learning) in Jupyter Notebook. Vectors for the medical documents were produced by three different vectorization techniques (TF-IDF, latent semantic analysis (LSA), and Word2Vec), and the classification precision of each technique was evaluated. The Kruskal-Wallis test was used to determine whether there was a significant difference among the three techniques. Results: The 403 medical documents were relevant to 41 different diseases, and the average number of documents per diagnosis was 9.83 (standard deviation = 3.46). The classification precisions were 0.78 (TF-IDF), 0.87 (LSA), and 0.79 (Word2Vec). There was a statistically significant difference among the three vectorization techniques. Conclusions: The results suggest that removing irrelevant information (LSA) is a more efficient vectorization technique than modifying the weights of vectorization models (TF-IDF, Word2Vec) for medical document classification.
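
The classification step described in the summary (vectorize each document, then compare documents by cosine similarity) can be sketched as follows. This is a minimal, dependency-free illustration and not the study's code: the study used Scikit-learn, whereas this sketch uses only the Python standard library. The smoothed IDF formula, the nearest-neighbor assignment rule, the helper names (`tfidf_vectors`, `cosine_similarity`, `classify_by_nearest`), and the four toy documents are all assumptions made for illustration; the summary does not state these details.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per tokenized document."""
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF (an assumed variant; the summary does not specify the weighting).
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({term: freq * idf[term] for term, freq in tf.items()})
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors; 0.0 if either vector is empty."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify_by_nearest(vectors, labels):
    """Leave-one-out: assign each document the label of its most similar other document."""
    predictions = []
    for i, vec in enumerate(vectors):
        others = (j for j in range(len(vectors)) if j != i)
        best = max(others, key=lambda j: cosine_similarity(vec, vectors[j]))
        predictions.append(labels[best])
    return predictions


# Hypothetical toy corpus; these are invented placeholders, not data from the study.
docs = [
    "chest pain shortness of breath".split(),
    "chest pain radiating to left arm".split(),
    "fever cough sore throat".split(),
    "fever cough runny nose".split(),
]
labels = ["cardiac", "cardiac", "respiratory", "respiratory"]

vectors = tfidf_vectors(docs)
predictions = classify_by_nearest(vectors, labels)
fraction_correct = sum(p == t for p, t in zip(predictions, labels)) / len(labels)
print(predictions, fraction_correct)
```

In this sketch, swapping the vectorization technique would mean replacing only `tfidf_vectors` with another function that produces vectors (for example an LSA projection or averaged word embeddings), while the cosine-similarity comparison stays the same.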
Bibliography: KISTI1.1003/JNL.JAKO202217157860612