Optimization of Word2vec Models for Korean Word Embeddings (한국어 단어 임베딩을 위한 Word2vec 모델의 최적화)

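Word2vec, a word embedding model that has recently gained popularity, is increasingly being applied to Korean language processing. The standard way to evaluate a word2vec model is the analogy test, but until recently no analogy test suitable for Korean had been developed. For this reason, hyperparameters for Korean word2vec models have usually been optimized using similarity tests. This paper optimizes hyperparameters using not only the existing similarity test but also an analogy test that reflects the linguistic features of Korean. As a result, for the training algorithm, the skip-gram method...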

Bibliographic Details
Published in	Journal of Digital Contents Society (디지털콘텐츠학회논문지) Vol. 20; no. 4; pp. 825-833
Main Authors	Hyungsuc Kang (강형석), Janghoon Yang (양장훈)
Format	Journal Article
Language	Korean
Published	한국디지털콘텐츠학회 (Digital Contents Society), 2019-04-01
Subjects	Computer science; Word2vec; Analogy test; Similarity test; Hyperparameter; Word embedding
ISSN	1598-2009 2287-738X
DOI	10.9728/dcs.2019.20.4.825

Abstract	Word2vec, a word embedding model that has recently gained popularity, is increasingly being applied to Korean language processing. The standard way to evaluate a word2vec model is the analogy test, but until recently no analogy test suitable for Korean had been developed. For this reason, hyperparameters for Korean word2vec models have usually been optimized using similarity tests. This paper optimizes hyperparameters using both the existing similarity test and a new analogy test that reflects linguistic features of Korean. The results show that skip-gram outperforms CBOW as the training algorithm, that 300 is an appropriate dimension for the word vectors, and that a context window size between 5 and 10 is appropriate. The paper also finds that it is important to keep the vocabulary to be trained at a reasonable size: the minimum count, the hyperparameter that limits the trained vocabulary according to corpus size, should be set to 1 when the total vocabulary is at most one million words. KCI Citation Count: 24
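
The abstract contrasts similarity tests with analogy tests but does not say how an analogy question is scored. The following is a minimal sketch of one common scoring rule (vector offset with cosine similarity), not the authors' test set or code. The toy three-dimensional vectors, the function names, and the two example questions are all invented for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def solve_analogy(vectors, a, b, c):
    """Answer "a is to b as c is to ?" with the word closest to b - a + c.

    The three question words are excluded from the candidates, as is usual
    for this kind of test.
    """
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


def analogy_accuracy(vectors, questions):
    """Fraction of (a, b, c, expected) questions answered correctly."""
    correct = sum(solve_analogy(vectors, a, b, c) == expected
                  for a, b, c, expected in questions)
    return correct / len(questions)


# Hypothetical toy embeddings; a trained model would supply
# real vectors (300-dimensional in the setting the abstract recommends).
TOY_VECTORS = {
    "왕": [0.9, 0.8, 0.1],    # king
    "여왕": [0.9, 0.1, 0.8],  # queen
    "남자": [0.1, 0.9, 0.1],  # man
    "여자": [0.1, 0.1, 0.9],  # woman
    "사과": [0.2, 0.5, 0.4],  # apple (distractor)
}

# Hypothetical questions in the form (a, b, c, expected answer).
TOY_QUESTIONS = [
    ("남자", "여자", "왕", "여왕"),
    ("왕", "여왕", "남자", "여자"),
]

if __name__ == "__main__":
    print(analogy_accuracy(TOY_VECTORS, TOY_QUESTIONS))
```

A hyperparameter search of the kind described would train one model per setting and compare such accuracy scores. If the models were trained with gensim's `Word2Vec` (an assumption; the record does not name the implementation), the reported settings would correspond to `sg=1`, `vector_size=300`, `window` between 5 and 10, and `min_count=1`.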
DatabaseName	DBPIA; Korean Database (DBpia); Korean Citation Index
DocumentTitleAlternate	Optimization of Word2vec Models for Korean Word Embeddings
IsPeerReviewed	true
IsScholarly	true
PageCount	9
URI	https://www.dbpia.co.kr/journal/articleDetail?nodeId=NODE08010170 https://www.kci.go.kr/kciportal/ci/sereArticleSearch/ciSereArtiView.kci?sereArticleSearchBean.artiId=ART002462856