Fine-Tuned Korean Language Models for Sociolinguistic Research (사회언어학 연구를 위한 한국어 미세조정 언어모델)

Bibliographic Details
Published in: 사회언어학 (The Sociolinguistic Journal of Korea), Vol. 32, No. 3, pp. 41–64
Main Authors: 노강산 (Kangsan Noh), 김수연 (Soo-Yeon Kim), 최혜원 (Hye-Won Choi), 장하연 (Hayeun Jang), 송상헌 (Sanghoun Song)
Format: Journal Article
Language: Korean
Published: The Sociolinguistic Society of Korea (한국사회언어학회), 30.09.2024
ISSN: 1226-4822

Summary: This paper aims to test deep-learning-based Korean language models’ capacity to learn and detect social registers embedded in speech data, specifically age, gender, and regional dialect. A comprehensive understanding of linguistic phenomena requires contextualizing speech based on speakers’ age, gender, and geographic background, along with the processing of syntactic structures. To bridge the gap between human language understanding and model processing, we fine-tuned three representative Korean language models—KR-BERT, KoELECTRA-base, and KLUE-RoBERTa-base—using transcribed data from 4,000 hours of speech by middle-aged and elderly Korean speakers. The findings reveal that KoELECTRA-base outperformed the other two models across all social registers, which is likely attributable to its larger vocabulary and parameter size. Among the dialects, the Jeju dialect showed the highest inference accuracy, likely owing to its distinctiveness, which makes it easier for the models to detect. In addition to documenting the fine-tuning process, we have made our fine-tuned models publicly available to support researchers interested in Korean computational sociolinguistics.
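The abstract compares model performance per social register (e.g., Jeju vs. other dialects). As a minimal sketch of how such a per-register comparison might be computed from a classifier's outputs, the helper below derives overall and per-class accuracy from parallel lists of gold and predicted labels. The dialect labels and predictions here are an invented toy example, not the paper's data, and the function name is our own.

```python
from collections import defaultdict

def accuracy_by_register(gold, predicted):
    """Compute overall accuracy and per-class (per-register) accuracy.

    gold, predicted: parallel lists of labels (e.g., dialect names).
    Returns (overall_accuracy, {label: accuracy}).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for g, p in zip(gold, predicted):
        total[g] += 1
        if g == p:
            correct[g] += 1
    per_class = {label: correct[label] / total[label] for label in total}
    overall = sum(correct.values()) / len(gold)
    return overall, per_class

# Toy example with hypothetical dialect labels:
gold = ["Jeju", "Jeju", "Gyeongsang", "Gyeongsang", "Jeolla", "Jeolla"]
pred = ["Jeju", "Jeju", "Gyeongsang", "Jeolla", "Jeolla", "Gyeongsang"]
overall, per_class = accuracy_by_register(gold, pred)
```

In this toy run, "Jeju" reaches perfect per-class accuracy while the other dialects are confused with each other, mirroring the kind of per-register breakdown the paper reports.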