Japanese Essay Scoring with Generative Pre-Trained Language Models Combined with Soft Labels

Bibliographic Details
Published in: IEICE Transactions on Information and Systems, Vol. E108.D, No. 9, pp. 1095-1107
Main Authors: OKGETHENG, Boago; TAKEUCHI, Koichi
Format: Journal Article
Language: English
Published: The Institute of Electronics, Information and Communication Engineers, 01.09.2025
ISSN: 0916-8532, 1745-1361
DOI: 10.1587/transinf.2024EDP7189

Summary: Automatic Essay Scoring is a crucial task aimed at alleviating the workload of essay graders. Most previous studies have focused on English essays, primarily due to the availability of extensive scored essay datasets. Thus, it remains uncertain whether the models developed for English are applicable to smaller-scale Japanese essay datasets. Recent studies have demonstrated the successful application of BERT-based regression and ranking models. However, downloadable Japanese GPT models, which are larger than BERT, have become available, and it is unclear which types of modeling are appropriate for Japanese essay scoring. In this paper, we explore various aspects of modeling using GPTs, including the type of model (i.e., classification or regression), the size of the GPT models, and the approach to training (e.g., learning from scratch versus continual pre-training). In experiments conducted with Japanese essay datasets, we demonstrate that classification models combined with soft labels are more effective for scoring Japanese essays than simple classification models. Regarding the size of GPT models, we show that smaller models can produce better results depending on the model, prompt type, and essay theme.
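The "soft labels" mentioned in the summary can be sketched as follows. The paper itself does not specify the construction; a common choice, assumed here, is to smooth the gold integer score into a Gaussian-shaped distribution over all score classes and train the classifier against that distribution with a soft cross-entropy loss instead of a one-hot target. The function names and the smoothing width `sigma` are illustrative, not from the paper.

```python
import math

def soft_label(score, num_classes, sigma=1.0):
    """Smooth an integer essay score into a probability distribution
    over all score classes (an assumed Gaussian-kernel construction)."""
    weights = [math.exp(-((c - score) ** 2) / (2 * sigma ** 2))
               for c in range(num_classes)]
    total = sum(weights)
    return [w / total for w in weights]

def soft_cross_entropy(pred_probs, target_probs, eps=1e-12):
    """Cross-entropy between the soft-label target and the model's
    predicted class probabilities; replaces the one-hot loss."""
    return -sum(t * math.log(p + eps)
                for t, p in zip(pred_probs, target_probs))
```

Because adjacent score classes receive non-zero probability mass, the loss penalizes a prediction one point off the gold score less than a prediction far from it, which a one-hot classification loss cannot do.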