Grading the Graders: Comparing Generative AI and Human Assessment in Essay Evaluation

Bibliographic Details
Published in	Teaching of Psychology, Vol. 52, no. 3, pp. 298-304
Main Authors	Wetzler, Elizabeth L., Cassidy, Kenneth S., Jones, Margaret J., Frazier, Chelsea R., Korbut, Nickalous A., Sims, Chelsea M., Bowen, Shari S., Wood, Michael
Format	Journal Article
Language	English
Published	Los Angeles, CA: SAGE Publications, 01.07.2025; Taylor & Francis Ltd
Subjects	Artificial intelligence; Chatbots; Essays; Grading; Psychology; Writing; Instruction; human instructor; ChatGPT; scoring; generative AI; AI bias; grading bias; essay grading; educational assessment; artificial intelligence
Summary:	Background: Generative artificial intelligence (AI) represents a potentially powerful, time-saving tool for grading student essays. However, little is known about how AI-generated essay scores compare to human instructor scores. Objective: The purpose of this study was to compare the essay grading scores produced by AI with those of human instructors to explore similarities and differences. Method: Eight human instructors and two versions of OpenAI's ChatGPT (3.5 and 4o) independently graded 186 deidentified student essays from an introductory psychology course using a detailed rubric. Scoring consistency was analyzed using Bland-Altman and regression analyses. Results: AI scores for ChatGPT 3.5 were, on average, higher than human scores, although average scores for ChatGPT 4o and human scores were more similar. Notably, AI grading for both versions was more lenient than human grading at lower performance levels and stricter at higher levels, reflecting proportional bias. Conclusion: Although AI may offer potential for supporting grading processes, the pattern of results suggests that AI and human instructors differ in how they score using the same rubric. Teaching Implications: Results suggest that educators should be aware that AI grading of psychology writing assignments that require reflection or critical thinking may differ markedly from scores generated by human instructors.
ISSN:	0098-6283, 1532-8023
DOI:	10.1177/00986283241282696
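
Illustration: The summary refers to Bland-Altman analysis and proportional bias without showing what those quantities are. The sketch below is a minimal, hypothetical example of how such a comparison could be computed; it is not the authors' code, and the scores are made-up illustrative numbers, not data from the study. The function name `bland_altman` and the returned keys are assumptions chosen for this example. It computes the mean difference (bias), the conventional 95% limits of agreement (bias ± 1.96 SD), and the ordinary least squares slope of the AI-minus-human difference regressed on the pairwise mean score, which is one common way to check for proportional bias.

```python
from statistics import mean, stdev


def bland_altman(ai_scores, human_scores):
    """Bland-Altman summary for paired scores (illustrative sketch only)."""
    # Per-essay difference (AI minus human) and per-essay mean of the two scores.
    diffs = [a - h for a, h in zip(ai_scores, human_scores)]
    means = [(a + h) / 2 for a, h in zip(ai_scores, human_scores)]

    bias = mean(diffs)   # average AI-minus-human difference
    sd = stdev(diffs)    # sample standard deviation of the differences
    limits = (bias - 1.96 * sd, bias + 1.96 * sd)  # 95% limits of agreement

    # OLS regression of the differences on the pairwise means.
    # A nonzero slope suggests proportional bias.
    m_bar = mean(means)
    sxx = sum((m - m_bar) ** 2 for m in means)
    sxy = sum((m - m_bar) * (d - bias) for m, d in zip(means, diffs))
    slope = sxy / sxx
    intercept = bias - slope * m_bar

    return {"bias": bias, "limits": limits, "slope": slope, "intercept": intercept}


# Hypothetical scores, invented for illustration (NOT study data):
# the AI grader is higher than the human at the low end, lower at the high end.
human = [50, 60, 70, 80, 90]
ai = [60, 66, 72, 78, 84]

result = bland_altman(ai, human)
print(result)
```

With these invented numbers the slope is negative: the AI-minus-human difference shrinks and then reverses sign as the average score rises, which is the kind of pattern the summary describes as more lenient at lower performance levels and stricter at higher ones.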