Automatic item generation in various STEM subjects using large language model prompting

Bibliographic Details
Published in: Computers and Education: Artificial Intelligence, Vol. 8, p. 100344
Main Authors: Chan, Kuang Wen; Ali, Farhan; Park, Joonhyeong; Sham, Kah Shen Brandon; Tan, Erdalyn Yeh Thong; Chong, Francis Woon Chien; Qian, Kun; Sze, Guan Kheng
Format: Journal Article
Language: English
Published: Elsevier Ltd, 01.06.2025
Summary: Large language models (LLMs) that power chatbots such as ChatGPT have capabilities across numerous domains. Teachers and students have increasingly been using chatbots in science, technology, engineering, and mathematics (STEM) subjects in various ways, including for assessment purposes. However, there has been a lack of systematic investigation into LLMs' capabilities and limitations in automatically generating items for STEM subject assessments, especially given that LLMs can hallucinate and may risk promoting misconceptions and hindering conceptual understanding. To address this, we systematically investigated LLMs' conceptual understanding and quality of working in generating question and answer pairs across various STEM subjects. We used prompt engineering on GPT-3.5 and GPT-4 with three approaches: standard prompting, standard prompting augmented with chain-of-thought prompting using worked examples with steps, and chain-of-thought prompting combined with coding language. The question and answer pairs were generated at the pre-university level in three STEM subjects (chemistry, physics, and mathematics) and evaluated by subject-matter experts. Overall, we found that LLMs generated quality questions when using chain-of-thought prompting for both GPT-3.5 and GPT-4, and when using chain-of-thought prompting with coding language for GPT-4. However, there were varying patterns in generating multistep answers, with differences in final-answer and intermediate-step accuracy. An interesting finding was that chain-of-thought prompting with coding language for GPT-4 significantly outperformed the other approaches in generating correct final answers while demonstrating moderate accuracy in generating multistep answers correctly. In addition, through qualitative analysis, we identified domain-specific prompting patterns across the three STEM subjects. We then discussed how our findings aligned with, contradicted, and contributed to the current body of knowledge on automatic item generation research using LLMs, and the implications for teachers using LLMs to generate STEM assessment items.
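
For illustration, the kind of chain-of-thought prompting with coding language described in the summary could be implemented roughly as follows. This is a minimal sketch assuming the OpenAI Python SDK (v1.x); the model choice, the worked example, the prompt wording, and the generate_item helper are assumptions for illustration and are not the authors' actual prompts or materials.

    # Illustrative sketch only: generate one STEM item with step-by-step working
    # and accompanying code, loosely mirroring the prompting approaches in the
    # abstract. Prompt text, topic, and model choice are assumptions.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    WORKED_EXAMPLE = (
        "Example question: A 2.0 kg block slides from rest down a frictionless "
        "incline of height 5.0 m. What is its speed at the bottom?\n"
        "Step 1: Apply energy conservation: m*g*h = 0.5*m*v^2.\n"
        "Step 2: Solve for v: v = sqrt(2*g*h) = sqrt(2*9.81*5.0), about 9.9 m/s.\n"
        "Final answer: 9.9 m/s."
    )

    def generate_item(topic: str, model: str = "gpt-4") -> str:
        """Ask the model for one pre-university question with step-by-step
        working and a short Python snippet that computes the final answer."""
        prompt = (
            f"You are writing a pre-university physics assessment item on {topic}.\n"
            f"Follow the style of this worked example:\n{WORKED_EXAMPLE}\n\n"
            "Generate ONE new question, solve it step by step, and finally "
            "provide a short Python snippet that computes the final answer."
        )
        response = client.chat.completions.create(
            model=model,
            temperature=0.7,
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content

    if __name__ == "__main__":
        print(generate_item("projectile motion"))

Dropping the worked example from the prompt would correspond to the standard prompting condition, and omitting the request for a code snippet would correspond to chain-of-thought prompting alone; the study's evaluation of the generated items by subject-matter experts is outside the scope of this sketch.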
ISSN: 2666-920X
DOI: 10.1016/j.caeai.2024.100344