TY - JOUR
T1 - Assessing the quality of automatic-generated short answers using GPT-4
AU - Rodrigues, Luiz
AU - Dwan Pereira, Filipe
AU - Cabral, Luciano
AU - Gašević, Dragan
AU - Ramalho, Geber
AU - Ferreira Mello, Rafael
N1 - Publisher Copyright:
© 2024 The Authors
PY - 2024/12
Y1 - 2024/12
N2 - Open-ended assessments play a pivotal role in enabling instructors to evaluate student knowledge acquisition and provide constructive feedback. Integrating large language models (LLMs) such as GPT-4 in educational settings presents a transformative opportunity for assessment methodologies. However, existing literature on LLMs addressing open-ended questions lacks breadth, relying on limited data or overlooking question difficulty levels. This study evaluates GPT-4's proficiency in responding to open-ended questions spanning diverse topics and cognitive complexities in comparison to human responses. To facilitate this assessment, we generated a dataset of 738 open-ended questions across Biology, Earth Sciences, and Physics and systematically categorized it based on Bloom's Taxonomy. Each question included eight human-generated responses and two from GPT-4. The outcomes indicate GPT-4's superior performance over humans, encompassing both native and non-native speakers, irrespective of gender. Nevertheless, this advantage was not sustained in 'remembering' or 'creating' questions aligned with Bloom's Taxonomy. These results highlight GPT-4's potential for underpinning advanced question-answering systems, its promising role in supporting non-native speakers, and its capacity to augment teacher assistance in assessments. However, limitations in nuanced argumentation and creativity underscore areas necessitating refinement in these models, guiding future research toward bolstering pedagogical support.
AB - Open-ended assessments play a pivotal role in enabling instructors to evaluate student knowledge acquisition and provide constructive feedback. Integrating large language models (LLMs) such as GPT-4 in educational settings presents a transformative opportunity for assessment methodologies. However, existing literature on LLMs addressing open-ended questions lacks breadth, relying on limited data or overlooking question difficulty levels. This study evaluates GPT-4's proficiency in responding to open-ended questions spanning diverse topics and cognitive complexities in comparison to human responses. To facilitate this assessment, we generated a dataset of 738 open-ended questions across Biology, Earth Sciences, and Physics and systematically categorized it based on Bloom's Taxonomy. Each question included eight human-generated responses and two from GPT-4. The outcomes indicate GPT-4's superior performance over humans, encompassing both native and non-native speakers, irrespective of gender. Nevertheless, this advantage was not sustained in 'remembering' or 'creating' questions aligned with Bloom's Taxonomy. These results highlight GPT-4's potential for underpinning advanced question-answering systems, its promising role in supporting non-native speakers, and its capacity to augment teacher assistance in assessments. However, limitations in nuanced argumentation and creativity underscore areas necessitating refinement in these models, guiding future research toward bolstering pedagogical support.
KW - Automatic answer generation
KW - GPT-4
KW - Large language models
KW - Natural language processing
KW - Question-answering
UR - http://www.scopus.com/inward/record.url?scp=85197064688&partnerID=8YFLogxK
U2 - 10.1016/j.caeai.2024.100248
DO - 10.1016/j.caeai.2024.100248
M3 - Article
AN - SCOPUS:85197064688
SN - 2666-920X
VL - 7
JO - Computers and Education: Artificial Intelligence
JF - Computers and Education: Artificial Intelligence
M1 - 100248
ER -