Assessing the Reliability and Validity of Large Language Models for Automated Assessment of Student Essays in Higher Education

cs.AI updates on arXiv.org, the day before yesterday, 19:10

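This paper examines the reliability and validity of five large language models (Claude 3.5, DeepSeek v2, Gemini 2.5, GPT-4, and Mistral 24B) for automated scoring in a higher-education setting. Evaluating Italian-language student essays, it finds that the models' scores are unstable and that human-model scoring agreement is low.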

arXiv:2508.02442v1 Announce Type: cross Abstract: This study investigates the reliability and validity of five advanced Large Language Models (LLMs), Claude 3.5, DeepSeek v2, Gemini 2.5, GPT-4, and Mistral 24B, for automated essay scoring in a real-world higher education context. A total of 67 Italian-language student essays, written as part of a university psychology course, were evaluated using a four-criterion rubric (Pertinence, Coherence, Originality, Feasibility). Each model scored all essays across three prompt replications to assess intra-model stability. Human-LLM agreement was consistently low and non-significant (Quadratic Weighted Kappa), and within-model reliability across replications was similarly weak (median Kendall's W
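The abstract names two agreement statistics: Quadratic Weighted Kappa for human-LLM agreement and Kendall's W for stability across prompt replications. The following is a minimal sketch of the textbook definitions of both, not the authors' analysis code. The function names, the 1-to-5 rating scale, and the toy scores are assumptions made for illustration only; the source does not state the scale or provide data.

```python
from collections import Counter


def quadratic_weighted_kappa(a, b, min_rating, max_rating):
    """Cohen's kappa with quadratic weights for two lists of integer ratings.

    Assumes the ratings are not all identical (otherwise the denominator is 0).
    """
    k = max_rating - min_rating + 1
    n = len(a)
    # Observed confusion matrix between the two raters.
    observed = [[0] * k for _ in range(k)]
    for x, y in zip(a, b):
        observed[x - min_rating][y - min_rating] += 1
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    num = den = 0.0
    for i in range(k):
        for j in range(k):
            w = (i - j) ** 2 / (k - 1) ** 2        # quadratic disagreement weight
            num += w * observed[i][j]               # observed disagreement
            den += w * hist_a[i] * hist_b[j] / n    # disagreement expected by chance
    return 1.0 - num / den


def average_ranks(scores):
    """Rank scores from 1..n, giving tied scores the mean of their ranks."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = avg
        i = j + 1
    return ranks


def kendalls_w(ratings):
    """Kendall's W with tie correction.

    ratings: m lists (one per replication), each holding n item scores.
    Assumes at least one replication has non-constant scores.
    """
    m, n = len(ratings), len(ratings[0])
    rank_rows = [average_ranks(r) for r in ratings]
    totals = [sum(row[i] for row in rank_rows) for i in range(n)]
    mean_total = m * (n + 1) / 2
    s = sum((t - mean_total) ** 2 for t in totals)
    # Tie correction: sum of (t^3 - t) over every group of tied scores.
    ties = sum(c ** 3 - c for r in ratings for c in Counter(r).values())
    return 12 * s / (m ** 2 * (n ** 3 - n) - m * ties)


# Toy, made-up scores on a hypothetical 1-5 scale (not data from the study).
human = [3, 4, 2, 5, 1]
model = [4, 4, 2, 3, 2]
qwk = quadratic_weighted_kappa(human, model, 1, 5)
w = kendalls_w([[3, 4, 2, 5, 1], [4, 4, 2, 3, 2], [2, 5, 3, 4, 1]])
```

In the study's design, `kendalls_w` would receive three rows (one per prompt replication) of 67 scores each for a given model and criterion, while `quadratic_weighted_kappa` would compare a model's scores with the human scores. Both statistics equal 1 under perfect agreement; values near 0 indicate agreement no better than chance.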
