cs.AI updates on arXiv.org, two days ago at 19:10
Assessing the Reliability and Validity of Large Language Models for Automated Assessment of Student Essays in Higher Education

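This paper examines the reliability and validity of five large language models (Claude 3.5, DeepSeek v2, Gemini 2.5, GPT-4, and Mistral 24B) for automated scoring in a higher-education setting. Based on evaluations of Italian-language student essays, it finds that the models' scores are unstable across replications and that human-LLM agreement is low.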

arXiv:2508.02442v1 Announce Type: cross Abstract: This study investigates the reliability and validity of five advanced Large Language Models (LLMs), Claude 3.5, DeepSeek v2, Gemini 2.5, GPT-4, and Mistral 24B, for automated essay scoring in a real-world higher education context. A total of 67 Italian-language student essays, written as part of a university psychology course, were evaluated using a four-criterion rubric (Pertinence, Coherence, Originality, Feasibility). Each model scored all essays across three prompt replications to assess intra-model stability. Human-LLM agreement was consistently low and non-significant (Quadratic Weighted Kappa), and within-model reliability across replications was similarly weak (median Kendall's W
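The two statistics named in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the function names, the 1 to 5 rating scale, and all scores below are hypothetical, chosen only to show how Quadratic Weighted Kappa (agreement between two raters on an ordinal scale) and Kendall's W (concordance across several scoring runs) are typically computed. The Kendall's W shown here omits the correction for tied ranks, which a real analysis would likely need.

```python
def quadratic_weighted_kappa(a, b, min_rating, max_rating):
    """Agreement between two raters on an integer ordinal scale."""
    k = max_rating - min_rating + 1
    n = len(a)
    # Observed confusion matrix between rater a (rows) and rater b (columns).
    observed = [[0] * k for _ in range(k)]
    for x, y in zip(a, b):
        observed[x - min_rating][y - min_rating] += 1
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    num = den = 0.0
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2 / (k - 1) ** 2  # quadratic penalty
            num += weight * observed[i][j]
            den += weight * hist_a[i] * hist_b[j] / n  # expected by chance
    return 1.0 - num / den


def average_ranks(scores):
    """Rank scores from 1..n, giving tied values their average rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1
        for t in range(pos, end + 1):
            ranks[order[t]] = avg
        pos = end + 1
    return ranks


def kendalls_w(replications):
    """Kendall's W for m scoring runs over the same n essays (no tie correction)."""
    m = len(replications)
    n = len(replications[0])
    rank_rows = [average_ranks(run) for run in replications]
    totals = [sum(row[i] for row in rank_rows) for i in range(n)]
    mean_total = m * (n + 1) / 2
    s = sum((t - mean_total) ** 2 for t in totals)
    return 12 * s / (m ** 2 * (n ** 3 - n))


# Hypothetical scores for five essays on an assumed 1-5 scale.
human = [1, 2, 3, 4, 5]
model_runs = [[1, 2, 3, 4, 5], [1, 2, 3, 4, 5], [1, 2, 3, 4, 5]]
print(quadratic_weighted_kappa(human, model_runs[0], 1, 5))
print(kendalls_w(model_runs))
```

Both statistics equal 1 when the ratings or rankings coincide exactly; values near 0 indicate agreement no better than chance (for kappa) or no concordance across runs (for W), which is the pattern the abstract reports.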


Related tags

Large language models, automated scoring, higher education, reliability, validity