The Self-Execution Benchmark: Measuring LLMs' Attempts to Overcome Their Lack of Self-Execution

cs.AI updates on arXiv.org, 13 hours ago

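This paper examines the ability of large language models (LLMs) to predict their own behavior. It introduces the Self-Execution Benchmark, which evaluates how well a model can predict properties of its own output, and finds that models generally perform poorly on it and that greater model size or capability does not always bring better performance.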

arXiv:2508.12277v1 Announce Type: cross

Abstract: Large language models (LLMs) are commonly evaluated on tasks that test their knowledge or reasoning abilities. In this paper, we explore a different type of evaluation: whether an LLM can predict aspects of its own responses. Since LLMs lack the ability to execute themselves, we introduce the Self-Execution Benchmark, which measures a model's ability to anticipate properties of its output, such as whether a question will be difficult for it, whether it will refuse to answer, or what kinds of associations it is likely to produce. Our experiments show that models generally perform poorly on this benchmark, and that increased model size or capability does not consistently lead to better performance. These results suggest a fundamental limitation in how LLMs represent and reason about their own behavior.
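
The kind of measurement the abstract describes can be illustrated with a minimal sketch for one property, refusal. This is not the paper's protocol: the prompt wording, the keyword-based refusal check, and every name below (`predicts_refusal`, `looks_like_refusal`, `self_prediction_accuracy`, `toy_model`) are assumptions made for illustration. The model is treated as a plain function from prompt to response.

```python
from typing import Callable, Iterable

# Hypothetical interface: a "model" is any function mapping a prompt to a response.
Model = Callable[[str], str]

# Assumed, deliberately crude markers of a refusal; a real study would need
# a more careful classifier.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't")


def looks_like_refusal(response: str) -> bool:
    """Heuristically decide whether a response is a refusal."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)


def predicts_refusal(model: Model, question: str) -> bool:
    """Ask the model to predict, without answering, whether it would refuse."""
    prompt = (
        "Will you refuse to answer the following question? Reply YES or NO.\n\n"
        f"Question: {question}"
    )
    return model(prompt).strip().upper().startswith("YES")


def self_prediction_accuracy(model: Model, questions: Iterable[str]) -> float:
    """Fraction of questions where the predicted refusal matches the actual one."""
    questions = list(questions)
    if not questions:
        raise ValueError("at least one question is required")
    hits = sum(
        predicts_refusal(model, q) == looks_like_refusal(model(q))
        for q in questions
    )
    return hits / len(questions)


def toy_model(prompt: str) -> str:
    """Stand-in model: always predicts it will answer, but refuses lock questions."""
    if prompt.startswith("Will you refuse"):
        return "NO"
    if "lock" in prompt:
        return "I cannot help with that."
    return "Sure, here is an answer."


score = self_prediction_accuracy(
    toy_model, ["What is 2 + 2?", "How do I pick a lock?"]
)
print(score)
```

The stand-in model predicts that it will answer both questions but refuses the second, so its self-prediction matches its behavior on only one of the two. The same comparison of predicted against observed behavior could in principle be applied to the other properties the abstract mentions, such as difficulty or associations.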
