New Paper: It is time to move on from MCQs for LLM Evaluations

This post examines the limitations of using multiple-choice questions (MCQs) for AI benchmarking. The research finds that, in some cases, language models can pass MCQ evaluations without understanding the question at all, especially on multimodal datasets. To address this, the researchers propose a new evaluation method: answer matching. It uses a language model to match a model's generated answer against the ground-truth answer, and the results show that even with small open-source models, answer matching is more accurate than MCQs while costing less. The work also stresses the importance of moving from MCQs to generative evaluations, which can change model rankings and open up room for improvement on the datasets.

🧐 **Limitations of MCQs:** Traditional multiple-choice (MCQ) evaluation is flawed: language models can guess the answer by looking only at the options, without genuinely understanding the question, which distorts evaluation results. This shortcut is especially pronounced in benchmarks such as MMLU-Pro and SuperGPQA, and it even appears in multimodal benchmarks, where models can answer without looking at the image.

💡 **Advantages of answer matching:** By contrast, answer matching has the language model generate a free-form answer and then matches that response against the ground-truth answer, giving a more faithful assessment of the model's generative ability. The approach works well in verifiable domains such as MATH, and on free-form reasoning benchmarks such as MMLU-Pro and GPQA-Diamond it achieves near-perfect alignment even with small models.

💰 **Cost effectiveness:** Answer matching is not only more accurate than MCQ evaluation but also cheaper. The study shows that answer matching with a small model such as Qwen3-4B costs less than MCQ evaluation, especially once chain-of-thought (CoT) is enabled, since MCQ costs grow as models produce longer outputs.

Published on July 6, 2025 11:48 AM GMT

New paper: Answer Matching Outperforms Multiple Choice for Language Model Evaluations.

TLDR: Using MCQs for AI benchmarking is problematic--you can guess the answer without even looking at the question (in multimodal MCQ datasets, without the image!). We knew this, but there didn't seem to be any alternative. We now show that language models are good enough: using small open-source models to match generative responses against a ground-truth reference answer works much better, and turns out to be cheaper than MCQ evals!

Discriminative Shortcuts in MCQ

We found that MCQs can be solved without even knowing the question. Looking at just the choices is enough to guess the answer and reach high accuracies. This affects popular benchmarks like MMLU-Pro and SuperGPQA, and even "multimodal" benchmarks like MMMU-Pro, which can be solved without ever looking at the image.

Such choice-only shortcuts are hard to fix. We find that prior attempts at fixing them--GoldenSwag (for HellaSwag) and TruthfulQA v2--ended up worsening the problem. MCQs are inherently a discriminative task, requiring only that the model pick the correct choice among a few given options. Instead, we should evaluate language models for the generative capabilities they are actually used for. In the Appendix, we discuss how discrimination is easier than even verification, let alone generation.

Shortcuts are exacerbated by the recent trend of using LLMs to create MCQs. However, they are still significant in MMLU, which consists of human-designed exams like the GRE and USMLE. These results are with a Qwen3-4B-based classifier, but even DeBERTa gets high shortcut accuracy.
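To make the probe concrete, here is a minimal sketch of one way to run a choice-only baseline with a prompted chat model. The endpoint, model name, and prompt wording are illustrative assumptions, not the exact classifier setup used in the paper.

```python
from openai import OpenAI

# Assumed setup: an OpenAI-compatible server (e.g. vLLM) serving Qwen3-4B
# locally. Endpoint, model name, and prompt are illustrative, not the
# paper's exact classifier setup.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def choice_only_guess(choices: list[str], model: str = "Qwen/Qwen3-4B") -> str:
    """Ask the model to pick an answer from the options alone, question hidden."""
    options = "\n".join(f"{chr(65 + i)}. {c}" for i, c in enumerate(choices))
    prompt = (
        "Below are only the answer options of a multiple-choice question; "
        "the question itself is hidden. Guess which option is most likely the "
        "intended correct answer. Reply with a single letter.\n\n" + options
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=4,
        temperature=0.0,
    )
    return resp.choices[0].message.content.strip()[:1]

# Shortcut accuracy = fraction of questions where this guess matches the gold letter.
```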

But how do we grade generative responses outside "verifiable domains" like code and math? So many paraphrases are valid answers... 

Generative Evaluations with Answer Matching

We show that a scalable alternative--Answer Matching--works surprisingly well. It's simple--elicit free-form responses (without showing the choices) to existing benchmark questions that are specific enough to have a semantically unique answer. Then, use an LM to match the response against the ground-truth answer.
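A minimal sketch of the matching step, assuming an OpenAI-compatible endpoint serving a small open model as the matcher; the prompt below is an illustrative paraphrase of the idea, not the paper's exact matcher prompt.

```python
from openai import OpenAI

# Assumed setup: an OpenAI-compatible server (e.g. vLLM) serving Qwen3-4B
# as the matcher model.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def answers_match(question: str, response: str, reference: str,
                  model: str = "Qwen/Qwen3-4B") -> bool:
    """Return True if the free-form response matches the reference answer."""
    prompt = (
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Model response: {response}\n\n"
        "Does the model response arrive at the same answer as the reference, "
        "allowing for paraphrasing and formatting differences? "
        "Answer with exactly 'yes' or 'no'."
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=4,
        temperature=0.0,
    )
    return resp.choices[0].message.content.strip().lower().startswith("yes")
```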

We conduct a meta-evaluation comparing Answer Matching to LLM-as-a-Judge without reference answers, MCQ, and some non-discriminative variants of MCQ used recently, like MC-Verify (e.g. in the Virology Capabilities Test) and MC-Cloze. We first compare the evaluations in a domain where ground-truth verification is possible, MATH, using the recently released MATH-MC variant for comparisons.

Note how the non-discriminative styles of MCQ show reduced accuracy, similar to generative evaluation (Left). But accuracy is not all you need from evals. They should also be aligned at the sample level with ground-truth verification, so we can study where models are right or wrong. From the alignment plot (Right), it becomes clear that answer matching tracks ground-truth verification far more closely than MCQ and its variants.
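For readers who want to reproduce this kind of check, here is one way to quantify sample-level alignment between an evaluation method's per-question verdicts and ground-truth verification; the paper's exact alignment metric may differ, so treat this as an illustrative sketch.

```python
from sklearn.metrics import cohen_kappa_score

def alignment(method_verdicts: list[bool], truth_verdicts: list[bool]) -> dict:
    """Raw per-sample agreement plus a chance-corrected statistic."""
    agreement = sum(m == t for m, t in zip(method_verdicts, truth_verdicts)) / len(truth_verdicts)
    kappa = cohen_kappa_score(truth_verdicts, method_verdicts)  # corrects for chance agreement
    return {"agreement": agreement, "kappa": kappa}

# e.g. per-question verdicts from answer matching vs. ground-truth verification on MATH
print(alignment([True, False, True, True], [True, False, False, True]))
```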

But we don't need LMs for verifiable domains. Rather, we need them for tasks with unconstrained answers, prone to "paraphrases" that are semantically equivalent. So we manually grade generative responses on free-form versions of frontier reasoning benchmarks with arbitrary textual answers: MMLU-Pro and GPQA-Diamond. For human grading, we freely use the internet, calculators, and other such tools to increase accuracy.

Answer Matching outcomes give near-perfect alignment, even with small (recent) models like Qwen3-4B. In contrast, LLM-as-a-judge, even with frontier reasoning models like o4-mini, fares much worse. This is because without the reference answer, the model is tasked with verification, which is harder than what answer matching requires--paraphrase detection--a skill modern language models have aced.

Impacts on Benchmarking

This is not merely a theoretical concern. Switching from MCQ to generative evaluations changes model rankings. Further, accuracies drop, and datasets that seem saturated start showing room for improvement.

A common rebuttal is that LLM-based evaluations are expensive. We show this is not true anymore. We don't need frontier API models; for answer matching, Qwen3-4B might be enough. Surprisingly, with CoT enabled, MCQ costs more, as models give longer outputs.
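As a back-of-the-envelope way to see this, evaluation cost is dominated by output tokens times the per-token price. The token counts and price in the sketch below are placeholders to plug measurements into, not figures from the paper.

```python
# Rough cost model: cost ≈ questions * average output tokens * price per output token.
# All numbers below are placeholders, not measurements from the paper.
def eval_cost(n_questions: int, avg_out_tokens: float, price_per_mtok: float) -> float:
    return n_questions * avg_out_tokens * price_per_mtok / 1e6

N, PRICE = 12_000, 0.60  # hypothetical benchmark size and $/M output tokens
print("MCQ with CoT:   ", eval_cost(N, avg_out_tokens=800, price_per_mtok=PRICE))
print("Free-form + CoT:", eval_cost(N, avg_out_tokens=600, price_per_mtok=PRICE))
print("Matcher verdict:", eval_cost(N, avg_out_tokens=60,  price_per_mtok=PRICE))
# Answer matching pays for the last two (and the matcher can run on a cheap
# small model); MCQ with CoT pays for the first. Which wins depends on the
# measured output lengths -- the paper finds MCQ ends up costlier.
```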

So instead of creating harder MCQs, we should focus our efforts on creating questions suited to answer matching, much like SimpleQA, GAIA, and parts of HLE. For example, either make questions specific enough to have a single semantic answer (LLMs can handle paraphrasing), or list the multiple correct solutions that are possible.

We release our code and annotations for the subsets of MMLU-Pro and GPQA that have a unique semantic answer.


