Closed-ended questions aren't as hard as you think

Published on February 19, 2025 3:53 AM GMT

Summary

In this short post, I argue that closed-ended questions, even those of arbitrary difficulty, are not as difficult as they may appear. In particular, I argue that the benchmark HLE (Humanity's Last Exam) is probably easier than it may first seem.[1]

Specifically, I argue:

1. Selection bias in the crowdsourcing process causes questions to skew easy, and problems that are still partially open (e.g., with a gap between lower and upper bounds) are especially hard to turn into closed-ended questions.
2. Easy jargon-heavy questions are overrepresented, while difficult but deceptively simple questions are underrepresented.

My background is in mathematics, so in this post I'll be focusing on issues that arise in math question-writing. (Currently, HLE is 41% math questions.) 

#1 Selection bias causes questions to skew easy

HLE questions are crowdsourced. They are written by crowd workers (e.g., random PhD students with a free evening), and evaluated by a noisy process (time-constrained Scale.AI employees and LLMs). 

Crowd workers are incentivized to get as much prize money as possible. Initially, HLE offered $500,000 of prize money: $5,000 each to the top 50 submissions, and $500 each to the next 500 best submissions. Most people are risk averse. Given the structure of the prizes, why sink all of your time into writing one really good question, when you can instead submit several mediocre questions (and potentially get multiple prizes)?

Thus, the median person probably submitted a couple of "nice" questions they had on hand: questions that are easy to state and easy to write a solution for.[2] They probably didn't go through the difficult exercise of thinking: what are some of the more thorny concepts in my subfield? How might I turn these into a tricky closed-ended question?
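To make the incentive concrete, here is a minimal expected-value sketch in Python. This is not from the original post: the prize amounts match the structure described above, but the per-question win probabilities are invented purely for illustration.

    # Toy expected-value sketch of the question-writer's incentive.
    # Prize amounts are from the post; all win probabilities are made up.

    PRIZE_TOP = 5_000        # prize for a top-50 submission
    PRIZE_RUNNER_UP = 500    # prize for a next-500 submission

    def expected_payout(n_questions: int, p_top: float, p_runner_up: float) -> float:
        """Expected prize money from submitting n questions, each independently
        winning a top-50 prize with probability p_top and a next-500 prize
        with probability p_runner_up."""
        return n_questions * (p_top * PRIZE_TOP + p_runner_up * PRIZE_RUNNER_UP)

    # One carefully crafted, genuinely hard question (hypothetical odds).
    one_hard = expected_payout(1, p_top=0.10, p_runner_up=0.30)

    # Five "nice" questions you already had on hand (hypothetical odds).
    five_easy = expected_payout(5, p_top=0.02, p_runner_up=0.25)

    print(f"one hard question:   ${one_hard:,.0f}")    # $650
    print(f"five easy questions: ${five_easy:,.0f}")   # $1,125

Under assumptions like these, several quick submissions beat one polished one in expectation, and they also spread the risk across multiple prizes, which is exactly what a risk-averse submitter wants.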

The question set is probably pretty good overall! My point is just that, conditional on a question coming from a specific expertise area, it probably skews easy, due to selection bias. 

#1A Subpoint: it's tough to write closed-ended questions about problems that are still partially open (e.g., gap between lower and upper bounds)

In combinatorics and theoretical computer science,[3] many questions are phrased in terms of giving lower and upper bounds. For example, someone might ask: "What is the greatest possible number of stable matchings that a single matching market instance can have?" (Knowing the problem details is not necessary here. If you like, replace the question with: "What is the greatest possible number of X that a problem instance can have?")

This is currently an open question. The best known lower bound is roughly 2.28^n, that is, there exists a matching market instance of size n with that many stable matchings (source). The best known upper bound is c^n for some universal constant c, that is, it has been mathematically proven that a matching market instance of size n cannot have that many stable matchings (source).

Because this problem is open, it's tough to pose a closed-ended question about it. 

    - You can't ask it to produce the best possible lower bound. What if there is a construction better than 2.28^n? There is no way to verify this in a closed-ended environment.
    - You can't ask it to prove an upper bound. There are potentially many different ways to prove upper bounds for this problem, even if the true bound is 2.28^n, and you have no way of asking this question in a closed-ended way.
    - You could ask it to produce a construction that yields 2.28^n stable matchings (it's probably possible to get around uniqueness concerns)... but this question is much easier than the rest of the research problem. It's easier to solve a problem like this when you have a target to aim for.

Overall, whenever a research question is open in this way -- lower and upper bounds with a gap -- the only closed-ended questions that can be posed are the easiest parts of the problem.
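To illustrate why the "produce a construction and verify it" variant is the easy part to pose: given a claimed small instance, a grader (or a script) can simply count its stable matchings by brute force and check the claim, whereas there is no analogous check for "is this the best possible construction?". Below is a minimal brute-force counter in Python; it is not from the post, the 3x3 example instance is a standard cyclic one, and the approach is only feasible for tiny n.

    from itertools import permutations

    def count_stable_matchings(men_prefs, women_prefs):
        """Brute-force count of stable matchings (feasible only for tiny n).

        men_prefs[m] and women_prefs[w] are preference lists, most preferred first.
        """
        n = len(men_prefs)
        # Rank tables: lower rank = more preferred.
        m_rank = [{w: i for i, w in enumerate(p)} for p in men_prefs]
        w_rank = [{m: i for i, m in enumerate(p)} for p in women_prefs]
        count = 0
        for wives in permutations(range(n)):   # wives[m] = woman matched to man m
            husband = {w: m for m, w in enumerate(wives)}
            # A blocking pair (m, w): both prefer each other to their partners.
            blocked = any(
                m_rank[m][w] < m_rank[m][wives[m]] and w_rank[w][m] < w_rank[w][husband[w]]
                for m in range(n) for w in range(n) if w != wives[m]
            )
            count += not blocked
        return count

    # Standard 3x3 "cyclic" instance with 3 stable matchings.
    men_prefs = [[0, 1, 2], [1, 2, 0], [2, 0, 1]]
    women_prefs = [[1, 2, 0], [2, 0, 1], [0, 1, 2]]
    print(count_stable_matchings(men_prefs, women_prefs))  # prints 3

The verification step is cheap and unambiguous; the genuinely open part of the research question (whether a better construction exists) has no closed-ended analogue.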

#2 Easy jargon-heavy questions overrepresented, difficult but deceptively simple questions underrepresented 

Quoting myself from several paragraphs ago:

HLE questions are crowdsourced. They are written by crowd workers (e.g., random PhD students with a free evening), and evaluated by a noisy process (time-constrained Scale.AI employees and LLMs). 

In math, sometimes problems sound easy but are very difficult. See, for example, Erdős problems. A time-constrained question evaluator, even if they are an expert in a similar area, might not be able to fully grok the difficulty of a question from the problem statement and solution description alone. 

In particular, things that can be hard to accurately estimate include:

- Does solving the question require a clever trick or a deep understanding, or just routine application of known results?
- How hard is the question for an expert in the specific subfield, as opposed to an expert in an adjacent area?

What HLE evaluators are trying to select for is difficult, "research-level" questions. It's tough to answer the above two questions precisely, so inevitably they will have to use proxies. One practical proxy is how jargon-heavy a question is. (There may be others, such as solution length, but I am most confident in the point about jargon.) 

Conclusion

For the above reasons, my model of the math questions in HLE is currently "test questions for first- and second-year PhD students", so similar to GPQA.[4] 

Accordingly, I view the next big open question on the path to building STEM AI to be the design of open-ended STEM benchmarks. 

 

  1. ^

    I'm not saying HLE isn't hard or a useful benchmark! It is. I am just recommending people consider these points, and (if appropriate) downweight their perception of how difficult the benchmark is. 

  2. ^

    This is the case for the ~5 people I know who seriously submitted questions.

  3. ^

    I imagine similar principles apply to other fields outside of combinatorics and theoretical computer science. For example, in physics, people often apply approximations in creative ways. Perhaps, for similar reasons, it might be difficult to write closed-ended questions eliciting these skills. 

  4. ^

    Comparing accuracy numbers on GPQA and HLE directly is misleading. GPQA is multiple-choice questions with 4 options, and HLE can be open-ended. (And with GPQA, the four answer choices have the potential to leak a lot of information about how to solve the question.) 


