Closed-ended questions aren't as hard as you think

Published on February 19, 2025 3:53 AM GMT

Summary

In this short post, I argue that closed-ended questions, even those of arbitrary difficulty, are not as difficult as they may appear. In particular, I argue that the benchmark HLE (Humanity's Last Exam) is probably easier than it may first seem.[1]

Specifically, I argue:

1. Selection bias in the crowdsourcing process causes questions to skew easy, and problems that are still partially open (e.g., with a gap between lower and upper bounds) are especially hard to turn into closed-ended questions.
2. Easy jargon-heavy questions are overrepresented, while difficult but deceptively simple questions are underrepresented.

My background is in mathematics, so in this post I'll be focusing on issues that arise in math question-writing. (Currently, HLE is 41% math questions.) 

#1 Selection bias causes questions to skew easy

HLE questions are crowdsourced. They are written by crowd workers (e.g., random PhD students with a free evening), and evaluated by a noisy process (time-constrained Scale.AI employees and LLMs). 

Crowd workers are incentivized to get as much prize money as possible. Initially, HLE offered $500,000 of prize money: $5,000 each to the top 50 submissions, and $500 each to the next 500 best submissions. Most people are risk averse. Given the structure of the prizes, why sink all of your time into writing one really good question, when you can instead submit several mediocre questions (and potentially get multiple prizes)?

Thus, the median person probably submitted a couple of "nice" questions they had on hand: questions that are easy to state and easy to write a solution for.[2] They probably didn't go through the difficult exercise of thinking: what are some of the more thorny concepts in my subfield? How might I turn these into a tricky closed-ended question?
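To make the incentive concrete, here is a minimal expected-value sketch in Python. This is not from the original post: the prize amounts match the structure described above, but the per-question win probabilities are invented purely for illustration.

    # Toy expected-value sketch of the question-writer's incentive.
    # Prize amounts are from the post; all win probabilities are made up.

    PRIZE_TOP = 5_000        # prize for a top-50 submission
    PRIZE_RUNNER_UP = 500    # prize for a next-500 submission

    def expected_payout(n_questions: int, p_top: float, p_runner_up: float) -> float:
        """Expected prize money from submitting n questions, each independently
        winning a top-50 prize with probability p_top and a next-500 prize
        with probability p_runner_up."""
        return n_questions * (p_top * PRIZE_TOP + p_runner_up * PRIZE_RUNNER_UP)

    # One carefully crafted, genuinely hard question (hypothetical odds).
    one_hard = expected_payout(1, p_top=0.10, p_runner_up=0.30)

    # Five "nice" questions you already had on hand (hypothetical odds).
    five_easy = expected_payout(5, p_top=0.02, p_runner_up=0.25)

    print(f"one hard question:   ${one_hard:,.0f}")    # $650
    print(f"five easy questions: ${five_easy:,.0f}")   # $1,125

Under assumptions like these, several quick submissions beat one polished one in expectation, and they also spread the risk across multiple prizes, which is exactly what a risk-averse submitter wants.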

The question set is probably pretty good overall! My point is just that, conditional on a question coming from a specific expertise area, it probably skews easy, due to selection bias. 

#1A Subpoint: it's tough to write closed-ended questions about problems that are still partially open (e.g., gap between lower and upper bounds)

In combinatorics and theoretical computer science,[3] many questions are phrased in terms of giving lower and upper bounds. For example, someone might ask: "What is the greatest possible number of stable matchings that a single matching market instance can have?" (Knowing the problem details is not necessary here. If you like, replace the question with: "What is the greatest possible number of X that a problem instance can have?")

This is currently an open question. The best known lower bound is roughly 2.28^n, that is, there exists a matching market instance of size n with that many stable matchings (source). The best known upper bound is c^n for some universal constant c, that is, it has been mathematically proven that a matching market instance of size n cannot have that many stable matchings (source).

Because this problem is open, it's tough to pose a closed-ended question about it. 

    - You can't ask it to produce the best possible lower bound. What if there is a construction better than 2.28^n? There is no way to verify this in a closed-ended environment.
    - You can't ask it to prove an upper bound. There are potentially many different ways to prove upper bounds for this problem, even if the true bound is 2.28^n, and you have no way of asking this question in a closed-ended way.
    - You could ask it to produce a construction that yields 2.28^n stable matchings (it's probably possible to get around uniqueness concerns)... but this question is much easier than the rest of the research problem. It's easier to solve a problem like this when you have a target to aim for.

Overall, whenever a research question is open in this way -- lower and upper bounds with a gap -- the only closed-ended questions that can be posed are the easiest parts of the problem.
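To illustrate why the "produce a construction and verify it" variant is the easy part to pose: given a claimed small instance, a grader (or a script) can simply count its stable matchings by brute force and check the claim, whereas there is no analogous check for "is this the best possible construction?". Below is a minimal brute-force counter in Python; it is not from the post, the 3x3 example instance is a standard cyclic one, and the approach is only feasible for tiny n.

    from itertools import permutations

    def count_stable_matchings(men_prefs, women_prefs):
        """Brute-force count of stable matchings (feasible only for tiny n).

        men_prefs[m] and women_prefs[w] are preference lists, most preferred first.
        """
        n = len(men_prefs)
        # Rank tables: lower rank = more preferred.
        m_rank = [{w: i for i, w in enumerate(p)} for p in men_prefs]
        w_rank = [{m: i for i, m in enumerate(p)} for p in women_prefs]
        count = 0
        for wives in permutations(range(n)):   # wives[m] = woman matched to man m
            husband = {w: m for m, w in enumerate(wives)}
            # A blocking pair (m, w): both prefer each other to their partners.
            blocked = any(
                m_rank[m][w] < m_rank[m][wives[m]] and w_rank[w][m] < w_rank[w][husband[w]]
                for m in range(n) for w in range(n) if w != wives[m]
            )
            count += not blocked
        return count

    # Standard 3x3 "cyclic" instance with 3 stable matchings.
    men_prefs = [[0, 1, 2], [1, 2, 0], [2, 0, 1]]
    women_prefs = [[1, 2, 0], [2, 0, 1], [0, 1, 2]]
    print(count_stable_matchings(men_prefs, women_prefs))  # prints 3

The verification step is cheap and unambiguous; the genuinely open part of the research question (whether a better construction exists) has no closed-ended analogue.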

#2 Easy jargon-heavy questions overrepresented, difficult but deceptively simple questions underrepresented 

Quoting myself from several paragraphs ago:

HLE questions are crowdsourced. They are written by crowd workers (e.g., random PhD students with a free evening), and evaluated by a noisy process (time-constrained Scale.AI employees and LLMs). 

In math, sometimes problems sound easy but are very difficult. See, for example, Erdős problems. A time-constrained question evaluator, even if they are an expert in a similar area, might not be able to fully grok the difficulty of a question from the problem statement and solution description alone. 

In particular, things that can be hard to accurately estimate include:

- Does solving the question require a clever trick or a deep understanding, or just routine application of known results?
- How hard is the question for an expert in the specific subfield, as opposed to an expert in an adjacent area?

What HLE evaluators are trying to select for is difficult, "research-level" questions. It's tough to answer the above two questions precisely, so inevitably they will have to use proxies. One practical proxy is how jargon-heavy a question is. (There may be others, such as solution length, but I am most confident in the point about jargon.) 

Conclusion

For the above reasons, my model of the math questions in HLE is currently "test questions for first- and second-year PhD students", so similar to GPQA.[4] 

Accordingly, I view the next big open question on the path to building STEM AI to be the design of open-ended STEM benchmarks. 

 

  1. ^

    I'm not saying HLE isn't hard or a useful benchmark! It is. I am just recommending people consider these points, and (if appropriate) downweight their perception of how difficult the benchmark is. 

  2. ^

    This is the case for the ~5 people I know who seriously submitted questions.

  3. ^

    I imagine similar principles apply to other fields outside of combinatorics and theoretical computer science. For example, in physics, people often apply approximations in creative ways. Perhaps, for similar reasons, it might be difficult to write closed-ended questions eliciting these skills. 

  4. ^

    Comparing accuracy numbers on GPQA and HLE directly is misleading. GPQA is multiple-choice questions with 4 options, and HLE can be open-ended. (And with GPQA, the four answer choices have the potential to leak a lot of information about how to solve the question.) 


