Published on May 4, 2025 11:17 PM GMT
I contributed one (1) task to HCAST, which was used in METR’s Long Tasks paper. This gave me some thoughts I feel moved to share.
Regarding Baselines and Estimates
METR’s tasks have two sources for how long they take humans: most of those used in the paper were Baselined using playtesters under persistent scrutiny, and some were Estimated by METR.
I don’t quite trust the Baselines. Baseliners were allowed/incentivized to drop tasks they weren’t making progress with, and were – mostly, effectively, there’s some nuance here I’m ignoring – cut off at the eight-hour mark; Baseline times were then found by averaging the time taken on successful runs only, which suggests Baseline estimates will be biased at least slightly low, especially for more difficult tasks.[1]
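To make that bias concrete, here’s a toy simulation (mine, not anything METR ran) using the hypothetical 1-10-hour and 1-100-hour tasks from footnote [1] and the eight-hour cutoff: averaging only the runs that finish in time gives roughly the same answer for both.

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_observed_baseline(true_times, cutoff_hours=8.0):
    """Average only the 'successful' runs, i.e. those finishing under the cutoff."""
    successes = true_times[true_times <= cutoff_hours]
    return successes.mean() if len(successes) else float("nan")

# The two hypothetical tasks from footnote [1].
short_task = rng.uniform(1, 10, size=100_000)   # true mean ~5.5 hours
long_task = rng.uniform(1, 100, size=100_000)   # true mean ~50.5 hours

print(mean_observed_baseline(short_task))  # ~4.5 hours
print(mean_observed_baseline(long_task))   # ~4.5 hours as well
```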
I really, really don’t trust the Estimates[2]. My task was never successfully Baselined, so METR’s main source for how long it would take – aside from the lower bound from it never being successfully Baselined – is the number of hours my playtester reported. I was required to recruit and manage my own playtester, and we both got paid more the higher that number was: I know I was completely honest, and I have a very high degree of trust in the integrity of my playtester, but I remain disquieted by the financial incentive for contractors and subcontractors to exaggerate or lie.
When I reconstructed METR’s methodology and reproduced their headline results, I tried filtering for only Baselined tasks to see how that changed things. My answer . . .
. . . is that it almost entirely didn’t. Whether you keep or exclude the Estimated tasks, the log-linear regression still points at AIs doing month-long tasks in 2030 (if you look at the overall trend) or 2028 (if you only consider models since GPT-4o). My tentative explanation for this surprising lack of effect is that A) METR were consistently very good at adjusting away bias in their Estimates and/or B) most of the Estimated tasks were really difficult ones where AIs never won, so errors here had negligible effect on the shapes of logistic regression curves[3].
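For context, the skeleton of that reconstruction looks roughly like the sketch below: a per-model logistic regression of success against log task length to get a 50% time horizon, then a log-linear trend of horizon against release date, extrapolated forward. The column names, the numeric (fractional-year) release dates, and the 167-work-hour definition of “a month” are my assumptions for illustration, not METR’s actual schema.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression, LogisticRegression

# `runs` is assumed to be one row per (model, task) attempt, with hypothetical
# columns: 'model', 'release_date' (numeric, fractional years), 'success' (0/1),
# and 'human_minutes' (the Baselined or Estimated human time).

def fifty_percent_horizon(model_runs: pd.DataFrame) -> float:
    """Fit success ~ log2(human_minutes) for one model and return the task
    length (in minutes) at which the fitted success probability is 50%."""
    X = np.log2(model_runs["human_minutes"].to_numpy()).reshape(-1, 1)
    y = model_runs["success"].to_numpy()
    clf = LogisticRegression().fit(X, y)  # needs at least one success and one failure
    return 2 ** (-clf.intercept_[0] / clf.coef_[0][0])

def month_long_task_date(runs: pd.DataFrame, target_minutes: float = 167 * 60) -> float:
    """Fit a log-linear trend of horizon vs release date and extrapolate to the
    date at which the horizon reaches `target_minutes` (167 work-hours ~ 1 month)."""
    horizons = runs.groupby(["model", "release_date"]).apply(fifty_percent_horizon)
    dates = horizons.index.get_level_values("release_date").to_numpy(float).reshape(-1, 1)
    trend = LinearRegression().fit(dates, np.log(horizons.to_numpy()))
    return float((np.log(target_minutes) - trend.intercept_) / trend.coef_[0])
```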
Regarding Task Privacy
HCAST tasks have four levels of Task Privacy:
- “fully_private” means the task was only seen and used by METR.
- “public_problem” means the task is available somewhere online (so it could have made it into the training data), but the solution isn’t.
- “public_solution” means the task and solution are both available somewhere online.
- “easy_to_memorize” means the solution is public and seemed particularly easy for an LLM to memorize from its training data.
METR’s analysis heavily depends on less-than-perfectly-Private tasks. When I tried redoing it on fully_private tasks only[4][5], the Singularity was rescheduled for mid-2039 (or mid-2032 if you drop everything pre-GPT-4o); I have no idea to what extent this is a fact about reality vs about small sample sizes resulting in strange results.
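Concretely, that refit amounts to something like the snippet below (still using the made-up column names from the sketch above); this is also where the model-dropping described in footnote [5] comes in.

```python
# Hypothetical columns again (continuing the sketch above): 'suite' and 'privacy'.
private_runs = runs[(runs["suite"] == "hcast") & (runs["privacy"] == "fully_private")]

# A logistic fit is undefined for models with zero recorded successes
# (footnote [5]), so drop those models before re-running the pipeline.
ever_succeeded = private_runs.groupby("model")["success"].transform("max") == 1
private_runs = private_runs[ever_succeeded]
```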
Also, all these privacy levels have “as far as we know” stapled to the end. My task is marked as fully_private, but if I’d reused some/all of the ideas in it elsewhere . . . or if I’d done that and then shared a solution . . . or if I’d done both of those things and then someone else had posted a snappy and condensed summary of the solution . . . it’s hard to say how METR could have found out or stopped me[6]. The one thing you can be sure of is that LLMs weren’t trained on tasks which were created after they were built (i.e. models before GPT-4o couldn’t have looked at my task because my task was created in April 2024)[7].
In Conclusion
The Long Tasks paper is a Psychology paper[8]. It’s the good kind of Psychology paper: it focuses on what minds can do instead of what they will do, it doesn’t show any signs of p-hacking, the inevitable biases seem to consistently point in the scarier direction so readers can use it as an upper bound[9], and it was written by hardworking clever people who sincerely care about reaching the right answer. But it’s still a Psychology paper, and should be taken with appropriate quantities of salt.
- ^
A hypothetical task whose completion time is uniformly distributed between 1 and 10 hours would get about the same Baselined time estimate as one uniformly distributed between 1 and 100 hours, conditional on both of them having Baselined time estimates at all.
- ^
I particularly don’t trust the Estimates for my task, because METR’s dataset says the easy version of it takes 18 hours and the hard version takes 10 hours, despite the easy version being easier than the hard version (due to it being the easy version).
- ^
Note that this does not mean they will continue to have negligible effects on next year’s agents.
- ^
I also filtered out all non-HCAST tasks: I wasn’t sure exactly how Private they were, but given that METR had been able to get their hands on the problems and solutions they couldn’t be that Private.
- ^
To do this part of the reanalysis I dropped "GPT-2", "davinci-002 (GPT-3)" and "gpt-3.5-turbo-instruct", as none of these models were ever recorded succeeding on a fully_private task, making their task horizon undefined (ignoring the terminal-ful of warnings and proceeding anyway led to my modelling pipeline confusedly insisting that the End had happened in mid-2024 and I'd just been too self-absorbed to notice).
- ^
I didn't do any of these things. I just take issue with how easily I’m implicitly being trusted.
- ^
If this is the reason for the gradient discontinuity starting at GPT-4o I’m going to be so mad.
- ^
The fact that it’s simultaneously a CompSci paper does not extenuate it.
- ^
I sincerely mean this part. While I’m skeptical of AI Doom narratives, I’m extremely sympathetic to the idea that “this is safe” advocates are the ones who need airtight proofs, while “no it isn’t” counterarguers should be able to win just by establishing reasonable doubt.