LessWrong · May 5, 07:27
Notes on the Long Tasks METR paper, from a HCAST task contributor

 

The author of this piece takes a close look at METR's Long Tasks paper and shares his reactions. He questions the reliability of the baseline data, particularly the estimates of task times. He also discusses task privacy, noting that tasks' privacy may not have been fully protected in the evaluation. Finally, he characterizes the paper as closer to a psychology study, urges readers to treat its conclusions with caution, and stresses continued attention to AI safety.

🧐 Doubts about the baseline data: The author argues that the baselines used to measure task times may be biased, because participants could drop tasks they were unable to finish and were subject to a time limit. The estimates may also have been skewed by participants' financial incentives, distorting the data.

🔑 Discussion of task privacy: The article reviews the different levels of task privacy and notes that METR's analysis relies on tasks that are not fully private. When only fully private tasks are considered, the projected Singularity is pushed back, though the author doubts how reliable that result is.

💡 Assessment of the paper's nature: The author classifies the paper as a psychology study and credits its genuine merits, but reminds readers to treat its conclusions with appropriate caution. On AI safety questions, he believes the skeptical side has the easier case to win.

⚠️ Conclusion and recommendation: The author emphasizes continued attention to AI safety and suggests readers treat the paper's conclusions as an upper bound rather than an absolute prediction.

Published on May 4, 2025 11:17 PM GMT

I contributed one (1) task to HCAST, which was used in METR’s Long Tasks paper. This gave me some thoughts I feel moved to share.

Regarding Baselines and Estimates

METR’s tasks have two sources for how long they take humans: most of those used in the paper were Baselined using playtesters under persistent scrutiny, and some were Estimated by METR.

I don’t quite trust the Baselines. Baseliners were allowed/incentivized to drop tasks they weren’t making progress with, and were – mostly, effectively, there’s some nuance here I’m ignoring – cut off at the eight-hour mark; Baseline times were found by averaging time taken for successful runs; this suggests Baseline estimates will be biased to be at least slightly too low, especially for more difficult tasks.[1]
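To make the direction of this bias concrete, here's a toy simulation. The lognormal task-time distribution and the flat eight-hour cutoff are invented for illustration (they are not METR's data or their exact procedure); the point is just that averaging only the runs that finish in time pulls the estimate below the true mean:

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented task whose true completion time is lognormal with a median of 6 hours.
true_times = rng.lognormal(mean=np.log(6), sigma=0.8, size=10_000)

# Baseliners give up on, or are cut off from, runs that pass the eight-hour mark,
# so only runs finishing under 8 hours count as successes and get averaged.
cutoff_hours = 8.0
successes = true_times[true_times < cutoff_hours]

print(f"true mean time:           {true_times.mean():.1f} h")
print(f"censored 'Baseline' mean: {successes.mean():.1f} h")
print(f"fraction of runs kept:    {len(successes) / len(true_times):.0%}")
```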

I really, really don’t trust the Estimates[2]. My task was never successfully Baselined, so METR’s main source for how long it would take – aside from the lower bound from it never being successfully Baselined – is the number of hours my playtester reported. I was required to recruit and manage my own playtester, and we both got paid more the higher that number was: I know I was completely honest, and I have a very high degree of trust in the integrity of my playtester, but I remain disquieted by the financial incentive for contractors and subcontractors to exaggerate or lie.

When I reconstructed METR’s methodology and reproduced their headline results, I tried filtering for only Baselined tasks to see how that changed things. My answer . . .

. . . is that it almost entirely didn’t. Whether you keep or exclude the Estimated tasks, the log-linear regression still points at AIs doing month-long tasks in 2030 (if you look at the overall trend) or 2028 (if you only consider models since GPT-4o). My tentative explanation for this surprising lack of effect is that A) METR were consistently very good at adjusting away bias in their Estimates and/or B) most of the Estimated tasks were really difficult ones where AIs never won, so errors here had negligible effect on the shapes of logistic regression curves[3].
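For concreteness, the extrapolation step in my reconstruction boils down to something like the sketch below. The horizon numbers are invented placeholders standing in for the per-model 50%-success time horizons that come out of the logistic fits, and treating a "work month" as roughly 167 on-task hours is one convention among several; the point is only the mechanics of the log-linear fit.

```python
import numpy as np

# Invented (release year, 50%-success time horizon in minutes) pairs,
# standing in for the per-model horizons from the logistic regressions.
years    = np.array([2023.0, 2023.6, 2024.3, 2024.8, 2025.1])
horizons = np.array([4.0, 9.0, 25.0, 55.0, 110.0])  # minutes, made up

# Log-linear fit: log(horizon) = slope * year + intercept
slope, intercept = np.polyfit(years, np.log(horizons), deg=1)

# Solve for the year at which the fitted horizon reaches one work month
# (taken here as ~167 hours of on-task time).
target_minutes = 167 * 60
year_at_target = (np.log(target_minutes) - intercept) / slope

print(f"horizon doubling time: {np.log(2) / slope:.2f} years")
print(f"month-long horizon reached around {year_at_target:.1f}")
```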

Regarding Task Privacy

HCAST tasks have four levels of Task Privacy, of which fully_private is the strictest.

METR’s analysis heavily depends on less-than-perfectly-Private tasks. When I tried redoing it on fully_private tasks only[4][5], the Singularity was rescheduled for mid-2039 (or mid-2032 if you drop everything pre-GPT-4o); I have no idea to what extent this is a fact about reality vs about small sample sizes resulting in strange results.
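The filtering step itself is mundane; in my reanalysis it amounted to roughly the following (the file and column names here are illustrative stand-ins, not METR's actual schema):

```python
import pandas as pd

# Illustrative per-run results and per-task metadata.
runs = pd.read_csv("runs.csv")    # one row per (model, task) attempt
tasks = pd.read_csv("tasks.csv")  # task metadata, including the privacy level

merged = runs.merge(tasks, on="task_id")

# Keep only HCAST tasks marked fully_private, then rerun the same
# logistic-regression / log-linear pipeline on what's left.
private_only = merged[
    (merged["task_suite"] == "HCAST")
    & (merged["privacy_level"] == "fully_private")
]
print(f"{len(private_only)} of {len(merged)} runs survive the privacy filter")
```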

Also, all these privacy levels have “as far as we know” stapled to the end. My task is marked as fully_private, but if I’d reused some/all of the ideas in it elsewhere . . . or if I’d done that and then shared a solution . . . or if I’d done both of those things and then someone else had posted a snappy and condensed summary of the solution . . . it’s hard to say how METR could have found out or stopped me[6]. The one thing you can be sure of is that LLMs weren’t trained on tasks which were created after they were built (i.e. models before GPT-4o couldn’t have looked at my task because my task was created in April 2024)[7].

In Conclusion

The Long Tasks paper is a Psychology paper[8]. It’s the good kind of Psychology paper: it focuses on what minds can do instead of what they will do, it doesn’t show any signs of p-hacking, the inevitable biases seem to consistently point in the scarier direction so readers can use it as an upper bound[9], and it was written by hardworking clever people who sincerely care about reaching the right answer. But it’s still a Psychology paper, and should be taken with appropriate quantities of salt.
 

  1. ^

    A hypothetical task whose completion time is uniformly distributed between 1 and 10 hours would end up with about the same Baselined time estimate as one whose completion time is uniformly distributed between 1 and 100 hours, conditional on both of them having Baselined time estimates at all.
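
    To spell this out: with an eight-hour cutoff standing in for the combined effect of the time limit and of baseliners giving up (the specific cutoff value is my simplification), the successful runs for both tasks are draws from the same 1–8 hour window, so

    $$\mathbb{E}[T_{1\text{–}10} \mid T < 8] = \frac{1+8}{2} = 4.5 \text{ h}, \qquad \mathbb{E}[T_{1\text{–}100} \mid T < 8] = \frac{1+8}{2} = 4.5 \text{ h},$$

    even though the unconditional means are 5.5 and 50.5 hours respectively.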

  2. ^

    I particularly don’t trust the Estimates for my task, because METR’s dataset says the easy version of it takes 18 hours and the hard version takes 10 hours, despite the easy version being easier than the hard version (due to it being the easy version).

  3. ^

    Note that this does not mean they will continue to have negligible effects on next year’s agents.

  4. ^

    I also filtered out all non-HCAST tasks: I wasn’t sure exactly how Private they were, but given that METR had been able to get their hands on the problems and solutions they couldn’t be that Private.

  5. ^

    To do this part of the reanalysis I dropped "GPT-2", "davinci-002 (GPT-3)" and "gpt-3.5-turbo-instruct", as none of these models were ever recorded succeeding on a fully_private task, making their task horizon undefined (ignoring the terminal-ful of warnings and proceeding anyway led to my modelling pipeline confusedly insisting that the End had happened in mid-2024 and I'd just been too self-absorbed to notice).

  6. ^

    I didn't do any of these things. I just take issue with how easily I’m implicitly being trusted.

  7. ^

    If this is the reason for the gradient discontinuity starting at GPT-4o I’m going to be so mad.

  8. ^

    The fact that it’s simultaneously a CompSci paper does not extenuate it.

  9. ^

    I sincerely mean this part. While I’m skeptical of AI Doom narratives, I’m extremely sympathetic to the idea that “this is safe” advocates are the ones who need airtight proofs, while “no it isn’t” counterarguers should be able to win just by establishing reasonable doubt.




