LessWrong · July 29, 20:09
About 30% of Humanity’s Last Exam chemistry/biology answers are likely wrong

 

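While testing its literature research AI agents, FutureHouse found significant errors in the biology and chemistry questions of "Humanity’s Last Exam" (HLE). After validation by a team of experts, about 29% of the text-only questions turned out to have answers that directly conflict with peer-reviewed literature. The study suggests that the incentives used to build HLE may be at fault, leaving some questions low in quality, closer to trivia, or with disputed answers. For example, the question about "the rarest noble gas on Earth in 2002" is considered trivia, and its answer, "Oganesson", is at odds with what is known about the element's properties and about whether it counts as terrestrial matter. For another question, on the beyond-use date of a single-dose container ampule after puncture in a sterile environment, the basis for the answer "1 hour" is also disputed; the experts' reading is that the ampule should be used immediately. These findings suggest that even a benchmark claimed to represent PhD-level research can have accuracy problems and needs stricter validation and careful expert review.

🔬 **Accuracy problems in the HLE benchmark**: FutureHouse found that about 29% of the answers in the text-only biology and chemistry subset of "Humanity’s Last Exam" (HLE) directly conflict with peer-reviewed literature, which suggests the benchmark's accuracy needs improvement. FutureHouse attributes this to the incentives used when the benchmark was built.

🧪 **The gap between "PhD-level research" and reality**: Using "the rarest noble gas on Earth in 2002" as an example, the post questions whether HLE questions really reach the depth of PhD-level research and argues that this one is closer to trivia recall. As for the answer "Oganesson", the post notes that none of the element's properties have been measured and that peer-reviewed papers, including ones published in 2002, do not count it as part of terrestrial matter, which contradicts the HLE answer.

💊 **A misread beyond-use date and a rationale that could not be validated**: For the question on the beyond-use date (BUD) of a single-dose container ampule after puncture in a sterile environment, the HLE answer is "1 hour". In the reading of USP <797> by the research team and an independent expert, however, the 1-hour limit applies only to single-dose vials, not to ampules, which should be used immediately. This suggests the HLE answer and rationale may rest on a partial reading or a mis-citation of the standard.

🦋 **The difficulty of literature search and the "univocal" answer**: For the question "Which of the following have Raphidiopterans (snakeflies) been recorded feeding on as adults?", the HLE answer is "Nectar". The researchers could not find direct evidence that snakeflies feed on nectar; the studies they did find report predation of aphids instead. The HLE rationale rules out the other answer choices because they name species endemic to New Zealand, where no snakefly has been found. A domain expert would find "nectar" contradicted by the literature, so the answer is not "univocal".

📊 **The review process and its potential gaps**: HLE's initial review process had limits: reviewers were not required to verify the full accuracy of a question's rationale if doing so would take more than 5 minutes. A second round selected the best questions based on high-quality reviews, and post-release refinement relied on a crowd-sourced bug-bounty system. This design may have let flawed questions and wrong answers through.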

Published on July 29, 2025 11:59 AM GMT

FutureHouse is a company that builds literature research agents. They tested their agents on the bio + chem subset of HLE questions and, in the process, noticed errors in the questions themselves.

The post's first paragraph:

Humanity’s Last Exam has become the most prominent eval representing PhD-level research. We found the questions puzzling and investigated with a team of experts in biology and chemistry to evaluate the answer-reasoning pairs in Humanity’s Last Exam. We found that 29 ± 3.7% (95% CI) of the text-only chemistry and biology questions had answers with directly conflicting evidence in peer reviewed literature. We believe this arose from the incentive used to build the benchmark. Based on human experts and our own research tools, we have created an HLE Bio/Chem Gold, a subset of AI and human validated questions. 
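To make the "29 ± 3.7% (95% CI)" figure concrete, here is a minimal sketch of how an interval of that shape can be computed for a proportion. It assumes a normal-approximation (Wald) interval. The excerpt does not say which method FutureHouse used or how many questions were reviewed, so the sample size in the example is hypothetical and the function names are made up for illustration.

```python
import math


def wald_interval(p_hat: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Normal-approximation (Wald) confidence interval for a proportion.

    z = 1.96 corresponds to a two-sided 95% interval.
    """
    half_width = z * math.sqrt(p_hat * (1.0 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width


def implied_sample_size(p_hat: float, half_width: float, z: float = 1.96) -> float:
    """Sample size at which a Wald interval would have the given half-width."""
    return p_hat * (1.0 - p_hat) * (z / half_width) ** 2


if __name__ == "__main__":
    # Hypothetical sample size, chosen only for illustration; the excerpt
    # does not report how many questions were evaluated.
    n_hypothetical = 500
    low, high = wald_interval(0.29, n_hypothetical)
    print(f"n={n_hypothetical}: {low:.3f} to {high:.3f}")

    # What a +/-3.7-point half-width would imply *if* the interval were Wald.
    print(f"implied n: {implied_sample_size(0.29, 0.037):.0f}")
```

The second function only inverts the same formula. Its result is what a Wald interval would imply, not a reported count, and a different interval method (for example Wilson or bootstrap) would give a different number.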

About the initial review process for HLE questions:

[...] Reviewers were given explicit instructions: “Questions should ask for something precise and have an objectively correct, univocal answer.” The review process was challenging, and performed in two rounds. First reviewers rated questions for domain alignment and complexity, but notably reviewers were not required to verify the full accuracy of a question’s rationale if it would take “more than 5 minutes” (https://arxiv.org/abs/2501.14249). A second round of question selection was performed to find the best questions via having high-quality reviews. Finally, a bug-bounty crowd sourcing system was used for post-release refinement.

So question writers had to verify that frontier models got their questions wrong, while reviewers were not required to check a rationale's full accuracy if doing so would take more than 5 minutes. In practice, reviewers didn’t need to verify correctness.

Some examples they show:

Here’s an example question: What was the rarest noble gas on Earth as a percentage of all terrestrial matter in 2002? Answer: Oganesson

First, we would argue this is not a “PhD level research” question, but instead trivia. [...] Only five atoms of oganesson have ever been made and no properties have been measured. Predictions show that it will be solid, not a gas, and some argue it is not noble because of how reactive it is (10.1002/anie.202011976). [...] There are also multiple peer-reviewed papers published (including in 2002) that list the terrestrial fractions of noble gases and oganesson is not considered as part of terrestrial matter (10.2138/rmg.2002.47.1).

[...]

HLE Question 1: What is the BUD for a single dose container ampule from the time of puncture in a sterile environment?

HLE Answer: 1 hour

HLE Rationale: The Beyond-Use Date (BUD) for a single-dose ampule is determined by the United States Pharmacopeia (USP). [...] It is indeed "best practice" to use an SDC -ampule immediately, however in the USP <797> it states the BUD for an SDC - ampule is 1 hour from the time of puncture or opening. This distinction in time is important as "immediately" is not the same as "1 hour since puncture." [...]

For Question 1, we were unable to validate the rationale, and in our reading of USP <797> it appears that the 1-hour limit in sterile environments only applies to single dose vials, but not ampules. Our reading, as corroborated by an independent expert, is that the correct answer is to use the ampule immediately. One source of confusion that we see deep research style tools make is from slide 45 on this powerpoint (https://www.pharmacy.ca.gov/meetings/agendas/2019/19_mar_cmpd_mat.pdf), which can mislead to believing the 1-hour time limit. This may have led to the original HLE incorrect answer and is often how deep research style tools recapitulate the HLE answer.

HLE Question 2: Which of the following have Raphidiopterans been recorded feeding on as adults?

Answer Choices:

A. Nectar

B. Māhoe pollen

C. Fungus

D. Karamū leaf tissue

E. Totara Aphids

F. A and E

G. D and E

HLE Answer: A

HLE Rationale: Raphidiopterans (snakeflies) are predatory insects, but several papers have recorded them feeding on nectar. The common names given in B, D and E are endemic to New Zealand. No snakefly has ever been found in New Zealand, so there will be no records of them feeding on those species.

We were unable to find good sources for Question 2’s claim of snakeflies feeding on nectar. Several sources remark that other Orders of the Neuropterida Super Order have been recorded eating nectar, but not specifically Raphidiopteran. For example, Jepson and Head examined the stomachs of deceased Neuropterida and consistently found predation of aphids and occasionally pollen, but no nectar (10.11646/zootaxa.4098.1.5). Aldrich and Zhang wrote a review on the ecology of Neuroptera and pointed out some eat nectar (Chrysopa), but not Neuropterida (nor Raphidiopterans) (10.1146/annurev-ento-010715-023507). While the cited work mentioned in the rationale may exist, we were unable to find any specific examples for either the Inocelliidae or Raphidiidae families eating nectar.

Maybe someone saw a Raphidiopteran eat nectar once, which is extremely out of character, and recorded it somewhere in a way that makes keyword search impossible. Certainly, that answer is not a “univocal” answer, and a domain expert doing research would find the answer “nectar” to be widely contradicted by the sources mentioned above.

I initially noticed this post from spiderduckpig in the Manifold Discord server.



