cs.AI updates on arXiv.org 07月23日 12:03
ResearcherBench: Evaluating Deep AI Research Systems on the Frontiers of Scientific Inquiry
index_new5.html
../../../zaker_core/zaker_tpl_static/wap/tpl_guoji1.html

 

本文介绍ResearcherBench,首个评估高级AI科研系统(DARS)能力的基准,通过专家选取的65个研究问题,评估其洞察力、准确性及全面性,并展示OpenAI Deep Research和Gemini Deep Research在开放咨询问题上的优势。

arXiv:2507.16280v1 Announce Type: new Abstract: The emergence of deep research systems presents significant capabilities in problem-solving, extending from basic queries to sophisticated research tasks. However, existing benchmarks primarily evaluate these systems as agents for web retrieval and report generation, overlooking their potential to discover novel insights on the frontiers of scientific research. To address this gap, we introduce ResearcherBench, the first benchmark focused on evaluating the capabilities of these advanced, agentic systems - which we refer to as Deep AI Research Systems (DARS) - on frontier AI scientific questions. We compiled a dataset of 65 research questions expertly selected from real-world scientific scenarios such as laboratory discussions and interviews, spanning 35 different AI subjects and categorized into three types: technical details, literature review, and open consulting. Our dual evaluation framework combines rubric assessment, which uses expert-designed criteria to evaluate insight quality, with factual assessment, which measures citation accuracy (faithfulness) and coverage (groundedness). We evaluated several leading commercial DARS and baseline systems. Results show that OpenAI Deep Research and Gemini Deep Research significantly outperform other systems, with particular strength in open-ended consulting questions. Such capabilities represent a meaningful step toward AI self-improvement, aligning with the vision of ASI for AI. We open-source ResearcherBench to provide a standardized platform for promoting the development of next-generation AI research assistants, hoping to foster a new perspective in AI research evaluation for a novel pattern of scientific collaboration: https://github.com/GAIR-NLP/ResearcherBench.

Fish AI Reader

Fish AI Reader

AI辅助创作,多种专业模板,深度分析,高质量内容生成。从观点提取到深度思考,FishAI为您提供全方位的创作支持。新版本引入自定义参数,让您的创作更加个性化和精准。

FishAI

FishAI

鱼阅,AI 时代的下一个智能信息助手,助你摆脱信息焦虑

联系邮箱 441953276@qq.com

相关标签

AI科研系统 ResearcherBench 能力评估 OpenAI Gemini Deep Research
相关文章