Unite.AI · June 3, 07:22
How Good Are AI Agents at Real Research? Inside the Deep Research Bench Report

This article covers the Deep Research Bench (DRB) report released by the FutureSearch team, a comprehensive evaluation of how large language models (LLMs) perform on "deep research" tasks. The report lays out the strengths and weaknesses of current AI research assistants in multi-step reasoning, information synthesis, and working with web data. Testing a range of AI models, DRB highlights the challenges they face on complex tasks, such as forgetting information, getting stuck in tool-use loops, and failing to verify sources. Even the top models, while impressive, still fall short of a skilled human researcher.

🔍 Deep Research Bench (DRB) is a purpose-built benchmark for evaluating AI performance on multi-step, web-based "deep research" tasks that mirror the challenges researchers face in the real world.

🥇 OpenAI's o3 scored highest on DRB, yet even the best model has not fully surpassed human researchers. Claude 3.7 Sonnet and Gemini 2.5 Pro also performed strongly, each showing advantages in different areas.

🧠 Models face many challenges on complex tasks, including forgetting information, looping on tool use, poor query quality, and drawing conclusions prematurely. They also fall short at cross-checking information and validating their findings.

💡 Even "toolless" AI models (relying only on internal training data) do well on some tasks, but their performance drops sharply on tasks that need up-to-date information or real-time lookups. This shows that deep research depends not only on memory but also on reasoning and on verifying information.

As large language models (LLMs) rapidly evolve, so does their promise as powerful research assistants. Increasingly, they’re not just answering simple factual questions—they’re tackling “deep research” tasks, which involve multi-step reasoning, evaluating conflicting information, sourcing data from across the web, and synthesizing it into a coherent output.

This emerging capability is now being marketed under different brand names by major labs—OpenAI calls it “Deep Research”, Anthropic refers to it as “Extended Thinking”, Google’s Gemini offers “Search + Pro” features, and Perplexity labels theirs “Pro Search” or “Deep Research”. But how effective are these offerings in practice? A new report by FutureSearch, titled Deep Research Bench (DRB): Evaluating Web Research Agents, offers the most rigorous evaluation to date—and the results reveal both impressive capabilities and critical shortcomings.

What Is Deep Research Bench?

Created by the FutureSearch team, Deep Research Bench is a meticulously constructed benchmark designed to assess AI agents' performance on multi-step, web-based research tasks. These aren't simple questions with straightforward answers—they reflect the messy, open-ended challenges faced by analysts, policymakers, and researchers in real-world settings.

The benchmark includes 89 distinct tasks across 8 categories, covering task types such as Validate Claim, Derive Number, and Gather Evidence, which are discussed later in this article.

Each task type is carefully structured with human-verified answers and evaluated using a frozen dataset of scraped web pages, known as RetroSearch. This ensures consistency across model evaluations, avoiding the fluctuating state of the live web.
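As a rough mental model, a single benchmark entry might look something like the record below. The field names here are hypothetical and chosen purely for illustration; they are not the report's actual schema.

```python
from dataclasses import dataclass

@dataclass
class DRBTask:
    """Hypothetical sketch of a Deep Research Bench task record, not the real schema."""
    task_id: str
    category: str      # one of the 8 task categories, e.g. "Validate Claim"
    prompt: str        # the open-ended research question posed to the agent
    gold_answer: str   # human-verified reference answer used for scoring
    snapshot_id: str   # which frozen RetroSearch page set the agent may query

# Illustrative instance with made-up content
example = DRBTask(
    task_id="validate-claim-001",
    category="Validate Claim",
    prompt="Is the following claim plausible: ...?",
    gold_answer="Plausible",
    snapshot_id="retrosearch-frozen-snapshot",
)
```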

The Agent Architecture: ReAct and RetroSearch

At the heart of Deep Research Bench lies the ReAct architecture, short for “Reason + Act.” This method mimics how a human researcher might tackle a problem—by thinking through the task, taking an action like performing a web search, observing the results, and then deciding whether to iterate or conclude.
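The loop itself is simple to picture. Below is a minimal, hypothetical sketch of a ReAct-style agent in Python; the `llm_decide` and `web_search` helpers are placeholder stubs standing in for the model call and the search tool, not FutureSearch's actual implementation.

```python
# Minimal ReAct-style loop, sketched for illustration only. The two helpers
# are hypothetical stubs, not part of the Deep Research Bench codebase.

def llm_decide(context: str) -> tuple[str, str, str]:
    """Stub: a real agent would call an LLM here and parse its reply into
    (thought, action, argument)."""
    return ("I have enough information to answer.", "finish", "stub answer")

def web_search(query: str) -> str:
    """Stub: a real agent would query a search API, or in DRB's case the
    frozen RetroSearch archive."""
    return f"(search results for: {query})"

def react_agent(question: str, max_steps: int = 10) -> str:
    history = [f"Task: {question}"]
    for _ in range(max_steps):
        # Reason: decide the next step from everything seen so far.
        thought, action, argument = llm_decide("\n".join(history))
        history.append(f"Thought: {thought}")
        if action == "search":
            # Act + Observe: run the search and fold the result back into context.
            observation = web_search(argument)
            history.append(f"Action: search[{argument}]")
            history.append(f"Observation: {observation}")
        elif action == "finish":
            # Conclude: the agent judges it has enough evidence to answer.
            return argument
    return "No answer reached within the step budget."

print(react_agent("What does the Deep Research Bench measure?"))
```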

While earlier models follow this loop explicitly, newer “thinking” models often streamline the process, embedding reasoning more fluidly into their actions. To ensure consistency across evaluations, DRB introduces RetroSearch—a custom-built, static version of the web. Rather than relying on the live internet, which constantly changes, agents tap into a curated archive of web pages scraped using tools like Serper, Playwright, and ScraperAPI. The scale is impressive: for high-complexity tasks such as “Gather Evidence,” RetroSearch can provide access to over 189,000 pages, all frozen in time, ensuring a fair and replicable testing environment.
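Conceptually, answering searches from a frozen archive rather than the live web could look like the sketch below. The `FrozenWebArchive` class and the one-JSON-file-per-page layout are assumptions made for illustration, not FutureSearch's actual code.

```python
import json
from pathlib import Path

class FrozenWebArchive:
    """Hypothetical sketch of a RetroSearch-style lookup layer: every query is
    answered from pages scraped ahead of time, never from the live web."""

    def __init__(self, snapshot_dir: str):
        # Assume one JSON file per scraped page: {"url": ..., "title": ..., "text": ...}
        self.pages = [
            json.loads(p.read_text(encoding="utf-8"))
            for p in Path(snapshot_dir).glob("*.json")
        ]

    def search(self, query: str, top_k: int = 5) -> list[dict]:
        # Naive keyword scoring over the frozen corpus. A real system would use
        # a proper index, but the key property is the same: the corpus never
        # changes, so the same query returns the same results on every run.
        terms = query.lower().split()
        scored = [
            (sum(page["text"].lower().count(t) for t in terms), page)
            for page in self.pages
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [page for score, page in scored[:top_k] if score > 0]
```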

Which AI Agents Perform Best?

Among all the contenders, OpenAI’s o3 emerged as the top performer, scoring 0.51 out of a possible 1.0 on the Deep Research Bench. While that might sound modest, it’s important to understand the benchmark’s difficulty: due to ambiguity in task definitions and scoring, even a flawless agent would likely top out around 0.8—what researchers call the “noise ceiling.” In other words, even the best models today still fall short of well-informed, methodical human researchers.

Still, the leaderboard offers revealing insights. o3 not only led the pack but did so with speed and consistency, showing strong performance across nearly all task types. Claude 3.7 Sonnet from Anthropic followed closely, demonstrating versatility in both its “thinking” and “non-thinking” modes. Gemini 2.5 Pro, Google’s flagship model, stood out for its ability to handle tasks requiring structured planning and step-by-step reasoning. Meanwhile, the open-weight DeepSeek-R1 delivered a pleasant surprise—keeping pace with GPT-4 Turbo and narrowing the performance gap between open and closed models.

Across the board, a clear pattern emerged: newer, “thinking-enabled” models consistently outperformed their earlier counterparts, and closed-source models maintained a notable edge over open-weight alternatives.

Where Do Agents Struggle?

Reading through the failure patterns highlighted in the Deep Research Bench report felt surprisingly familiar. One of the most frustrating aspects I’ve personally encountered—especially during long research or content creation sessions—is when an AI agent simply forgets what we were doing. As the context window stretches, the model often begins to lose the thread: key details fade, goals get muddled, and suddenly, the responses feel disjointed or aimless. At some point, I’ve learned it’s often better to cut losses and start from scratch, even if it means throwing away everything that’s been generated so far.

That kind of forgetfulness isn’t just anecdotal—it’s the most significant predictor of failure in the Deep Research Bench evaluation. But it’s not the only recurring issue. The report also highlights how some models fall into repetitive tool use, running the same search over and over as if stuck in a loop. Others show poor query crafting, lazily keyword-matching instead of thinking critically about how to search effectively. And far too often, agents fall victim to premature conclusions—delivering a half-formed answer that technically checks the box but falls short of real insight.

Even among the top models, the differences are stark. GPT-4 Turbo, for example, showed a notable tendency to forget prior steps, while DeepSeek-R1 was more likely to hallucinate or invent plausible-sounding—but incorrect—information. Across the board, models frequently failed to cross-check sources or validate findings before finalizing their output. For anyone who’s relied on AI for serious work, these issues will feel all too familiar—and they underscore how far we still have to go in building agents that can truly think and research like humans.

What About Memory-Based Performance?

Interestingly, Deep Research Bench also evaluated what it calls “toolless” agents—language models operating without any access to external tools, such as web search or document retrieval. These agents rely entirely on their internal training data and memory, generating answers based solely on what they've previously learned during training. In practice, this means they can’t look anything up or verify information—they’re guessing based on what they “remember.”
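In code terms, the difference is simply that the search loop disappears: the agent gets one model call and nothing else. The sketch below is a hypothetical illustration, with `call_model` as a placeholder for the underlying LLM API rather than anything in DRB itself.

```python
def call_model(prompt: str) -> str:
    # Placeholder standing in for an actual LLM API call.
    return "(the model's best guess, drawn only from its training data)"

def toolless_agent(question: str) -> str:
    # No search, no retrieval, no iteration: one prompt in, one answer out.
    prompt = f"Answer from memory only, with no lookups allowed:\n{question}"
    return call_model(prompt)

print(toolless_agent("How plausible is this claim: ...?"))
```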

Surprisingly, these toolless agents performed almost as well as full research agents on certain tasks. For example, on the Validate Claim task—where the goal is to assess the plausibility of a statement—they scored 0.61, nearly matching the 0.62 average of tool-enabled agents. This suggests that models like o3 and Claude have strong internal priors and can often recognize the truthfulness of common claims without needing to search the web.

But on more demanding tasks—like Derive Number, which requires piecing together multiple values from various sources, or Gather Evidence, which depends on finding and evaluating diverse facts in context—these toolless models completely fell apart. Without fresh information or real-time lookup capabilities, they simply lacked the means to produce accurate or comprehensive answers.

This contrast highlights an important nuance: while today’s LLMs can simulate “knowing” a lot, deep research depends not just on recall, but on reasoning with up-to-date, verifiable information—something only tool-augmented agents can truly deliver.

Final Thoughts

The DRB report makes one thing clear: while today’s best AI agents can outpace average humans on narrowly defined tasks, they still lag behind skilled generalist researchers—especially when it comes to planning strategically, adapting mid-process, and reasoning with nuance.

This gap becomes especially obvious during long or complex sessions—something I’ve experienced firsthand, where an agent gradually loses track of the task’s purpose, leading to a frustrating breakdown in coherence and utility.

What makes Deep Research Bench so valuable is that it doesn’t just test surface-level knowledge—it probes the intersection of tool use, memory, reasoning, and adaptation, offering a closer analog to real-world research than benchmarks like MMLU or GSM8K.

As LLMs continue to integrate into serious knowledge work, benchmarks like FutureSearch's DRB will be essential for assessing not just what these systems know, but how well they actually work.

