MarkTechPost@AI · April 11, 04:25
OpenAI Open Sources BrowseComp: A New Benchmark for Measuring the Ability of AI Agents to Browse the Web

OpenAI has released BrowseComp, a new benchmark designed to evaluate AI agents' ability to retrieve complex information on the web. While large language models (LLMs) have made progress, AI agents still struggle when they must locate nuanced, context-dependent information across multiple sources. BrowseComp comprises 1,266 fact-seeking questions that demand persistent browsing, structured reasoning, and dynamic adaptation of search strategies. Test results show that the Deep Research model performed best, while other models, including GPT-4o and GPT-4.5, fared poorly without advanced search or reasoning strategies. The benchmark highlights capabilities critical to AI web browsing and points to directions for future research.

🧐 BrowseComp is a benchmark of AI web-browsing ability comprising 1,266 fact-seeking questions, designed to evaluate how well AI agents retrieve complex information.

🤔 The benchmark uses a "reverse-question" design: starting from a specific fact, questions are constructed to be unsolvable by simple search, challenging an AI's retrieval and reasoning abilities. The questions span science, history, arts, sports, entertainment, and other domains.

🚀 OpenAI evaluated several models on BrowseComp, including GPT-4o, GPT-4.5, OpenAI o1, and Deep Research. Deep Research performed best, at 51.5% accuracy, suggesting the advantage of architectures designed specifically for persistent browsing.

🧑‍💻 Human performance on BrowseComp was also assessed: human trainers spent two hours per question, yet most could not solve the majority of the problems, underscoring the benchmark's difficulty.

Despite advances in large language models (LLMs), AI agents still face notable limitations when navigating the open web to retrieve complex information. While many models excel on static knowledge benchmarks, they often underperform when tasked with locating nuanced, context-dependent facts across multiple sources. Most existing benchmarks evaluate a model’s recall of easily accessible knowledge, which does not reflect the intricacy of real-world browsing tasks. In contrast, agents operating in applied settings—whether assisting with research, summarizing policy, or fact-checking claims—require persistence, structured reasoning, and the ability to dynamically adapt their search strategies. These capabilities remain underdeveloped in current AI systems.

OpenAI Open Sources BrowseComp: A Benchmark of 1,266 Information-Seeking Tasks

To better evaluate these capabilities, OpenAI has released BrowseComp, a benchmark designed to assess agents’ ability to persistently browse the web and retrieve hard-to-find information. The benchmark includes 1,266 fact-seeking problems, each with a short, unambiguous answer. Solving these tasks often requires navigating through multiple webpages, reconciling diverse information, and filtering relevant signals from noise.
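To make the task format concrete, here is a minimal sketch of a task record and scoring loop, assuming a simple question/short-answer schema and an exact-match grader; the field names and the agent interface are illustrative assumptions, not OpenAI's released format.

```python
# Minimal sketch of a BrowseComp-style record and scoring loop.
# The schema and the agent interface are assumptions for illustration.
from dataclasses import dataclass
from typing import Callable

@dataclass
class BrowseTask:
    question: str  # hard-to-find, fact-seeking query
    answer: str    # short, unambiguous reference string

def evaluate(tasks: list[BrowseTask], agent: Callable[[str], str]) -> float:
    """Return the fraction of tasks whose prediction matches the reference."""
    correct = 0
    for task in tasks:
        prediction = agent(task.question)  # the agent may browse, reason, and retry internally
        if prediction.strip().lower() == task.answer.strip().lower():
            correct += 1
    return correct / len(tasks)
```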

The benchmark is inspired by the notion that just as programming competitions serve as focused tests for coding agents, BrowseComp offers a similarly constrained yet revealing evaluation of web-browsing agents. It deliberately avoids tasks with ambiguous user goals or long-form outputs, focusing instead on the core competencies of precision, reasoning, and endurance.

BrowseComp was created using a reverse-question design methodology: beginning with a specific, verifiable fact, human trainers constructed a question that obscures the answer through complexity and constraint. Trainers ensured that questions could not be solved via superficial search and would challenge both retrieval and reasoning capabilities. Additionally, questions were vetted to ensure they would not be easily solvable by GPT-4, OpenAI o1, or earlier browsing-enabled models.
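As a rough illustration of that vetting step, a candidate question might be kept only if every baseline model fails it. The sketch below assumes each baseline is a callable returning a short answer string and uses an exact-match check; both are assumptions, not OpenAI's documented pipeline.

```python
# Sketch of the vetting filter described above: reject any candidate question
# that a baseline model (e.g., a browsing-enabled GPT-4) can already solve.
# The callable interface and exact-match check are illustrative assumptions.
from typing import Callable

def _normalize(s: str) -> str:
    return s.strip().lower()

def passes_vetting(question: str, reference: str,
                   baselines: list[Callable[[str], str]]) -> bool:
    """Keep a candidate only if no baseline answers it correctly."""
    return all(_normalize(model(question)) != _normalize(reference)
               for model in baselines)
```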

The dataset spans a broad range of domains—including science, history, arts, sports, and entertainment—and is balanced to promote topic diversity. Each task is formulated so that the correct answer is a short string, which simplifies evaluation and reduces ambiguity. Human performance was also assessed, with human trainers given two hours per task; most failed to solve the majority of tasks, reflecting their difficulty.

Model Evaluation and Findings

OpenAI evaluated several models on BrowseComp, including GPT-4o (with and without browsing), GPT-4.5, OpenAI o1, and Deep Research—a model specifically trained to handle persistent browsing tasks. The results indicate that models without advanced search or reasoning strategies perform poorly: GPT-4o without browsing achieved 0.6% accuracy, and with browsing enabled, only 1.9%. GPT-4.5 scored similarly low. OpenAI o1, with improved reasoning but no browsing, performed moderately better at 9.9%.

Deep Research outperformed all other models, achieving 51.5% accuracy. Its architecture and training emphasize iterative searching, evidence synthesis, and adaptive navigation. Performance improved further with multiple trials per question and aggregation strategies such as best-of-N selection and confidence-based voting. While Deep Research exhibited higher calibration error—frequently being overconfident in incorrect answers—it often identified its own correct outputs with internal consistency, suggesting a usable confidence signal.
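The two aggregation strategies named above can be sketched as follows, assuming each trial of a question yields an (answer, confidence) pair; that interface is an assumption, not a documented API.

```python
# Sketches of best-of-N selection and confidence-weighted voting over
# multiple trials of the same question. The (answer, confidence) pair
# interface is an illustrative assumption.
from collections import defaultdict

def best_of_n(samples: list[tuple[str, float]]) -> str:
    """Best-of-N: return the single most confident answer."""
    return max(samples, key=lambda s: s[1])[0]

def confidence_vote(samples: list[tuple[str, float]]) -> str:
    """Sum confidence per distinct answer; return the highest-scoring one."""
    scores: dict[str, float] = defaultdict(float)
    for answer, confidence in samples:
        scores[answer.strip().lower()] += confidence
    return max(scores, key=scores.get)
```

For instance, confidence_vote([("1962", 0.5), ("1963", 0.4), ("1962", 0.3)]) returns "1962", since repeated moderately confident agreement can outweigh a single confident outlier.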

Human Performance and Task Difficulty

Human trainers attempted to solve the benchmark problems without the assistance of AI tools. Of the 1,255 attempted tasks, 71% were marked as unsolvable within the two-hour window, and only 29% were successfully completed. Among the completed tasks, the agreement rate with the reference answer was 86.4%, meaning that only about 29% × 86.4% ≈ 25% of attempted tasks ended with an answer matching the reference. These outcomes underscore the complexity of the benchmark and suggest that current AI models still fall short of the adaptability and background reasoning skills needed for such tasks.

Conclusion

BrowseComp introduces a focused, verifiable, and technically demanding benchmark for evaluating the core capabilities of web-browsing agents. By shifting emphasis from static recall to dynamic retrieval and multi-hop reasoning, it presents a realistic challenge that aligns closely with emerging real-world applications. Although current models, including those with browsing capabilities, perform unevenly, the Deep Research agent illustrates the potential of dedicated architectures to bridge this gap.


BrowseComp is publicly available via GitHub and detailed on OpenAI's official blog, with full methodology in the accompanying paper. All credit for this research goes to the researchers of this project.
