MIT Technology Review » Artificial Intelligence · June 23, 23:58
A Chinese firm has just launched a constantly changing set of AI benchmarks

 

Xbench is a new AI model evaluation benchmark developed by HongShan Capital Group (HSG, formerly Sequoia China), designed to measure model performance more comprehensively. Unlike traditional benchmarks, Xbench assesses not only how models perform on academic tests but also how well they can carry out real-world tasks. The benchmark has two parts, science question answering and deep research: the former probes a model's knowledge of STEM fields, while the latter focuses on its ability to find and process information on the Chinese-language internet. Xbench also includes tasks modeled on actual workflows, such as recruitment and marketing, to gauge a model's practical usefulness. It is updated regularly, and part of the question set has been opened to the public, offering a new perspective on AI model evaluation.

💡Xbench has two core components: Xbench-ScienceQA and Xbench-DeepResearch. ScienceQA resembles a traditional academic test, assessing a model's knowledge in fields such as biochemistry and orbital mechanics, with an emphasis on the reasoning process; DeepResearch focuses on a model's ability to retrieve and process information on the Chinese-language internet, requiring it to gather and synthesize information from multiple sources.

💼Xbench also includes tasks based on real workflows to assess a model's practical value. For example, a recruitment task asks the model to screen qualified battery engineer candidates and justify its picks, while a marketing task asks it to match advertisers with short-video creators.

📊Xbench is updated regularly and comes with a public leaderboard. ChatGPT o3 ranks first in several categories; other strong performers include ByteDance's Doubao, Gemini 2.5 Pro, Grok, and Claude Sonnet.

🌐To keep the benchmark from going stale, the Xbench team has committed to updating the test questions every quarter and maintaining a half-public, half-private data set. In the future, Xbench will also add evaluations of model creativity, collaboration, and reliability.

When testing an AI model, it’s hard to tell if it is reasoning or just regurgitating answers from its training data. Xbench, a new benchmark developed by the Chinese venture capital firm HSG, or HongShan Capital Group, might help to sidestep that issue. That’s thanks to the way it evaluates models not only on the ability to pass arbitrary tests, like most other benchmarks, but also on the ability to execute real-world tasks, which is more unusual. It will be updated on a regular basis to try to keep it evergreen. 

This week the company is making part of its question set open-source and letting anyone use it for free. The team has also released a leaderboard comparing how mainstream AI models stack up when tested on Xbench. (ChatGPT o3 ranked first across all categories, though ByteDance’s Doubao, Gemini 2.5 Pro, and Grok all still did pretty well, as did Claude Sonnet.) 

Development of the benchmark at HongShan began in 2022, following ChatGPT’s breakout success, as an internal tool for assessing which models are worth investing in. Since then, led by partner Gong Yuan, the team has steadily expanded the system, bringing in outside researchers and professionals to help refine it. As the project grew more sophisticated, they decided to release it to the public.

Xbench approaches the problem with two different systems. One is similar to traditional benchmarking: an academic test that gauges a model’s aptitude on various subjects. The other is more like a technical interview round for a job, assessing how much real-world economic value a model might deliver.

Xbench’s methods for assessing raw intelligence currently include two components: Xbench-ScienceQA and Xbench-DeepResearch. ScienceQA isn’t a radical departure from existing postgraduate-level STEM benchmarks like GPQA and SuperGPQA. It includes questions spanning fields from biochemistry to orbital mechanics, drafted by graduate students and double-checked by professors. Scoring rewards not only the right answer but also the reasoning chain that leads to it.
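To make that scoring idea concrete, here is a minimal sketch of how a ScienceQA-style item could be graded, assuming a hypothetical rubric in which the final answer and the reasoning chain are scored separately and then combined with fixed weights. The field names and the 60/40 split are illustrative, not Xbench's published method.

```python
from dataclasses import dataclass

@dataclass
class ScienceQAItem:
    question: str
    reference_answer: str
    rubric_steps: list[str]  # reasoning steps an expert grader looks for

def score_response(item: ScienceQAItem, final_answer: str, reasoning: str,
                   answer_weight: float = 0.6) -> float:
    """Blend answer correctness with reasoning-chain coverage (hypothetical weights)."""
    answer_score = 1.0 if final_answer.strip() == item.reference_answer.strip() else 0.0
    # Credit each rubric step that the model's written reasoning actually mentions.
    hits = sum(step.lower() in reasoning.lower() for step in item.rubric_steps)
    reasoning_score = hits / len(item.rubric_steps) if item.rubric_steps else 0.0
    return answer_weight * answer_score + (1 - answer_weight) * reasoning_score
```

In a scheme like this, a model that guesses the right answer without showing its work scores lower than one that also reconstructs the expected chain of reasoning.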

DeepResearch, by contrast, focuses on a model’s ability to navigate the Chinese-language web. Ten subject-matter experts created 100 questions in music, history, finance, and literature—questions that can’t just be googled but require significant research to answer. Scoring favors breadth of sources, factual consistency, and a model’s willingness to admit when there isn’t enough data. A question in the publicized collection is “How many Chinese cities in the three northwestern provinces border a foreign country?” (It’s 12, and only 33% of models tested got it right, if you are wondering.)
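Those criteria can be pictured as a simple rubric. The sketch below assumes three graded signals per answer, with hypothetical weights and a cap on source credit; Xbench has not published an exact formula.

```python
def deep_research_score(num_distinct_sources: int,
                        factual_consistency: float,  # 0.0-1.0, judged against references
                        should_abstain: bool,
                        did_abstain: bool) -> float:
    """Hypothetical rubric: breadth of sources, consistency, and honest abstention."""
    breadth = min(num_distinct_sources, 5) / 5  # cap credit at five distinct sources
    if should_abstain:
        # Reward a model that admits there isn't enough data instead of guessing.
        return 0.7 * (1.0 if did_abstain else 0.0) + 0.3 * breadth
    committed = 0.0 if did_abstain else 1.0      # penalize abstaining when an answer exists
    return 0.5 * factual_consistency + 0.3 * breadth + 0.2 * committed
```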

On the company’s website, the researchers said they want to add more dimensions to the test—for example, aspects like how creative a model is in its problem solving, how collaborative it is when working with other models, and how reliable it is.

The team has committed to updating the test questions once a quarter and to maintaining a half-public, half-private data set.

To assess models’ real-world readiness, the team worked with experts to develop tasks modeled on actual workflows, initially in recruitment and marketing. For example, one task asks a model to source five qualified battery engineer candidates and justify each pick. Another asks it to match advertisers with appropriate short-video creators from a pool of over 800 influencers.
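One way to picture these tasks is as structured records that pair an expert-written brief with a candidate pool and an expert-defined grader. The sketch below is a hypothetical representation under that assumption, not Xbench's released format.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class ProfessionalTask:
    domain: str                            # e.g. "recruitment" or "marketing"
    brief: str                             # instructions handed to the model
    candidate_pool: list[dict]             # resumes, or profiles of short-video creators
    grade: Callable[[list[dict]], float]   # expert-defined scoring of the model's picks

# Hypothetical recruitment task: credit each shortlisted candidate who meets
# an expert-set experience bar, out of the five picks the brief asks for.
recruiting_task = ProfessionalTask(
    domain="recruitment",
    brief="Source five qualified battery engineer candidates and justify each pick.",
    candidate_pool=[{"name": "A", "battery_years": 6}, {"name": "B", "battery_years": 1}],
    grade=lambda picks: sum(p["battery_years"] >= 3 for p in picks) / 5,
)
```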

The website also teases upcoming categories, including finance, legal, accounting, and design. The question sets for these categories have not yet been open-sourced.

ChatGPT o3 again ranks first in both of the current professional categories. For recruiting, Perplexity Search and Claude 3.5 Sonnet take second and third place, respectively. For marketing, Claude, Grok, and Gemini all perform well.

“It is really difficult for benchmarks to include things that are so hard to quantify,” says Zihan Zheng, the lead researcher on a new benchmark called LiveCodeBench Pro and a student at NYU. “But Xbench represents a promising start.”
