Articles related to "Benchmarks"
Google launches the Gemini 2.5 Deep Think model, beating Grok 4 and OpenAI o3 on multiple benchmarks
IT之家
2025-08-01T14:36:00.000000Z
The Ultimate 2025 Guide to Coding LLM Benchmarks and Performance Metrics
MarkTechPost@AI
2025-07-31T08:54:46.000000Z
UserBench: An Interactive Gym Environment for User-Centric Agents
cs.AI updates on arXiv.org
2025-07-30T04:12:06.000000Z
CompoST: A Benchmark for Analyzing the Ability of LLMs To Compositionally Interpret Questions in a QALD Setting
cs.AI updates on arXiv.org
2025-07-30T04:11:50.000000Z
About 30% of Humanity’s Last Exam chemistry/biology answers are likely wrong
LessWrong
2025-07-29T12:09:32.000000Z
Benchmarking and Analyzing Generative Data for Visual Recognition
cs.AI updates on arXiv.org
2025-07-29T04:22:38.000000Z
Protein-SE(3): Benchmarking SE(3)-based Generative Models for Protein Structure Design
cs.AI updates on arXiv.org
2025-07-29T04:22:20.000000Z
CrossPL: Evaluating Large Language Models on Cross Programming Language Code Generation
cs.AI updates on arXiv.org
2025-07-29T04:22:08.000000Z
Can LLMs Solve ASP Problems? Insights from a Benchmarking Study (Extended Version)
cs.AI updates on arXiv.org
2025-07-29T04:21:32.000000Z
OS-MAP: How Far Can Computer-Using Agents Go in Breadth and Depth?
cs.AI updates on arXiv.org
2025-07-28T04:42:42.000000Z
An introduction to the GAIA benchmark
掘金 (Juejin) Artificial Intelligence
2025-07-27T01:37:16.000000Z
AraTable: Benchmarking LLMs' Reasoning and Understanding of Arabic Tabular Data
cs.AI updates on arXiv.org
2025-07-25T04:28:53.000000Z
TELEVAL: A Dynamic Benchmark Designed for Spoken Language Models in Chinese Interactive Scenarios
cs.AI updates on arXiv.org
2025-07-25T04:28:45.000000Z
Reasoning Beyond the Obvious: Evaluating Divergent and Convergent Thinking in LLMs for Financial Scenarios
cs.AI updates on arXiv.org
2025-07-25T04:28:32.000000Z
Benchmarks for AI in Software Engineering
Communications of the ACM - Artificial Intelligence
2025-07-24T16:13:44.000000Z
CausalStep: A Benchmark for Explicit Stepwise Causal Reasoning in Videos
cs.AI updates on arXiv.org
2025-07-24T05:31:05.000000Z
confopt: A Library for Implementation and Evaluation of Gradient-based One-Shot NAS Methods
cs.AI updates on arXiv.org
2025-07-23T04:03:27.000000Z
SDBench: A Comprehensive Benchmark Suite for Speaker Diarization
cs.AI updates on arXiv.org
2025-07-23T04:03:19.000000Z
Document Haystack: A Long Context Multimodal Image/Document Understanding Vision LLM Benchmark
cs.AI updates on arXiv.org
2025-07-23T04:03:12.000000Z
ExCyTIn-Bench: Evaluating LLM agents on Cyber Threat Investigation
cs.AI updates on arXiv.org
2025-07-22T04:34:40.000000Z