Benchmarking_Fishai

Articles related to "Benchmarking"

Google Launches Gemini 2.5 Deep Think Model, Outperforming Grok 4 and OpenAI o3 on Multiple Benchmarks

IT之家 2025-08-01T14:36:00.000000Z

The Ultimate 2025 Guide to Coding LLM Benchmarks and Performance Metrics

MarkTechPost@AI 2025-07-31T08:54:46.000000Z

UserBench: An Interactive Gym Environment for User-Centric Agents

cs.AI updates on arXiv.org 2025-07-30T04:12:06.000000Z

CompoST: A Benchmark for Analyzing the Ability of LLMs To Compositionally Interpret Questions in a QALD Setting

cs.AI updates on arXiv.org 2025-07-30T04:11:50.000000Z

About 30% of Humanity’s Last Exam chemistry/biology answers are likely wrong

LessWrong 2025-07-29T12:09:32.000000Z

Benchmarking and Analyzing Generative Data for Visual Recognition

cs.AI updates on arXiv.org 2025-07-29T04:22:38.000000Z

Protein-SE(3): Benchmarking SE(3)-based Generative Models for Protein Structure Design

cs.AI updates on arXiv.org 2025-07-29T04:22:20.000000Z

CrossPL: Evaluating Large Language Models on Cross Programming Language Code Generation

cs.AI updates on arXiv.org 2025-07-29T04:22:08.000000Z

Can LLMs Solve ASP Problems? Insights from a Benchmarking Study (Extended Version)

cs.AI updates on arXiv.org 2025-07-29T04:21:32.000000Z

OS-MAP: How Far Can Computer-Using Agents Go in Breadth and Depth?

cs.AI updates on arXiv.org 2025-07-28T04:42:42.000000Z

An Introduction to the GAIA Benchmark

掘金 (Juejin) AI 2025-07-27T01:37:16.000000Z

AraTable: Benchmarking LLMs' Reasoning and Understanding of Arabic Tabular Data

cs.AI updates on arXiv.org 2025-07-25T04:28:53.000000Z

TELEVAL: A Dynamic Benchmark Designed for Spoken Language Models in Chinese Interactive Scenarios

cs.AI updates on arXiv.org 2025-07-25T04:28:45.000000Z

Reasoning Beyond the Obvious: Evaluating Divergent and Convergent Thinking in LLMs for Financial Scenarios

cs.AI updates on arXiv.org 2025-07-25T04:28:32.000000Z

Benchmarks for AI in Software Engineering

Communications of the ACM - Artificial Intelligence 2025-07-24T16:13:44.000000Z

CausalStep: A Benchmark for Explicit Stepwise Causal Reasoning in Videos

cs.AI updates on arXiv.org 2025-07-24T05:31:05.000000Z

confopt: A Library for Implementation and Evaluation of Gradient-based One-Shot NAS Methods

cs.AI updates on arXiv.org 2025-07-23T04:03:27.000000Z

SDBench: A Comprehensive Benchmark Suite for Speaker Diarization

cs.AI updates on arXiv.org 2025-07-23T04:03:19.000000Z

Document Haystack: A Long Context Multimodal Image/Document Understanding Vision LLM Benchmark

cs.AI updates on arXiv.org 2025-07-23T04:03:12.000000Z

ExCyTIn-Bench: Evaluating LLM agents on Cyber Threat Investigation

cs.AI updates on arXiv.org 2025-07-22T04:34:40.000000Z
