MarkTechPost@AI October 11, 2024
ScienceAgentBench: A Rigorous AI Evaluation Framework for Language Agents in Scientific Discovery

ScienceAgentBench is a robust benchmark for evaluating language agents in data-driven discovery. Built on the principles of scientific authenticity, rigorous graded evaluation, and careful multi-stage quality control, it comprises 102 diverse tasks from four scientific disciplines and is designed to assess language agents' ability to automate scientific workflows, revealing their strengths and limitations in real-world scientific scenarios.

🎯 ScienceAgentBench is a comprehensive evaluation framework designed to assess how language agents perform on essential tasks in data-driven discovery workflows. It formulates each task as a code generation problem, requiring agents to produce executable Python programs from natural language instructions, dataset information, and optional expert-provided knowledge.

📚 The benchmark carefully curates 102 distinct tasks from 44 peer-reviewed publications across four scientific disciplines to ensure real-world relevance and minimize the generalization gap. Its construction involved multiple stages, including task annotation, data contamination mitigation, expert validation, and annotator verification.

💪 Evaluating language agents on ScienceAgentBench shows Claude-3.5-Sonnet performing best, with a 32.4% success rate without expert knowledge and 34.3% with expert knowledge under the self-debug framework. Self-debugging proved highly effective, nearly doubling Claude-3.5-Sonnet's success rate compared with direct prompting.

🔍 Despite these gains, the evaluation shows that current language agents still struggle with complex tasks, particularly those involving specialized tools or heterogeneous data processing in fields such as bioinformatics and computational chemistry.

Large language models (LLMs) have emerged as powerful tools capable of performing complex tasks beyond text generation, including reasoning, tool learning, and code generation. These advances have sparked significant interest in developing LLM-based language agents to automate scientific discovery. Researchers are exploring the potential of such agents to revolutionize data-driven discovery workflows across disciplines, with the ambitious goal of creating automated systems that handle the entire research process, from generating ideas to conducting experiments and writing papers. This vision, however, faces numerous challenges, including the need for robust reasoning, effective tool use, and the ability to navigate the complexities of scientific inquiry. The true capabilities of such agents remain a subject of both excitement and skepticism within the research community.

Researchers from The Ohio State University (Computer Science and Engineering, Pharmacy, Psychology, and Biomedical Informatics) and the University of Wisconsin–Madison (Geography and Chemistry) present ScienceAgentBench, a robust benchmark designed to evaluate language agents for data-driven discovery. This comprehensive evaluation framework is built on three key principles: scientific authenticity, rigorous graded evaluation, and careful multi-stage quality control. The benchmark curates 102 diverse tasks from 44 peer-reviewed publications across four scientific disciplines, ensuring real-world relevance and minimizing the generalization gap. ScienceAgentBench adopts a unified output format of self-contained Python programs, enabling consistent evaluation through metrics that examine generated code, execution results, and associated costs. Its construction involves multiple rounds of validation by annotators and subject matter experts, with strategies implemented to mitigate data contamination concerns. This approach provides a nuanced, objective assessment of language agents' capabilities in automating scientific workflows and offers insight into their strengths and limitations in real-world scientific scenarios.
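
To make the unified output format concrete, the sketch below shows one plausible way a submitted self-contained program could be executed and checked. The helper names (run_submission, evaluate_task) and the simple file-existence check are assumptions for illustration, not the benchmark's actual grading pipeline.

```python
# Minimal sketch of one way a grader could score a self-contained Python
# submission: execute it in a subprocess, then apply a task-level output check.
# All function names, paths, and checks here are illustrative assumptions,
# not the benchmark's actual grading code.
import subprocess
import sys
from pathlib import Path


def run_submission(program: Path, workdir: Path, timeout_s: int = 600) -> bool:
    """Execute a generated program; return True if it exits cleanly."""
    try:
        result = subprocess.run(
            [sys.executable, str(program)],
            cwd=workdir,
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0


def evaluate_task(program: Path, workdir: Path, expected_output: str) -> dict:
    """Combine an execution check with a simple output-existence check."""
    executed = run_submission(program, workdir)
    # A real grader would use task-specific metrics (e.g., comparing figures
    # or dataframes); here we only verify that the expected file was written.
    success = executed and (workdir / expected_output).exists()
    return {"executed": executed, "success": success}
```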

ScienceAgentBench is a comprehensive benchmark designed to evaluate language agents on essential tasks in data-driven discovery workflows. The benchmark formulates each task as a code generation problem, requiring agents to produce executable Python programs based on natural language instructions, dataset information, and optional expert-provided knowledge. Each task in ScienceAgentBench consists of four key components: a concise task instruction, dataset information detailing structure and content, expert-provided knowledge offering disciplinary context, and an annotated program adapted from peer-reviewed publications. The benchmark’s construction involved a meticulous process of task annotation, data contamination mitigation, expert validation, and annotator verification. To ensure authenticity and relevance, 102 diverse tasks were curated from 44 peer-reviewed publications across four scientific disciplines. ScienceAgentBench implements strategies to mitigate data contamination and prevent agents from taking shortcuts, including dataset modifications and test set manipulations. This rigorous approach ensures a robust evaluation framework for assessing language agents’ capabilities in real-world scientific scenarios.
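
For illustration, the four task components could be represented as a simple record like the one below; the ScienceAgentTask class, its field names, and the example values are hypothetical and do not reflect the benchmark's published schema.

```python
# Illustrative-only record of the four task components described above.
# Field names and example values are assumptions, not the dataset's schema.
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScienceAgentTask:
    task_instruction: str             # concise natural-language goal
    dataset_info: str                 # structure and content of the provided data
    expert_knowledge: Optional[str]   # optional disciplinary context
    annotated_program: str            # reference program adapted from a publication


example_task = ScienceAgentTask(
    task_instruction="Fit a model to the provided measurements and save a summary plot.",
    dataset_info="A CSV file of feature columns plus a target column.",
    expert_knowledge="Hypothetical domain hint supplied by a subject matter expert.",
    annotated_program="reference_solution.py",
)
```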

The evaluation of language agents on ScienceAgentBench reveals several key insights into their performance in data-driven discovery tasks. Claude-3.5-Sonnet emerged as the top-performing model, achieving a success rate of 32.4% without expert knowledge and 34.3% with expert knowledge using the self-debug framework. This performance significantly outpaced direct prompting methods, which achieved only 16.7% and 20.6% success rates respectively. The self-debug approach proved highly effective, nearly doubling the success rate compared to direct prompting for Claude-3.5-Sonnet. Interestingly, the self-debug method also outperformed the more complex OpenHands CodeAct framework for most models, with Claude-3.5-Sonnet solving 10.8% more tasks at 17 times lower API cost. Expert-provided knowledge generally improved success rates and code-based similarity scores but sometimes led to decreased verification rates due to increased complexity in tool usage. Human evaluation corroborated these findings, showing clear distinctions between successful and failed programs, particularly in the data loading and processing stages. Despite these advancements, the results indicate that current language agents still struggle with complex tasks, especially those involving specialized tools or heterogeneous data processing in fields like Bioinformatics and Computational Chemistry.
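
The self-debug setup referenced here can be pictured as a generate-run-revise loop: draft a program, execute it, and feed any traceback back to the model for another attempt. The sketch below assumes a placeholder generate_program helper standing in for the LLM call and is not the authors' implementation.

```python
# Minimal sketch of a generate-run-revise (self-debug) loop. `generate_program`
# is a placeholder for an LLM call (e.g., to Claude-3.5-Sonnet); this is an
# assumed outline, not the authors' implementation.
import subprocess
import sys
import tempfile


def generate_program(prompt: str) -> str:
    """Placeholder for an LLM call that returns Python source code."""
    raise NotImplementedError


def self_debug(task_prompt: str, max_rounds: int = 3) -> str:
    code = generate_program(task_prompt)
    for _ in range(max_rounds):
        # Write the candidate program to a temporary file and try to run it.
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(code)
            path = f.name
        try:
            result = subprocess.run(
                [sys.executable, path], capture_output=True, text=True, timeout=600
            )
        except subprocess.TimeoutExpired:
            error = "Program timed out."
        else:
            if result.returncode == 0:
                return code  # ran without errors; hand off to the grader
            error = result.stderr
        # Feed the error back so the model can revise its own program.
        revision_prompt = (
            f"{task_prompt}\n\nPrevious program:\n{code}\n\nError:\n{error}"
        )
        code = generate_program(revision_prompt)
    return code
```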

ScienceAgentBench introduces a rigorous benchmark for evaluating language agents in data-driven scientific discovery. Comprising 102 real-world tasks from diverse scientific disciplines, the benchmark reveals the current limitations of language agents, with the best-performing model solving only 34.3% of tasks. This outcome challenges claims of full automation in scientific workflows and emphasizes the need for more robust evaluation methods. ScienceAgentBench serves as a crucial testbed for developing enhanced language agents, focusing on improving scientific data processing and knowledge utilization. It also paves the way for designing advanced automatic grading metrics, potentially incorporating LLM-based judges using task-specific rubrics.


Check out the paper. All credit for this research goes to the researchers of this project.




Related tags

ScienceAgentBench · Language agents · Data-driven discovery · Evaluation benchmark