MarkTechPost@AI July 13, 2024
Can LLMs Help Accelerate the Discovery of Data-Driven Scientific Hypotheses? Meet DiscoveryBench: A Comprehensive LLM Benchmark that Formalizes the Multi-Step Process of Data-Driven Discovery

DISCOVERYBENCH is a new benchmark designed to evaluate the capabilities of large language models (LLMs) in data-driven discovery. It defines a discovery task as the search for relationships between variables within a given context and uses a structured hypothesis representation and evaluation method, providing a comprehensive assessment of real-world data-driven discovery. DISCOVERYBENCH comprises two main components: DB-REAL, which contains real-world hypotheses and workflows drawn from six domains, and DB-SYNTH, a synthetic benchmark for controlled model evaluation. Evaluating several discovery agents powered by different language models, the researchers found that overall performance was low across all agent-LLM pairs, underscoring the benchmark's difficulty.

🤔 **Goal of DISCOVERYBENCH:** DISCOVERYBENCH aims to systematically evaluate the capabilities of large language models (LLMs) in automated data-driven discovery. It defines a discovery task as the search for relationships between variables within a given context and uses a structured hypothesis representation and evaluation method, providing a comprehensive assessment of real-world data-driven discovery.

🚀 **Composition of DISCOVERYBENCH:** DISCOVERYBENCH comprises two main components: DB-REAL and DB-SYNTH. DB-REAL contains real-world hypotheses and workflows from six domains: sociology, biology, humanities, economics, engineering, and meta-science. DB-SYNTH is a synthetically generated benchmark that allows for controlled model evaluation; it uses large language models to generate diverse domains, construct semantic hypothesis trees, create synthetic datasets, and formulate discovery tasks of varying difficulty.

📊 **Evaluation results:** The researchers evaluated several discovery agents powered by different language models (GPT-4o, GPT-4p, and Llama-3-70B) on the DISCOVERYBENCH dataset. The agents include CodeGen, ReAct, DataVoyager, Reflexion (Oracle), and NoDataGuess. Results show that overall performance is low across all agent-LLM pairs on both DB-REAL and DB-SYNTH, highlighting the benchmark's challenging nature.

💡 **Significance of DISCOVERYBENCH:** DISCOVERYBENCH provides a comprehensive framework for evaluating LLMs' capabilities in data-driven discovery and gives researchers a platform to further explore and improve those capabilities, advancing the automation of scientific discovery.

Scientific discovery has been a cornerstone of human advancement for centuries, traditionally relying on manual processes. However, the emergence of large language models (LLMs) with advanced reasoning capabilities and the ability to interact with external tools and agents has opened up new possibilities for autonomous discovery systems. The challenge lies in developing a fully autonomous system capable of generating and verifying hypotheses within the realm of data-driven discovery. While recent studies have shown promising results in this direction, the full extent of LLMs’ potential in scientific discovery remains uncertain. Researchers are now faced with the task of exploring and expanding the capabilities of these AI systems to revolutionize the scientific process, potentially accelerating the pace of discovery and innovation across various fields.

Previous attempts at automated data-driven discovery have ranged from early systems like Bacon, which fitted equations to idealized data, to more advanced solutions like AlphaFold, capable of handling complex real-world problems. However, these systems often relied on task-specific datasets and pre-built pipelines. AutoML tools, such as Scikit and cloud-based solutions, have made strides in automating machine learning workflows, but their datasets are primarily used for model training rather than open-ended discovery tasks. Similarly, statistical analysis datasets and software packages like Tableau, SAS, and R support data analysis but are limited in scope. The QRData dataset represents a step towards exploring LLM capabilities in statistical and causal analysis, but it focuses on well-defined questions with unique, primarily numeric answers. These existing approaches, while valuable, fall short of providing a comprehensive solution for automating the entire discovery process, including ideation, semantic reasoning, and pipeline design.

Researchers from the Allen Institute for AI, OpenLocus, and the University of Massachusetts Amherst propose DISCOVERYBENCH, which aims to systematically evaluate the capabilities of state-of-the-art large language models (LLMs) in automated data-driven discovery. This benchmark addresses the challenges of diversity in real-world data-driven discovery across various domains by introducing a pragmatic formalization. It defines discovery tasks as searching for relationships between variables within a specific context, where the description of these elements may not directly correspond to the dataset’s language. This approach allows for systematic and reproducible evaluation of a wide range of real-world problems by utilizing key facets of the discovery process.

DISCOVERYBENCH distinguishes itself from previous datasets for statistical analysis or AutoML by incorporating scientific semantic reasoning. This includes deciding on appropriate analysis techniques for specific domains, data cleaning and normalization, and mapping goal terms to dataset variables. The tasks typically require multistep workflows, addressing the broader data-driven discovery pipeline rather than focusing solely on statistical analysis. This comprehensive approach makes DISCOVERYBENCH the first large-scale dataset to explore LLMs’ capacity for the entire discovery process.

The researchers begin by formalizing data-driven discovery, introducing a structured approach to hypothesis representation and evaluation. The framework defines hypotheses as declarative sentences verifiable through datasets, breaking them down into contexts, variables, and relationships. The key innovation is the Hypothesis Semantic Tree, a hierarchical structure representing complex hypotheses with interconnected variables; a single tree can encode multiple hypotheses at once. The method also formalizes task datasets as collections of tuples supporting multiple hypothesis semantic trees, with varying degrees of observability. This framework provides a flexible yet rigorous approach to representing and evaluating complex discovery problems, enabling systematic assessment of automated discovery systems.
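To make this formalization concrete, here is a minimal Python sketch of a hypothesis and a hypothesis semantic tree. The class names, fields, and the `enumerate_hypotheses` helper are illustrative assumptions, not the paper's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class Hypothesis:
    """A declarative statement verifiable against a dataset: within a
    context, a relationship holds between a set of variables."""
    context: str          # e.g. "regions surveyed between 1990 and 2000"
    variables: list[str]  # e.g. ["urbanization rate", "fertility rate"]
    relationship: str     # e.g. "is negatively associated with"

@dataclass
class SemanticTreeNode:
    """One node of a hypothesis semantic tree: an internal node's
    variable relates to each of its children's variables, so one
    tree encodes many interconnected hypotheses at once."""
    variable: str
    children: list["SemanticTreeNode"] = field(default_factory=list)

def enumerate_hypotheses(node: SemanticTreeNode, context: str):
    """Yield one hypothesis per parent-child edge in the tree."""
    for child in node.children:
        yield Hypothesis(context, [node.variable, child.variable],
                         relationship="depends on")
        yield from enumerate_hypotheses(child, context)
```

A root node for, say, "fertility rate" with children "urbanization rate" and "education level" would then yield one hypothesis per edge, illustrating how a single tree structure compactly represents a family of related hypotheses.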

DISCOVERYBENCH consists of two main components: DB-REAL and DB-SYNTH. DB-REAL encompasses real-world hypotheses and workflows derived from published scientific papers across six domains: sociology, biology, humanities, economics, engineering, and meta-science. It includes tasks that often require analysis of multiple datasets, with workflows ranging from basic data preparation to advanced statistical analyses. DB-SYNTH, on the other hand, is a synthetically generated benchmark that allows for controlled model evaluations. It uses large language models to generate diverse domains, construct semantic hypothesis trees, create synthetic datasets, and formulate discovery tasks of varying difficulty. This dual approach allows DISCOVERYBENCH to capture both the complexity of real-world discovery problems and the systematic variation needed for comprehensive model evaluation.
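As a rough illustration of what a single benchmark instance might contain, the record below pairs datasets, a natural-language discovery goal, and a gold hypothesis. Every field name and value is a made-up example in the spirit of DB-REAL, not DiscoveryBench's actual file format.

```python
# A hypothetical DB-REAL-style task record; all names and values are
# illustrative, not the benchmark's real schema.
task = {
    "domain": "sociology",                               # one of the six domains
    "datasets": ["survey_1990.csv", "survey_2000.csv"],  # tasks may span multiple files
    "query": "How does urbanization relate to fertility across regions?",
    "gold_hypothesis": {
        "context": "regions surveyed between 1990 and 2000",
        "variables": ["urbanization rate", "fertility rate"],
        "relationship": "negatively associated",
    },
    "workflow": ["clean missing values", "merge surveys", "run regression"],
}
```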

The study evaluates several discovery agents powered by different language models (GPT-4o, GPT-4p, and Llama-3-70B) on the DISCOVERYBENCH dataset. The agents include CodeGen, ReAct, DataVoyager, Reflexion (Oracle), and NoDataGuess. Results show that overall performance is low across all agent-LLM pairs for both DB-REAL and DB-SYNTH, highlighting the benchmark’s challenging nature. Surprisingly, advanced reasoning prompts (ReAct) and planning with self-criticism (DataVoyager) do not significantly outperform the simple CodeGen agent. However, Reflexion (Oracle), which uses feedback for improvement, shows notable gains over CodeGen. The study also reveals that non-reflexion agents mainly solve the easiest instances, and performance on DB-REAL and DB-SYNTH is similar, validating the synthetic benchmark’s ability to capture real-world complexities.
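A minimal sketch of the kind of agent-LLM evaluation grid this describes, assuming placeholder callables: `run_agent` stands in for the five agents and `judge_match` for the benchmark's hypothesis comparison, which judges whether a predicted hypothesis captures the gold one rather than checking an exact answer.

```python
from itertools import product

AGENTS = ["CodeGen", "ReAct", "DataVoyager", "Reflexion (Oracle)", "NoDataGuess"]
MODELS = ["GPT-4o", "GPT-4p", "Llama-3-70B"]

def evaluate(tasks, run_agent, judge_match):
    """Score every agent-LLM pair on a list of tasks.

    run_agent(agent, model, task) -> the hypothesis the agent finally states;
    judge_match(predicted, gold)  -> True if the prediction captures the gold
    hypothesis. Both are placeholders, not real APIs: the actual benchmark
    compares hypotheses by their parts (context, variables, relationship),
    not by exact string match.
    """
    scores = {}
    for agent, model in product(AGENTS, MODELS):
        hits = sum(
            bool(judge_match(run_agent(agent, model, task),
                             task["gold_hypothesis"]))
            for task in tasks
        )
        scores[(agent, model)] = hits / len(tasks)  # fraction of tasks solved
    return scores
```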

DISCOVERYBENCH represents a significant advancement in evaluating automated data-driven discovery systems. This comprehensive benchmark includes 264 real-world discovery tasks derived from published scientific workflows, complemented by 903 synthetically generated tasks designed to assess discovery agents at various difficulty levels. Despite employing state-of-the-art reasoning frameworks powered by advanced large language models, the best-performing agent only achieves a 25% success rate. This modest performance underscores the challenging nature of automated scientific discovery and highlights the considerable room for improvement in this field. By providing this timely and robust evaluation framework, DISCOVERYBENCH aims to stimulate increased interest and research efforts in developing more reliable and reproducible autonomous scientific discovery systems using large generative models.


Check out the Paper. All credit for this research goes to the researchers of this project.
