Data Readiness for Scientific AI at Scale

cs.AI updates on arXiv.org, two days ago, 12:08

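This paper explores how data readiness principles apply to leadership-scale scientific datasets used to train foundation models. It analyzes archetypal workflows in four domains (climate, nuclear fusion, bio/health, and materials), proposes a two-dimensional data readiness framework suited to high performance computing environments, and highlights the key challenges of scalable AI training.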

arXiv:2507.23018v1 Announce Type: new

Abstract: This paper examines how Data Readiness for AI (DRAI) principles apply to leadership-scale scientific datasets used to train foundation models. We analyze archetypal workflows across four representative domains (climate, nuclear fusion, bio/health, and materials) to identify common preprocessing patterns and domain-specific constraints. We introduce a two-dimensional readiness framework composed of Data Readiness Levels (raw to AI-ready) and Data Processing Stages (ingest to shard), both tailored to high performance computing (HPC) environments. This framework outlines key challenges in transforming scientific data for scalable AI training, emphasizing transformer-based generative models. Together, these dimensions form a conceptual maturity matrix that characterizes scientific data readiness and guides infrastructure development toward standardized, cross-domain support for scalable and reproducible AI for science.
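As a rough illustration of the two-dimensional maturity matrix described in the abstract, the sketch below models the two axes as enumerations and places datasets in cells. Only the endpoints (raw, AI-ready, ingest, shard) come from the abstract. The intermediate level and stage names, the dataset names, and the helpers `empty_matrix` and `remaining_steps` are hypothetical and are not taken from the paper.

```python
from enum import IntEnum
from typing import Dict, List, Tuple


class ReadinessLevel(IntEnum):
    RAW = 0           # endpoint named in the abstract
    CLEANED = 1       # hypothetical intermediate level
    STANDARDIZED = 2  # hypothetical intermediate level
    AI_READY = 3      # endpoint named in the abstract


class ProcessingStage(IntEnum):
    INGEST = 0        # endpoint named in the abstract
    PREPROCESS = 1    # hypothetical intermediate stage
    TRANSFORM = 2     # hypothetical intermediate stage
    SHARD = 3         # endpoint named in the abstract


Cell = Tuple[ReadinessLevel, ProcessingStage]


def empty_matrix() -> Dict[Cell, List[str]]:
    """Build one empty cell per (level, stage) pair."""
    return {(lvl, stg): [] for lvl in ReadinessLevel for stg in ProcessingStage}


def remaining_steps(level: ReadinessLevel, stage: ProcessingStage) -> Tuple[int, int]:
    """Distance along each axis to the (AI_READY, SHARD) corner."""
    return (ReadinessLevel.AI_READY - level, ProcessingStage.SHARD - stage)


matrix = empty_matrix()
# Hypothetical dataset names, used only to populate the matrix.
matrix[(ReadinessLevel.RAW, ProcessingStage.INGEST)].append("climate_reanalysis")
matrix[(ReadinessLevel.AI_READY, ProcessingStage.SHARD)].append("fusion_shots")

for (lvl, stg), datasets in matrix.items():
    for name in datasets:
        print(name, lvl.name, stg.name, remaining_steps(lvl, stg))
```

Treating both axes as ordered enumerations keeps the matrix comparable across domains: a dataset's position is a pair of integers, so progress toward the AI-ready, sharded corner can be tracked the same way for climate, fusion, bio/health, or materials data.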

