MarkTechPost@AI · 18 hours ago
R-Zero: A Fully Autonomous AI Framework that Generates Its Own Training Data from Scratch

The R-Zero framework rethinks how reasoning is trained in large language models (LLMs), achieving, for the first time, self-evolution without any external data labels. The framework relies on the co-evolution of two instances of the same model, a "Challenger" and a "Solver": the Challenger generates reasoning tasks of suitable difficulty, while the Solver continually learns to solve them, steadily improving its reasoning ability. This "zero-data" training regime uses the model's own uncertainty as the reward signal and combines the GRPO algorithm with an uncertainty-driven curriculum, removing the traditional bottleneck of human-annotated datasets. It yields clear performance gains on both mathematical and general reasoning benchmarks and opens a new path toward scalable, dependency-free AI reasoning.

💡 **Self-evolving training framework**: R-Zero proposes an entirely new paradigm for training LLM reasoning that removes the dependence on externally human-annotated datasets. It introduces a "Challenger" model and a "Solver" model that improve their reasoning together through co-evolution: the Challenger generates new, challenging reasoning tasks, and the Solver works to solve them.

🚀 **"Zero-data" training mechanism**: The framework's core innovation is its "zero-data" training method. The Challenger is trained with reinforcement learning (GRPO) to generate questions that are hard for the Solver; its reward is based on the uncertainty of the Solver's answers and is highest when those answers are maximally inconsistent (accuracy close to 50%), as sketched after these highlights. The Solver is then fine-tuned on the questions the Challenger generates, with pseudo-labels obtained by majority vote.

🧠 **Uncertainty-driven curriculum learning**: R-Zero adopts an uncertainty-driven curriculum strategy that keeps the training data at the frontier of the Solver's ability. The Challenger is rewarded for generating tasks that are neither too easy nor too hard, driving the Solver's accuracy toward roughly 50% and maximizing learning efficiency. A repetition penalty and format checks further safeguard the diversity and quality of the training data.

📈 **Substantial performance gains**: Across several rigorous mathematical reasoning benchmarks (such as AMC and GSM8K) and general reasoning benchmarks (such as MMLU-Pro and BIG-Bench Extra Hard), the R-Zero framework demonstrates strong capability. After iterative training, models improve markedly in both mathematical and general reasoning accuracy, confirming the framework's effectiveness at strengthening LLM reasoning.

🌐 **A new direction for AI development**: R-Zero marks an important milestone in AI, offering a viable path toward self-sufficient LLM reasoning that could eventually exceed human-level performance. Its open-source release encourages researchers and developers to experiment with it and to jointly drive a new wave of scalable, data-independent AI development.
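To make the uncertainty-based reward concrete, here is a minimal sketch in Python. It assumes a simple linear shape that peaks at 50% Solver accuracy; this matches the description above but is an illustration, not code taken from the R-Zero release.

```python
def challenger_uncertainty_reward(solver_accuracy: float) -> float:
    """Illustrative reward for one generated question.

    `solver_accuracy` is the fraction of the Solver's sampled answers that
    agree with the majority-vote (pseudo-label) answer. The reward peaks at
    0.5 (maximal Solver uncertainty) and drops to 0 when the Solver is
    either always right or always wrong on the question.
    """
    return 1.0 - 2.0 * abs(solver_accuracy - 0.5)

# A question the Solver always solves earns no reward,
# while one it solves half the time earns the maximum reward.
print(challenger_uncertainty_reward(1.0))   # 0.0
print(challenger_uncertainty_reward(0.5))   # 1.0
print(challenger_uncertainty_reward(0.25))  # 0.5
```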

Large Language Models (LLMs) have revolutionized fields from natural language understanding to reasoning and code generation. However, pushing their reasoning ability to truly superhuman levels has been limited by the need for massive, high-quality, human-annotated datasets. A team of researchers from Tencent AI Seattle Lab, Washington University, the University of Maryland, and the University of Texas has proposed R-Zero, a framework designed to train reasoning LLMs that can self-evolve without relying on external data labels.

Beyond Human-Curated Data

Most progress in LLM reasoning is tethered to datasets laboriously curated by humans, an approach that is resource-intensive and fundamentally limited by human knowledge. Even label-free methods using LLMs’ own outputs for reward signals still depend on existing collections of unsolved tasks or problems. These dependencies bottleneck scalability and hinder the dream of open-ended AI reasoning beyond human capabilities.

R-Zero: Self-Evolution from Zero Data

R-Zero forges a novel path by entirely removing the reliance on external tasks and labels. Instead, it introduces a co-evolutionary dynamic between two instances of a base model: a Challenger, which proposes new, difficult reasoning questions, and a Solver, which is trained to answer them.

This synergy enables the curriculum—the set of training data—to be self-generated and adapted continuously to the model’s evolving strengths and weaknesses. The process works as follows:

1. Challenger Training: Trained via reinforcement learning (specifically Group Relative Policy Optimization [GRPO]), the Challenger generates diverse, hard-to-solve questions. The reward signal for each question is based on the Solver's uncertainty: it is highest when the Solver's answers are maximally inconsistent (empirical accuracy approaches 50%).
2. Solver Training: The Solver is fine-tuned on the Challenger's curated problems. Pseudo-labels (answers) are determined by majority vote among the Solver's own responses. Only questions whose answers are neither too consistent nor too scattered (i.e., in an informative band) are used for training.
3. Iterative Loop: Challenger and Solver alternate roles, co-evolving over several rounds and progressively improving reasoning abilities without human intervention.
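As a rough sketch of the Solver-training step, the snippet below shows how majority-vote pseudo-labels and an informative-band filter could look in Python. It is a reconstruction from the description above, not the released R-Zero implementation; the band thresholds and the `sample_solver_answers` placeholder are assumptions.

```python
from collections import Counter

def sample_solver_answers(question: str, n_samples: int = 10) -> list[str]:
    """Placeholder: in R-Zero this would sample n answers from the Solver LLM."""
    raise NotImplementedError("hook up your Solver model here")

def build_training_pair(question: str,
                        answers: list[str],
                        band: tuple[float, float] = (0.3, 0.7)):
    """Return (question, pseudo_label) if the question is informative, else None.

    The pseudo-label is the majority-vote answer. The question is kept only if
    the Solver's agreement with that answer falls inside the informative band:
    near-unanimous questions are too easy, near-random ones are too noisy.
    """
    counts = Counter(answers)
    pseudo_label, votes = counts.most_common(1)[0]
    agreement = votes / len(answers)   # empirical Solver consistency
    low, high = band
    if low <= agreement <= high:
        return question, pseudo_label
    return None                        # discard uninformative questions

# Example with hand-written answers standing in for Solver samples:
answers = ["42", "42", "17", "42", "17", "42", "9", "42", "17", "42"]
print(build_training_pair("toy question", answers))  # kept: agreement = 0.6
```

In a full round, the kept (question, pseudo-label) pairs would form the Solver's fine-tuning set, while the same agreement statistic would feed the Challenger's uncertainty-based GRPO reward described above.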

Key Technical Innovations

As summarized above, the framework's main ingredients are:

1. Co-evolution of a Challenger and a Solver initialized from the same base model, with no external tasks or labels.
2. Group Relative Policy Optimization (GRPO) for the Challenger, with an uncertainty-based reward that peaks when the Solver's empirical accuracy on a question approaches 50%.
3. Majority-vote pseudo-labels over the Solver's own responses, combined with a filter that keeps only questions in an informative difficulty band.
4. A repetition penalty and format checks that preserve the diversity and quality of the generated curriculum.

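The repetition penalty and format checks are described only at a high level, so the following is a speculative sketch: a batch of Challenger questions is screened with a toy format check and a string-similarity measure from Python's standard difflib module. R-Zero itself may apply the penalty inside the reward rather than as a hard filter, and will use its own similarity metric.

```python
from difflib import SequenceMatcher

def passes_format_check(question: str) -> bool:
    """Toy format check: non-empty, reasonably short, ends like a question."""
    return 0 < len(question) < 2000 and question.strip().endswith("?")

def filter_repetitions(questions: list[str], max_similarity: float = 0.9) -> list[str]:
    """Drop questions that are near-duplicates of one already kept in the batch."""
    kept: list[str] = []
    for q in questions:
        if not passes_format_check(q):
            continue
        too_similar = any(
            SequenceMatcher(None, q, prev).ratio() > max_similarity for prev in kept
        )
        if not too_similar:
            kept.append(q)
    return kept

batch = [
    "What is 17 * 24?",
    "What is 17 * 24 ?",                                # near-duplicate, dropped
    "Prove that the sum of two odd numbers is even.",   # fails the toy format check
    "How many primes are there below 100?",
]
print(filter_repetitions(batch))
```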
Empirical Performance

Mathematical Reasoning Benchmarks

R-Zero was evaluated using seven rigorous mathematical benchmarks, including AMC, Minerva, MATH-500, GSM8K, Olympiad-Bench, and AIME competitions. Compared with the base model and non-trained Challenger baseline, three iterations of R-Zero led to substantial improvements in reasoning accuracy across all model sizes and architectures (e.g., Qwen3-8B-Base improved from 49.18 to 54.69 average score after three iterations).

General Reasoning Benchmarks

Crucially, R-Zero’s improvements generalize beyond math. Benchmarks including MMLU-Pro, SuperGPQA, and BIG-Bench Extra Hard (BBEH) show significant gains in general-domain reasoning accuracy (e.g., Qwen3-8B-Base’s overall average jumps from 34.49 to 38.73), demonstrating strong transfer effects.

Conclusion

R-Zero marks a major milestone toward self-sufficient, superhuman reasoning LLMs. Its fully autonomous co-evolutionary training pipeline offers not only strong empirical gains in reasoning but a new lens through which to view scalable, data-free AI development. Researchers and practitioners can experiment with this framework today, leveraging open-source tools to pioneer the next era of reasoning-centric language models.


Check out the Paper and GitHub Page.
