MarkTechPost@AI · May 10, 07:25
AI That Teaches Itself: Tsinghua University’s ‘Absolute Zero’ Trains LLMs With Zero External Data

Researchers from Tsinghua University and other institutions propose a reinforcement learning paradigm called "Absolute Zero" to address the dependence of existing reinforcement learning methods on human-annotated data. The approach has the model autonomously generate and solve code reasoning tasks, using a code executor to verify answers, enabling self-learning without any external data. The researchers introduce the Absolute Zero Reasoner (AZR), which self-evolves its training curriculum and reasoning ability. Experiments show that AZR achieves the best performance at the 7B scale and surpasses models trained on human-annotated data in code reasoning. The study also identifies safety risks, indicating that continued oversight and improvement are needed.

💡 Absolute Zero is a new reinforcement learning paradigm that lets a model autonomously generate and solve tasks without any external data, improving its reasoning ability. It aims to remove the dependence of conventional reinforcement learning on large volumes of human-annotated data.

🖥️ The Absolute Zero Reasoner (AZR) is the concrete implementation of this paradigm. It uses a code executor to validate proposed code reasoning tasks and verify answers, providing a unified, verifiable source of reward for open-ended learning. AZR can be implemented effectively at different model scales and is compatible with a variety of model classes.

🚀 Experimental results show that Absolute Zero Reasoner-Coder-7B achieves state-of-the-art performance in the 7B overall-average and coding-average categories, surpassing the previous best model by 1.8 absolute percentage points even though both the math and code reasoning benchmarks are entirely out-of-distribution. On coding, it outperforms models trained with expert-curated human data by 0.3 absolute percentage points, despite never accessing such data itself.

⚠️ The study also flags safety concerns: several worrying instances of chain-of-thought (CoT) reasoning were observed with the Llama-3.1-8B model, indicating that even as human intervention is reduced, ongoing oversight is still needed to address lingering safety risks.

LLMs have shown advancements in reasoning capabilities through Reinforcement Learning with Verifiable Rewards (RLVR), which relies on outcome-based feedback rather than imitating intermediate reasoning steps. Current RLVR methods face critical scalability challenges because they depend heavily on manually curated collections of questions and answers for training. As reasoning models advance, constructing large-scale, high-quality datasets becomes increasingly unsustainable, mirroring the bottlenecks identified in LLM pretraining. Moreover, exclusive dependence on human-designed tasks may constrain AI systems’ capacity for autonomous learning and development, especially as they evolve beyond human intellectual capabilities.

Researchers have explored various approaches to enhance LLM reasoning capabilities. STaR pioneered self-bootstrapping using expert iteration and rejection sampling of outcome-verified responses to improve CoT reasoning. The o1 model deployed this concept at scale, achieving state-of-the-art results, and R1 later became the first open-weight model to match or surpass o1’s performance by introducing the “zero” setting where RL is applied directly to the base LLM. Further, self-play paradigms have evolved from Schmidhuber’s early two-agent setups to more complex implementations like AlphaGo and AlphaZero. Recent methods such as SPIN, Self-Rewarding Language Models, SPC, and SPAG have applied self-play to language models for alignment and reasoning.

Researchers from Tsinghua University, Beijing Institute for General Artificial Intelligence, and Pennsylvania State University have proposed an RLVR paradigm called Absolute Zero to enable a single model to autonomously generate and solve tasks that maximize its own learning progress without relying on any external data. Under this method, researchers have introduced the Absolute Zero Reasoner (AZR) that self-evolves its training curriculum and reasoning ability through a code executor that validates proposed code reasoning tasks and verifies answers, providing a unified source of verifiable reward to guide open-ended yet grounded learning. AZR can be effectively implemented across different model scales and remains compatible with various model classes, suggesting broad applicability.
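To make the reward mechanism concrete, here is a minimal Python sketch of how a code executor can serve as a unified verifiable reward: a proposed program is executed on a proposed input, and the solver’s answer is checked against the execution result. The function names (`run_program`, `verifiable_reward`) and the example task are illustrative assumptions, not AZR’s actual implementation.

```python
# Minimal sketch (not the authors' code): a Python executor as the single
# source of verifiable reward for self-proposed code reasoning tasks.

def run_program(program_src: str, input_value):
    """Execute a proposed program defining `f` and return f(input_value)."""
    namespace = {}
    exec(program_src, namespace)          # the proposer must define a function `f`
    return namespace["f"](input_value)

def verifiable_reward(program_src: str, input_value, predicted_output) -> float:
    """Binary reward: 1.0 if the solver's prediction matches execution, else 0.0."""
    try:
        gold_output = run_program(program_src, input_value)
    except Exception:
        return 0.0                        # proposals that fail to execute earn no reward
    return 1.0 if predicted_output == gold_output else 0.0

# Example: a self-proposed "predict the output" task.
program = "def f(x):\n    return sorted(set(x))"
print(verifiable_reward(program, [3, 1, 3, 2], [1, 2, 3]))  # -> 1.0
```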

LLMs provide an ideal framework for implementing AZR in multitask learning contexts. During each online rollout iteration of the Absolute Zero objective, AZR proposes new reasoning tasks conditioned on the task type and past self-generated examples, with explicit prompting to generate diverse tasks, and then attempts to solve them, receiving grounded feedback on its responses. AZR uses a code executor as both a flexible interface and a verifiable environment, enabling automatic construction, execution, and validation of code reasoning tasks. The AZR algorithm comprises buffer initialization, task-proposal inputs and buffer management, valid task construction, solution validation, and advantage estimation via Task-Relative REINFORCE++.
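The sketch below outlines one plausible shape of that propose-validate-solve loop. All names (`propose_task`, `solve_task`, `update_policy`) and the stub rollout are placeholders under stated assumptions; the paper’s proposer reward and Task-Relative REINFORCE++ baselines are only hinted at in the comments.

```python
# Illustrative sketch of an AZR-style self-play rollout (proposer and solver are
# the same model; here they are stubbed out). Not the paper's implementation.
import random

def run_program(program_src, input_value):
    namespace = {}
    exec(program_src, namespace)
    return namespace["f"](input_value)

buffer = [("def f(x):\n    return x + 1", 3)]   # seed (program, input) pairs

def rollout_step(propose_task, solve_task, update_policy, k=4):
    # 1) Propose: condition on past self-generated examples to encourage diversity.
    refs = random.sample(buffer, min(k, len(buffer)))
    program, task_input = propose_task(refs)

    # 2) Construct/validate: only tasks that execute successfully are kept.
    try:
        gold = run_program(program, task_input)
    except Exception:
        update_policy(role="proposer", reward=0.0)   # invalid proposal, no reward
        return
    buffer.append((program, task_input))
    # (In the paper, the proposer is also rewarded for how learnable its tasks are;
    #  that term is omitted in this sketch.)

    # 3) Solve: the model predicts the output; the code executor grades it.
    pred = solve_task(program, task_input)
    r_solve = 1.0 if pred == gold else 0.0

    # 4) Update: advantages use per-role, per-task-type baselines in the spirit of
    #    Task-Relative REINFORCE++; here the policy update is just a stub.
    update_policy(role="solver", reward=r_solve)

# Stub usage: a fixed proposal and a solver that simply executes the code.
rollout_step(
    propose_task=lambda refs: ("def f(x):\n    return x * 2", 5),
    solve_task=lambda prog, x: run_program(prog, x),
    update_policy=lambda role, reward: print(role, reward),
)
```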

The Absolute Zero Reasoner-Coder-7B has achieved state-of-the-art performance in the 7B overall average and coding average categories, surpassing previous best models by 1.8 absolute percentage points despite being entirely out-of-distribution for both math and code reasoning benchmarks. It outperforms models trained with expert-curated human data in coding by 0.3 absolute percentage points while never accessing such data itself. Scaling analysis reveals that AZR delivers greater gains on larger models, with the 7B and 14B models continuing to improve beyond 200 training steps while the 3B model plateaus. Out-of-distribution performance gains increase with model size: +5.7, +10.2, and +13.2 for 3B, 7B, and 14B, respectively.

In conclusion, researchers introduced the Absolute Zero paradigm to address data limitations in existing RLVR frameworks. Under this method, researchers present AZR, which trains models to propose and solve code-related reasoning tasks grounded by a code executor. However, there is a limitation regarding safety management in self-improving systems. The team observed several instances of safety-concerning CoT reasoning from the Llama-3.1-8B model, termed “uh-oh moments.” The findings indicate that while the Absolute Zero paradigm reduces human intervention needs in task curation, ongoing oversight remains necessary to address lingering safety concerns, highlighting a critical direction for future research.


Check out the Paper, Model on Hugging Face and GitHub Page. Also, don’t forget to follow us on Twitter.

