MarkTechPost@AI · March 9
Tufa Labs Introduced LADDER: A Recursive Learning Framework Enabling Large Language Models to Self-Improve without Human Intervention

Tufa Labs has introduced LADDER, a framework designed to let large language models (LLMs) self-improve by recursively generating and solving progressively simpler variants of difficult problems. Unlike traditional approaches that depend on human intervention or curated datasets, LADDER uses the model's own capabilities to create a natural difficulty gradient, enabling structured self-learning. The framework was tested on mathematical integration tasks and shown to improve performance substantially, for example raising a Llama 3.2 model's accuracy on undergraduate integration problems from 1% to 82% and surpassing GPT-4o and typical human performance on the MIT Integration Bee qualifying exam.

💡 At its core, LADDER works through three main steps: variant generation, solution verification, and reinforcement learning. Together they systematically decompose complex problems and guide the LLM to learn on its own. Variant generation has the model produce progressively easier versions of a problem, solution verification uses numerical integration to check whether a solution is correct, and reinforcement learning trains the model efficiently with the GRPO algorithm.

📈 Experiments show that a Llama 3.2 3B model trained with LADDER reaches 82% accuracy on a dataset of undergraduate integration problems, far above the 2% obtained with pass@10 sampling, and that increasing the number of generated variants continues to improve performance. On the MIT Integration Bee qualifying exam, a Deepseek-R1 Qwen2.5 7B model outperformed larger models that had not undergone recursive training, demonstrating the effectiveness of structured self-improvement for mathematical reasoning.

💰 LADDER requires no external datasets or human intervention, which lowers the cost of LLM training and scales well. The approach provides a structured way for AI models to improve their reasoning without external supervision and can be extended to areas such as competitive programming, theorem proving, and agent-based problem solving.

Large Language Models (LLMs) benefit significantly from reinforcement learning techniques, which enable iterative improvements by learning from rewards. However, training these models efficiently remains challenging, as they often require extensive datasets and human supervision to enhance their capabilities. Developing methods that allow LLMs to self-improve autonomously without additional human input or large-scale architectural modifications has become a major focus in AI research.

The key challenge in training LLMs is ensuring the learning process is efficient and structured. The training process can stall when models encounter problems beyond their capabilities, leading to poor performance. Traditional reinforcement learning techniques rely on well-curated datasets or human feedback to create effective learning pathways, but this approach is resource-intensive. Also, LLMs struggle to improve systematically without a structured difficulty gradient, making it difficult to bridge the gap between basic reasoning tasks and more complex problem-solving.

Existing approaches to training LLMs primarily involve supervised fine-tuning, reinforcement learning from human feedback (RLHF), and curriculum learning. Supervised fine-tuning requires manually labeled datasets, which can lead to overfitting and limited generalization. RLHF introduces a layer of human oversight, where models are refined based on human evaluations, but this method is costly and does not scale efficiently. Curriculum learning, which gradually increases task difficulty, has shown promise, but current implementations still rely on pre-defined datasets rather than allowing models to generate their own learning trajectories. These limitations highlight the need for an autonomous learning framework that enables LLMs to improve their problem-solving abilities independently.

Researchers from Tufa Labs introduced LADDER (Learning through Autonomous Difficulty-Driven Example Recursion) to overcome these limitations. This framework enables LLMs to self-improve by recursively generating and solving progressively simpler variants of complex problems. Unlike prior methods that depend on human intervention or curated datasets, LADDER leverages the model's own capabilities to create a natural difficulty gradient, allowing for structured self-learning. The research team developed and tested LADDER on mathematical integration tasks, demonstrating its effectiveness in enhancing model performance. By applying LADDER, the researchers enabled a 3-billion-parameter Llama 3.2 model to improve its accuracy on undergraduate integration problems from 1% to 82%, a striking leap in mathematical reasoning capability. The approach was also extended to larger models, such as Qwen2.5 7B Deepseek-R1 Distilled, which achieved 73% accuracy on the MIT Integration Bee qualifying examination, far surpassing models like GPT-4o, which scored only 42%, and typical human performance in the 15-30% range.
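
To make the recursion concrete, the sketch below shows how a single hard problem could be expanded into a tree of progressively easier variants. It is a minimal illustration in Python under stated assumptions: the helper generate_variants is a hypothetical stand-in for an LLM call and is not a name from the paper.

```python
# Minimal sketch of LADDER-style recursive variant generation.
# `generate_variants` is a hypothetical placeholder for an LLM prompt that
# asks the model to rewrite a problem as strictly easier versions of itself.

def generate_variants(problem: str, n: int = 3) -> list[str]:
    """Placeholder for the model call that produces n easier variants
    (e.g. simpler integrands, fewer terms, smaller exponents)."""
    return [f"easier version {i} of: {problem}" for i in range(n)]

def build_difficulty_tree(problem: str, depth: int = 2) -> dict:
    """Recursively expand a hard problem into a tree of easier variants,
    forming the natural difficulty gradient the model later trains on."""
    node = {"problem": problem, "children": []}
    if depth > 0:
        for variant in generate_variants(problem):
            node["children"].append(build_difficulty_tree(variant, depth - 1))
    return node

if __name__ == "__main__":
    tree = build_difficulty_tree("integrate x**2 * exp(x) dx")
    print(len(tree["children"]))           # 3 first-level, easier variants
    print(tree["children"][0]["problem"])  # one of them
```

The intended effect, as described in the article, is a difficulty gradient the model can climb on its own rather than a fixed curriculum curated by humans.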

LADDER follows a structured methodology that allows LLMs to bootstrap their learning by systematically breaking down complex problems. The process involves three primary components: variant generation, solution verification, and reinforcement learning. The variant generation step ensures the model produces progressively easier versions of a given problem, forming a structured difficulty gradient. The solution verification step employs numerical integration methods to assess the correctness of generated solutions, providing immediate feedback without human intervention. Finally, the reinforcement learning component uses Group Relative Policy Optimization (GRPO) to train the model efficiently. This protocol enables the model to learn incrementally by leveraging verified solutions, allowing it to refine its problem-solving strategies systematically. The researchers extended this approach with Test-Time Reinforcement Learning (TTRL), which dynamically generates problem variants during inference and applies reinforcement learning to refine solutions in real time. When applied to the MIT Integration Bee qualifying examination, TTRL boosted model accuracy from 73% to 90%, surpassing OpenAI’s o1 model.
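
The verification step is what removes humans from the loop, so it is worth spelling out. The snippet below is a minimal rendering of the idea, not the authors' exact checker: a candidate closed-form antiderivative F proposed by the model is accepted only if F(b) - F(a) agrees with a numerical integration of the original integrand f over several randomly chosen intervals. The function names and tolerances are assumptions for illustration.

```python
# Sketch of numerical solution verification for integration problems.
# Assumption: the model's answer has already been parsed into a callable F.
import math
import random
from scipy.integrate import quad

def verify_antiderivative(f, F, trials: int = 5, tol: float = 1e-6) -> bool:
    """Accept F as an antiderivative of f if F(b) - F(a) matches the
    numerically integrated value of f on several random intervals."""
    for _ in range(trials):
        a = random.uniform(-2.0, 2.0)
        b = a + random.uniform(0.1, 2.0)
        reference, _err = quad(f, a, b)   # numerical ground truth
        candidate = F(b) - F(a)           # model's closed-form claim
        if not math.isclose(reference, candidate, rel_tol=tol, abs_tol=tol):
            return False
    return True

# Example: the model claims that the integral of x*exp(x) is (x - 1)*exp(x).
f = lambda x: x * math.exp(x)
F = lambda x: (x - 1) * math.exp(x)
print(verify_antiderivative(f, F))  # True -> reward 1; a wrong answer -> 0
```

In GRPO, pass/fail rewards like these are normalized within a group of solutions sampled for the same problem to form advantages, which is what lets the method train without a separate value network.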

When tested on a dataset of 110 undergraduate-level integration problems, a Llama 3.2 3B model trained with LADDER achieved 82% accuracy, compared to 2% accuracy when using pass@10 sampling. The approach also demonstrated scalability, as increasing the number of generated variants led to continued performance improvements. In contrast, reinforcement learning without variants failed to achieve meaningful gains, reinforcing the importance of structured problem decomposition. The researchers observed that LADDER-trained models could solve integrals requiring advanced techniques that were previously out of reach. Applying the methodology to the MIT Integration Bee qualifying examination, a Deepseek-R1 Qwen2.5 7B model trained with LADDER outperformed larger models that did not undergo recursive training, showcasing the effectiveness of structured self-improvement in mathematical reasoning.
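
The pass@10 baseline above is a sampling metric rather than a training method. For readers unfamiliar with it, the snippet below shows the commonly used unbiased pass@k estimator; the article does not state which exact variant was used, so treat this as the conventional definition rather than the authors' protocol.

```python
# pass@k: probability that at least one of k drawn samples solves the problem,
# given n total samples per problem of which c are correct.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0  # fewer wrong samples than draws: a correct one is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=10, c=1, k=10))  # 1.0: with k = n, one correct sample suffices
```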

Key Takeaways from the Research on LADDER include:

- LADDER enables LLMs to self-improve by recursively generating and solving simpler variants of complex problems.
- A Llama 3.2 3B model improved from 1% to 82% on undergraduate integration tasks, demonstrating the effectiveness of structured self-learning.
- Qwen2.5 7B Deepseek-R1 Distilled achieved 73% accuracy on the MIT Integration Bee qualifying examination, outperforming GPT-4o (42%) and exceeding typical human performance (15-30%).
- Test-Time Reinforcement Learning (TTRL) further boosted accuracy from 73% to 90%, surpassing OpenAI's o1 model.
- LADDER does not require external datasets or human intervention, making it a cost-effective and scalable solution for LLM training.
- Models trained with LADDER demonstrated superior problem-solving capabilities compared to reinforcement learning without structured difficulty gradients.
- The framework provides a structured way for AI models to refine their reasoning skills without external supervision.
- The methodology can be extended to competitive programming, theorem proving, and agent-based problem solving.

Check out the Paper. All credit for this research goes to the researchers of this project.



