MarkTechPost@AI · April 23, 13:45
LLMs Can Now Learn without Labels: Researchers from Tsinghua University and Shanghai AI Lab Introduce Test-Time Reinforcement Learning (TTRL) to Enable Self-Evolving Language Models Using Unlabeled Data

Researchers from Tsinghua University and Shanghai AI Lab have proposed a new method, Test-Time Reinforcement Learning (TTRL), that lets large language models (LLMs) learn and improve without human annotations. TTRL applies reinforcement learning during inference, leveraging the intrinsic prior knowledge of the pre-trained model to generate pseudo-rewards from unlabeled test data. The method yields substantial gains on mathematical reasoning tasks; for example, Qwen2.5-Math-7B's accuracy on AIME 2024 improves by 159.3%. TTRL thus offers a scalable way for LLMs to keep evolving in real-world settings.

🧠 At its core, TTRL uses the model's own outputs for self-supervised learning, applying reinforcement learning at inference time without relying on human-annotated data.

🗳️ TTRL follows a two-stage procedure: first, it estimates a label by majority voting, sampling multiple outputs for the same question and treating the most frequent prediction as the estimated label; second, it assigns each sampled output a reward based on whether it matches the estimated label and optimizes the model with a gradient-based reinforcement learning algorithm.

📈 Experiments show that TTRL performs strongly on mathematical reasoning tasks; for example, Qwen2.5-Math-7B achieves a marked performance gain on AIME 2024, and the improvements generalize well across tasks.

💡 TTRL is broadly applicable: it is compatible with different reinforcement learning algorithms and scales with model size, opening up new possibilities for the continual improvement of LLMs across a range of application scenarios.

Despite significant advances in reasoning capabilities through reinforcement learning (RL), most large language models (LLMs) remain fundamentally dependent on supervised data pipelines. RL frameworks such as RLHF have pushed model alignment and instruction-following performance but rely heavily on human feedback and labeled datasets. As LLMs are increasingly applied in dynamic environments—ranging from educational settings to scientific workflows—they are required to generalize beyond curated training data.

However, existing models often exhibit performance gaps when confronted with distribution shifts or novel reasoning tasks. While techniques like Test-Time Scaling (TTS) and Test-Time Training (TTT) have been proposed to mitigate this, the absence of reliable reward signals during inference poses a core challenge for deploying RL in unsupervised settings.

Test-Time Reinforcement Learning (TTRL): Leveraging Model Priors for Self-Adaptation

Researchers from Tsinghua University and Shanghai AI Lab introduced Test-Time Reinforcement Learning (TTRL). TTRL is a training framework that applies RL during inference, using only unlabeled test data. It leverages the intrinsic priors of pre-trained language models to estimate pseudo-rewards through majority voting across sampled outputs.

Instead of relying on explicit labels, TTRL constructs reward functions by aggregating multiple model-generated responses to a given query. A consensus answer, obtained via majority voting, is treated as a pseudo-label. Model responses that align with this pseudo-label are positively reinforced. This formulation transforms test-time inference into an adaptive, self-supervised learning process, allowing LLMs to improve over time without additional supervision.
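
To make the reward construction concrete, here is a minimal sketch of majority-vote pseudo-labeling in Python. It is illustrative only: the helper names are hypothetical (not taken from the paper's released code), and it assumes each sampled completion has already been reduced to a final answer string.

```python
from collections import Counter
from typing import List

def majority_vote_label(answers: List[str]) -> str:
    """Return the most frequent final answer among sampled responses (the pseudo-label)."""
    return Counter(answers).most_common(1)[0][0]

def pseudo_rewards(answers: List[str]) -> List[float]:
    """Reward 1.0 for responses that agree with the consensus answer, 0.0 otherwise."""
    label = majority_vote_label(answers)
    return [1.0 if a == label else 0.0 for a in answers]

# Example: final answers extracted from five sampled completions for one query
sampled = ["42", "42", "41", "42", "7"]
print(majority_vote_label(sampled))  # -> "42"
print(pseudo_rewards(sampled))       # -> [1.0, 1.0, 0.0, 1.0, 0.0]
```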

TTRL has a two-stage approach:

1. Label estimation: the model samples multiple responses to the same query, and the most frequent answer across those samples (the majority vote) is treated as the estimated label.
2. Reward assignment: each sampled response receives a reward based on whether it matches the estimated label, and the model is optimized with a gradient-based RL algorithm.

This approach is notable for its simplicity and compatibility with standard RL methods. The reward function, though approximate, provides sufficient learning signal when aggregated over multiple samples. Experimental setups used temperature-controlled sampling (typically temperature = 1.0), with 64 samples for voting and 16 subsampled responses for training updates. No ground-truth labels are involved at any stage.
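
Putting the pieces together, one test-time update for a single query might look like the following sketch. This is a simplified illustration under stated assumptions, not the authors' implementation: `sample_fn` and `update_fn` are hypothetical placeholders for the model's sampler and for whichever standard RL optimizer is plugged in (e.g., a PPO- or GRPO-style step), and the binary agreement reward mirrors the description above.

```python
import random
from collections import Counter
from typing import Callable, List, Tuple

def ttrl_step(
    prompt: str,
    # sample_fn(prompt, n, temperature) -> list of (completion_text, extracted_answer)
    sample_fn: Callable[[str, int, float], List[Tuple[str, str]]],
    # update_fn(prompt, completions, rewards) -> one RL policy update on the model
    update_fn: Callable[[str, List[str], List[float]], None],
    n_vote: int = 64,         # samples used for majority voting (as in the setup above)
    n_train: int = 16,        # subsampled responses used for the training update
    temperature: float = 1.0,
) -> None:
    # Stage 1: sample many responses and take the majority answer as the pseudo-label.
    samples = sample_fn(prompt, n_vote, temperature)
    pseudo_label = Counter(ans for _, ans in samples).most_common(1)[0][0]

    # Stage 2: subsample responses, reward agreement with the pseudo-label,
    # and pass the rewards to a standard RL update. No ground-truth labels are used.
    subset = random.sample(samples, k=min(n_train, len(samples)))
    completions = [text for text, _ in subset]
    rewards = [1.0 if ans == pseudo_label else 0.0 for _, ans in subset]
    update_fn(prompt, completions, rewards)
```

With stub callables for `sample_fn` and `update_fn`, the function runs as written; in a real setting both would wrap the same underlying policy model being adapted at test time.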

Empirical Findings across Mathematical Reasoning Tasks

TTRL was evaluated on three mathematical benchmarks: AIME 2024, AMC, and MATH-500. Gains were consistent across both smaller and larger models; for instance, Qwen2.5-Math-7B improved its accuracy on AIME 2024 by 159.3% relative to the base model.

These gains demonstrate that TTRL supports model improvement even in the absence of supervised training signals. Moreover, TTRL often outperforms the upper bound implied by its own training signal—i.e., the accuracy of the majority-voted predictions. This suggests a self-reinforcing learning loop that can extract richer supervision from noisy consensus signals.

Additional analyses showed that TTRL generalizes beyond the dataset it was applied to. When trained on one benchmark and evaluated on others, performance improvements persisted. This cross-task transfer indicates that TTRL does not lead to narrow overfitting but supports broader generalization.

Conclusion: Toward Self-Adaptive and Label-Free Learning

TTRL represents a novel shift in how reinforcement learning can be applied to LLMs in real-world settings. By reusing the model’s own generations as a proxy for supervision, it removes the need for expensive human annotations while enabling continual adaptation. The approach scales naturally with model size, is compatible with different RL algorithms, and shows promising robustness across tasks of varying difficulty.

While this study focuses on mathematical reasoning, the underlying ideas—self-estimated supervision, test-time adaptation, and reinforcement learning without labels—may generalize to other domains. As language models increasingly encounter tasks beyond their pre-training distribution, frameworks like TTRL offer a scalable path forward.

Further exploration is needed to understand the theoretical convergence properties of TTRL and to evaluate its applicability in interactive or multi-agent scenarios. Nonetheless, TTRL provides a technically sound and computationally efficient foundation for enabling LLMs to evolve continuously from their own outputs.


Check out the Paper and GitHub Page.
