Energy-Based Transformers are Scalable Learners and Thinkers

 

This post presents a general approach to developing human-like "System 2" thinking purely from unsupervised learning. The researchers propose Energy-Based Transformers (EBTs), a new class of energy-based models that learn to verify the compatibility between inputs and candidate predictions and recast prediction as gradient-based energy minimization. EBTs work across both discrete (text) and continuous (visual) modalities, perform strongly in both training and inference, and are particularly effective on out-of-distribution data. Compared with existing models, EBTs generalize better across a range of downstream tasks, offering a new paradigm for scaling both the learning and thinking capabilities of models.

🧠 EBTs are a new class of energy-based models that realize System 2 thinking by learning to verify the compatibility (unnormalized probability) between inputs and candidate predictions.

💡 EBTs reframe prediction as gradient-based energy minimization, allowing System 2 thinking to emerge from unsupervised learning alone and making the approach modality- and problem-agnostic.

📈 During training, EBTs scale faster than the dominant Transformer++ approach, achieving up to 35% higher scaling rates with respect to data, batch size, parameters, FLOPs, and depth.

🚀 At inference, System 2 thinking (i.e., extra computation) improves EBT performance on language tasks by 29% more than Transformer++, and EBTs outperform Diffusion Transformers on image denoising while using 99% fewer forward passes.

🌐 EBTs see larger gains from System 2 thinking on out-of-distribution data and surpass existing models on most downstream tasks, demonstrating stronger generalization.

Published on July 8, 2025 1:44 PM GMT

Inference-time computation techniques, analogous to human System 2 Thinking, have recently become popular for improving model performances. However, most existing approaches suffer from several limitations: they are modality-specific (e.g., working only in text), problem-specific (e.g., verifiable domains like math and coding), or require additional supervision/training on top of unsupervised pretraining (e.g., verifiers or verifiable rewards). In this paper, we ask the question "Is it possible to generalize these System 2 Thinking approaches, and develop models that learn to think solely from unsupervised learning?" Interestingly, we find the answer is yes, by learning to explicitly verify the compatibility (unnormalized probability) between inputs and candidate-predictions, and then re-framing prediction problems as optimization with respect to this verifier. Specifically, we train Energy-Based Transformers (EBTs)---a new class of Energy-Based Models (EBMs)---to assign an energy (unnormalized probability) value to every input and candidate-prediction pair, enabling predictions through gradient descent-based energy minimization until convergence. This formulation enables System 2 Thinking to emerge from unsupervised learning, making it modality and problem agnostic. Across both discrete (text) and continuous (visual) modalities, we find EBTs scale faster than the dominant Transformer++ approach during training, achieving an up to 35% higher scaling rate with respect to data, batch size, parameters, FLOPs, and depth. During inference, EBTs improve performance with System 2 Thinking (i.e., extra computation) by 29% more than the Transformer++ on language tasks, and EBTs outperform Diffusion Transformers on image denoising while using 99% fewer forward passes. Further, we find that System 2 Thinking with EBTs yields larger performance improvements on data that is farther out-of-distribution, and that EBTs achieve better results than existing models on most downstream tasks given the same or worse pretraining performance, enabling EBTs to out-generalize existing paradigms. Consequently, EBTs are a flexible and exciting new paradigm for scaling both the learning and thinking capabilities of models.
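The inference procedure the abstract describes, treating the trained model as a verifier and refining a candidate prediction by gradient descent on its energy, can be sketched roughly as follows. This is a minimal illustration under assumptions rather than the authors' implementation: `energy_model` (a network returning a scalar energy per example), the step count, and the step size are hypothetical placeholders.

```python
import torch

def ebt_predict(energy_model, context, init_prediction, steps=16, step_size=0.1):
    """Refine a candidate prediction by gradient descent on the learned
    energy (lower energy = input and prediction are more compatible).
    Each iteration is one "thinking" step; running more steps is the
    inference-time (System 2) computation described in the abstract."""
    prediction = init_prediction.clone().requires_grad_(True)
    for _ in range(steps):
        # Scalar energy for the (context, prediction) pair(s).
        energy = energy_model(context, prediction).sum()
        # Gradient of the energy with respect to the prediction only.
        (grad,) = torch.autograd.grad(energy, prediction)
        with torch.no_grad():
            prediction -= step_size * grad
    return prediction.detach()
```

Because the amount of thinking is just the number of refinement steps, the same trained model can spend more computation on harder inputs, which matches the abstract's observation that System 2 thinking yields larger gains farther out of distribution.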



