Artificial Fintelligence · June 5, 23:40
Reinforcement learning and general intelligence

The article explores the key role of reinforcement learning (RL) in artificial intelligence, particularly on the road to artificial general intelligence (AGI), and the challenges it faces. The author argues that although RL is the most promising path to AGI, applying it is difficult: the exploration/exploitation tradeoff, dependence on the quality of the reward signal, and a tendency to overfit. The article emphasizes RL's potential for tackling complex problems and discovering new knowledge, points out the technical obstacles that must be overcome in practice, and calls on research labs to invest more in fundamental RL research.

💡 Reinforcement learning (RL) is the branch of machine learning focused on algorithms that discover new knowledge. RL agents take actions in an environment and systematically explore to find an optimal policy.

🤔 The core challenge of RL is the exploration/exploitation tradeoff. The agent must decide between trying new actions (exploration) and choosing the action it currently believes is best (exploitation), which is especially difficult in complex environments.

📚 Large language models bring an advantage to RL: their strong priors make them far more sample efficient. However, RL applied to language models has so far focused mainly on behavior modification rather than knowledge discovery.

⚠️ RL's success depends heavily on the quality of the reward signal. Complex, overfit reward models cause problems. Systems that use verifiable rewards, like the Allen Institute's Tulu 3, appear to be the way forward.

🚀 The AlphaGo example perfectly illustrates RL's paradox: RL is the only viable way to reach superhuman performance, yet it also demands enormous resources and effort. We now face a similar challenge with language models.

A disclaimer: nothing that I say here represents any organization other than Artificial Fintelligence. These are my views, and mine alone, although I hope that you share them after reading.

Frontier labs are spending, in the aggregate, hundreds of millions of dollars annually on data acquisition, leading to a number of startups selling data to them (Mercor, Scale, Surge, etc.). That novel data, combined with reinforcement learning (RL) techniques, represents the clearest avenue to improvement, and to AGI. I am firmly convinced that scaling up RL techniques will lead to excellent products, and, eventually, AGI. A primary source of improvement over the last decade has been scale, as the industry has discovered one method after another that allows us to convert money into intelligence. First, bigger models. Then, more data (thereby making Alexandr Wang very rich). And now, RL.


RL is the subfield of machine learning that studies algorithms which discover new knowledge. Reinforcement learning agents take actions in environments to systematically discover the optimal strategy (called a policy). An example environment is Atari: you have an environment (the Atari game) where the agent can take actions (moving in different directions, pressing the “fire” button) and the agent receives a scalar reward signal that it wants to maximize (the score). Without providing any data on how to play Atari games, RL algorithms are able to discover policies which get optimal scores in most Atari games.
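
To make that loop concrete, here is a minimal sketch of the agent-environment interaction described above, using the Gymnasium API with a random placeholder policy. This assumes `gymnasium` and the ale-py Atari environments are installed; it is an illustration of the interface, not a training recipe.

```python
# Minimal agent-environment loop (illustrative sketch; assumes `gymnasium[atari]`).
import gymnasium as gym

env = gym.make("ALE/Breakout-v5")          # the environment (an Atari game)
obs, info = env.reset(seed=0)
total_reward = 0.0
for _ in range(1_000):
    action = env.action_space.sample()     # placeholder policy: act at random
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward                 # the scalar signal the agent wants to maximize
    if terminated or truncated:
        obs, info = env.reset()
print(f"Reward collected by a random policy over 1,000 steps: {total_reward}")
env.close()
```

An RL algorithm replaces the random `env.action_space.sample()` call with a learned policy and updates that policy to increase the reward it collects.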

The key problem in RL is the exploration/exploitation tradeoff. Each time the agent is asked to choose an action, it has to decide between the action it currently thinks is best ("exploiting") and a new action that might be better ("exploring"). This is an extremely difficult decision to get right. Consider a complicated game like Starcraft, or Dota. For any individual situation that the agent is in, how can we know what the optimal action is? It's only after making an entire game's worth of decisions that we are able to know if our strategy is sound, and it is only after playing many games that we are able to conclude how good we are in comparison to other players.
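
The simplest concrete handle on this tradeoff is epsilon-greedy action selection on a multi-armed bandit. The sketch below is a generic textbook illustration (the epsilon value and reward estimates are arbitrary), not a method from this article.

```python
# Epsilon-greedy: explore with probability epsilon, otherwise exploit the best estimate.
import random

def epsilon_greedy(q_values: list[float], epsilon: float = 0.1) -> int:
    if random.random() < epsilon:
        return random.randrange(len(q_values))                       # explore
    return max(range(len(q_values)), key=q_values.__getitem__)       # exploit

def update_estimate(q_values: list[float], counts: list[int], arm: int, reward: float) -> None:
    """Incremental mean of observed rewards for the chosen arm."""
    counts[arm] += 1
    q_values[arm] += (reward - q_values[arm]) / counts[arm]
```

In a bandit there is a single decision per round; in Starcraft or Dota the credit for a whole game's worth of decisions only arrives at the end, which is what makes the tradeoff so much harder.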

Large language models help significantly here, as they are much, much more sample efficient because they have incredibly strong priors. By encoding a significant fraction of human knowledge, the models are able to behave well in a variety of environments before they’ve actually received any training data.

When it comes to language modelling, most use of RL to date has been for RLHF, which is mostly used for behaviour modification. As there is (typically) no live data involved, RLHF isn’t “real” RL and does not face the exploration/exploitation tradeoff, nor does it allow for the discovery of new knowledge.

Knowledge discovery is the main unsolved problem in modern machine learning. While we've become proficient at supervised learning, we haven't yet cracked the code on how to systematically discover new knowledge, especially superhuman knowledge. For AlphaStar, for instance, DeepMind spent an enormous amount of compute discovering new policies, as it is an extraordinarily hard problem to find good strategies in Starcraft without prior knowledge.

Therein lies the rub: RL is simultaneously the most promising and the most challenging approach we have. DeepMind invested billions of dollars in RL research with little commercial success to show for it (the Nobel prize, for instance, was for AlphaFold, which didn't use RL). While RL is often the only solution for certain hard problems, it is notoriously difficult to implement effectively. Consider a game with discrete turns, like Chess or Go. In Go, you have on average 250 different choices at each turn, and the game lasts for roughly 150 moves. Consequently, the game tree has approximately 250^150 nodes, or ~10^360. If you search randomly (which is how many RL algorithms explore), it is exceedingly difficult to find a reasonable trajectory in the game, which is why AlphaZero-style self-play is needed, or an AlphaGo-style supervised learning phase. In the LLM setting, where typical vocabulary sizes are in the tens to hundreds of thousands of tokens and sequence lengths can be in the tens to hundreds of thousands, the problem is much worse. The result is a situation where RL is both necessary and yet should be considered a last resort.
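
The arithmetic is easy to check, and the LLM analogue makes the point starkly. The vocabulary and sequence-length figures below are illustrative round numbers, not measurements of any particular model.

```python
# Back-of-the-envelope branching comparison, in log10 to avoid astronomically large integers.
import math

go_depth, go_branching = 150, 250
print(f"Go game tree: ~10^{go_depth * math.log10(go_branching):.0f} leaves")    # ~10^360

vocab_size, seq_len = 100_000, 10_000   # assumed round numbers for an LLM
print(f"LLM sampling tree: ~10^{seq_len * math.log10(vocab_size):.0f} leaves")  # ~10^50000
```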

Put differently, one way to think of deep learning is that it’s all about learning a good, generalizable, function approximation. In deep RL, we are approximating a value function, i.e. a function that tells us exactly how good or how bad a given state of the world would be. To improve the accuracy of the value function, we need to be able to receive data with non-trivial answers. If all we receive is the same reward (and it’s really bad), we can’t do anything. Consider a coding assistant, like Cursor’s newly released background agent. One way to train the agent would be to give it a reward of 1 if it returns code which is merged into a pull request, and 0 otherwise. If you took a randomly initialized network, it would output gibberish, and would thus always receive a signal of 0. Once you get a model that is actually good enough to sometimes be useful to users, you can start getting meaningful signal and rapidly improve.
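
For intuition, a generic sketch of what approximating a value function looks like in practice is a TD(0) regression target. This is textbook machinery, assuming PyTorch and a made-up 128-dimensional state; it is not anyone's production training code.

```python
# TD(0) value-function update: regress V(s) toward r + gamma * V(s').
import torch
import torch.nn as nn

value_net = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(value_net.parameters(), lr=1e-3)
gamma = 0.99

def td_update(state: torch.Tensor, reward: float, next_state: torch.Tensor, done: float) -> float:
    with torch.no_grad():
        target = reward + gamma * value_net(next_state) * (1.0 - done)
    loss = (value_net(state) - target).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # If every reward is 0 (e.g. a policy whose code is never merged), the only
    # fixed point is V(s) = 0 everywhere: there is no signal to distinguish
    # good states from bad ones.
    return loss.item()
```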

As an illustrative example, I have a friend who works at a large video game publisher doing RL research for games (think: EA, Sony, Microsoft, etc.). He consults with teams at the publisher's studios that want to use RL. Despite having more than two decades of RL experience, his first response is usually to ask whether they've tried everything else, because it's so difficult to get RL to work in practical settings.

The great question with reinforcement learning and language models is whether we'll see results transfer to other domains, as we have seen with next-token prediction. The great boon of autoregressive language models has been that they generalize well: you can train a model to predict the next token and it learns to generate text that is useful in a number of other situations. It is absolutely not clear whether that will be the case with models trained largely with RL, as RL policies tend to be overly specialized to the exact problem they were trained on. AlphaZero notoriously had problems with catastrophic forgetting; a paper that I wrote while at DeepMind showed that simple exploits existed which could consistently beat AlphaZero, and this has been replicated in a number of other papers. To get around this, many RL algorithms require repeatedly revisiting old training data via replay buffers, which is awkward and unwieldy.
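
For reference, a replay buffer is a simple data structure that stores past transitions so they can be re-sampled during training. The sketch below is a generic illustration; capacity and batch size are arbitrary.

```python
# Generic experience replay buffer: store transitions, sample random minibatches.
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity: int = 100_000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions are evicted first

    def add(self, state, action, reward, next_state, done) -> None:
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size: int = 32):
        """Re-sample old experience so the learner keeps seeing past data."""
        return random.sample(self.buffer, batch_size)

    def __len__(self) -> int:
        return len(self.buffer)
```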

With LLMs, this is a major problem. Setting aside RL, in the open research space we see a lot of VLMs that are trained separately from their LLM equivalents. DeepSeek-VL2 is a separate family of models from V3, which is text-only, despite all the major closed-source models accepting multimodal inputs. The main reason for the separation is that, in the published literature, adding multimodal capabilities to LLMs sacrifices pure-text performance. When we add RL, we should expect the problem to become much worse, and expect more research to be dedicated to improving the inherent tradeoffs here.

In my experience as a practitioner, RL lives or dies based on the quality of the reward signal. One of the most able RL practitioners I know, Adam White, begins all of his RL projects by first learning to predict the reward signal, and only then trying to optimize it (first predict, then control). Systems that optimize complex, overfit reward models will struggle. Systems like the Allen Institute's Tulu 3, which used verifiable rewards to do RL, seem like the answer, and provide motivation for the hundreds of millions of dollars that the frontier labs are spending on acquiring data.
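
For intuition, a verifiable reward is just a programmatic check against ground truth rather than a score from a learned reward model. The sketch below is a hypothetical example in that spirit; the answer-extraction heuristic is made up, and Tulu 3's actual implementation differs.

```python
# Hypothetical verifiable reward: 1.0 only if the model's final answer matches
# a checkable ground truth, 0.0 otherwise. No learned reward model to overfit.
import re

def math_answer_reward(completion: str, ground_truth: str) -> float:
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)   # naive: take the last number
    if not numbers:
        return 0.0
    return 1.0 if numbers[-1] == ground_truth else 0.0

print(math_answer_reward("The sum is 7 + 35 = 42, so the answer is 42.", "42"))  # 1.0
print(math_answer_reward("I am not sure.", "42"))                                # 0.0
```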

The development of AlphaGo illustrates this paradox perfectly: RL was the only viable way to reach superhuman play, yet it demanded enormous resources and effort, from the initial supervised learning phase on human games to the massive amounts of self-play that followed.

We're now facing a similar situation with language models:

- We've largely exhausted the easily accessible training data.

- We need to discover new knowledge to progress further.

- For superhuman knowledge in particular, we can't rely on human supervision, by definition.

- RL appears to be the only framework general enough to handle this challenge.

In short, this is a call for research labs to start investing in fundamental RL research again, and in particular, to finally make progress on the exploration problem.


[1] I actually can't think of any successful applications of MCTS to solve real-world problems. Other than the AlphaGo/AlphaZero/MuZero line of work, it doesn't seem to have led to anything, which 2017 Finbarr would have found extremely surprising.
