Communications of the ACM - Artificial Intelligence | April 23, 21:58
A Rewarding Line of Work

Richard Sutton and Andrew Barto have received the 2024 Turing Award for their pioneering contributions to reinforcement learning (RL), a key AI method that trains neural networks through rewards, much like the chemical boost neurons get from positive behavior. Their work advanced artificial intelligence, playing a crucial role in AlphaGo and in fine-tuning large language models. This article introduces the core concepts and history of reinforcement learning, its deep influence on AI and neuroscience, and the field's future directions and challenges.

💡 Core idea of reinforcement learning: RL trains AI through rewards. When a neural network achieves a goal, it receives a numerical reward; when it fails, it gets a negative value. The machine learns through trial and error, continually adjusting its behavior to earn more reward.

🧠 Key technique: temporal difference learning. The AI predicts at each step what the final reward will be and uses those predictions as immediate rewards. The change in those predictions over time, the temporal difference, provides the reinforcement signal that lets the AI learn more effective strategies.

🧬 Biological connection: RL algorithms bear a striking resemblance to the brain's dopamine system. Dopamine acts as an intermediate reward for a predicted outcome, rather than for the outcome itself, reinforcing beneficial behavior and revealing a deep link between engineering and biology.

🚀 Applications and outlook: RL is already used in AlphaGo and in large language models, and is expected to play a key role in future artificial general intelligence (AGI). Multi-agent reinforcement learning may foster cooperation but could also produce conflict, with implications for fields such as economics and game theory.


As an undergraduate at Stanford University in the mid-1970s, Richard Sutton pored through the school’s library, trying to read everything he could about learning and machine intelligence. What he found disappointed him, because he did not think it really got to the heart of the matter. “It was mostly pattern recognition. It was mostly learning from examples. And I knew from psychology that animals do very different things,” Sutton said. “They do things to get rewards.”

The one person he found whose writing focused on rewards as part of learning was A. Harry Klopf, a scientist studying machine intelligence at the U.S. Air Force Research Laboratory, who believed that neurons were hedonists, seeking rewards. A letter to Klopf led to a lunch meeting, and after Sutton graduated in 1978 with a B.A. in psychology, to a slot in a research project at the University of Massachusetts (UMass) Amherst, designed to test Klopf’s ideas.

Sutton joined Andrew Barto, a postdoctoral fellow who had arrived a year earlier to be part of the research team. With five years of funding from the Air Force and later the National Science Foundation, and no real deliverables beyond a report, Barto and Sutton followed their interests, and eventually wound up developing the modern field of reinforcement learning (RL), a key method in artificial intelligence (AI) that trains neural networks by offering them rewards, much like the chemical boost that neurons get from doing something positive. Thanks to that work, Barto and Sutton are the recipients of the 2024 ACM A.M. Turing Award.

RL is the method that helped Google’s deep neural network AlphaGo beat a human at the game Go in 2016, and is an important component of fine-tuning the large language models that are revolutionizing generative AI. It works on the notion that when a neural network achieves a goal—winning at chess, say—it receives a reward in the form of a numerical value. If it loses, it gets a negative value. The machine learns through trial and error, making different moves, with each being reinforced or penalized.
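That trial-and-error loop can be sketched in a few lines. The toy below is an illustrative assumption, not the AlphaGo or chess systems described above: a two-armed bandit where one action pays +1 more often than the other, and the agent's estimates shift toward the action that earns more reward. All numbers (payoff probabilities, exploration rate, step count) are made up for the example.

```python
import random

random.seed(1)
PAYOFF = [0.3, 0.7]   # hidden probability that each action yields +1 (else -1)
q = [0.0, 0.0]        # the agent's running estimate of each action's value
counts = [0, 0]

for step in range(2000):
    # Explore occasionally; otherwise exploit the current best estimate.
    if random.random() < 0.1:
        action = random.randrange(2)
    else:
        action = 0 if q[0] > q[1] else 1
    # Success is reinforced with +1, failure penalized with -1.
    reward = 1.0 if random.random() < PAYOFF[action] else -1.0
    counts[action] += 1
    q[action] += (reward - q[action]) / counts[action]  # incremental average
```

After enough trials, the estimate for the better action ends up higher, so the agent chooses it most of the time, which is the reinforcement-by-reward loop in miniature.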

The concept of RL differs from that of supervised learning, in which data are labeled and used as examples. RL also differs from unsupervised learning, in which a computer tries to extract useful features to find structures in data. While both methods have proved useful in computing, neither is how living brains learn.

RL had been around for a long time before he and Sutton started working on it, Barto pointed out. It was the subject of AI pioneer Marvin Minsky’s thesis in 1954, and the method that IBM computer scientist Arthur Samuel used to train a computer to play checkers. By the 1970s, though, the idea had fallen out of fashion, with most AI researchers focused on expert systems instead. “We had the luxury of being able to be unfashionable,” Barto said, “so it’s astounding that it’s now very fashionable.”

A key development the pair came up with was temporal difference learning. If a computer is trying to learn to play chess, for example, and the reward comes from winning the game, that is not very useful for figuring out individual moves. Instead, the computer tries to predict at each step what the ultimate reward signal will be, and uses those predictions as rewards. If the probability of winning goes up after a move, that counts as an immediate reward. The change in predictions over time—the temporal difference—provides the reinforcement. The next time the computer plays chess, it can try moves that look similar to the ones that increased its prediction of winning.
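The idea can be made concrete with a minimal tabular TD(0) sketch. This is an illustrative toy, not the chess setting above: a random walk over five states where only reaching the right end pays a reward of 1, and each state's value estimate is nudged by the change in prediction from one step to the next. The state layout, step size, and episode count are assumptions chosen for the example.

```python
import random

ALPHA, GAMMA = 0.1, 1.0   # step size; no discounting in this episodic toy
N = 5                     # states 0..4; states 0 and 4 are terminal
V = [0.0] * N             # predicted eventual reward from each state

def td0_episode():
    s = 2                                     # start in the middle
    while s not in (0, N - 1):
        s2 = s + random.choice([-1, 1])       # random walk left or right
        r = 1.0 if s2 == N - 1 else 0.0       # reward only at the right end
        # The change in prediction, not the final outcome, is the
        # reinforcement signal: r + V(next) - V(current).
        td_error = r + GAMMA * V[s2] - V[s]
        V[s] += ALPHA * td_error
        s = s2

random.seed(0)
for _ in range(5000):
    td0_episode()
# For this walk the true values of states 1..3 are 0.25, 0.5, 0.75,
# and the learned estimates settle near them.
```

Each update propagates reward information backward through earlier states, which is why a move that merely raises the predicted chance of winning gets reinforced immediately rather than only when the game ends.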

Temporal difference learning turned out to have implications outside of AI. Neuroscientists studying the dopamine system in the living brain noticed that it behaved similarly to what Barto and Sutton had described. Dopamine, they discovered, acts as an intermediate reward for a predicted outcome, and not the outcome itself, with the cells that produce it sending signals to other parts of the brain to reinforce beneficial actions. “The direct analogs between these reinforcement learning algorithms and the biological dopamine system is, I think, the biggest interaction between engineering and biology in, I would say, forever,” Sutton said.

The two men developed a wide range of RL algorithms, including policy gradient methods that seek to find the best rules for moving toward a goal, and Monte Carlo methods used to estimate possible outcomes.

Barto earned his B.S. in mathematics in 1970 and his Ph.D. in computer science in 1975, both from the University of Michigan. He eventually became co-director of the Adaptive Networks Laboratory (now the Autonomous Learning Laboratory) at UMass, which since his retirement in 2012 has been run by former students of his.

Sutton was Barto’s first Ph.D. student, earning his degree in 1984. He went from UMass to GTE’s Fundamental Research Lab, where he spent nine years focusing on machine learning. After a period as a self-described “unpaid, part-time ‘Senior Research Scientist’” at UMass, during which he and Barto wrote the first edition of their textbook, he joined the AI department at AT&T Shannon Laboratory. In 2003, he became a professor at the University of Alberta, where he founded the school’s Reinforcement Learning and Artificial Intelligence Laboratory. He is also chief scientific advisor at the Alberta Machine Intelligence Institute, one of three Canadian national AI institutes, which he said has been instrumental to his award-winning work. He is also a research scientist at AI startup Keen Technologies.

Barto and Sutton co-authored Reinforcement Learning: An Introduction, the principal textbook in the field, in 1998, with an expanded second edition released in 2018. The men say other researchers have also played important roles in the field. “It’s not just us,” Barto said. “A lot of people have contributed to aspects of this over the years. I’ve had a number of very wonderful, bright students who built pieces of this whole story.”

A society of learners

Barto expects the field eventually will move toward multi-agent RL, in which a society of neural networks and their individual reward systems interact. That could lead to cooperative arrangements, in which the networks reward each other to encourage progress toward a common goal, but it could also lead to systems where different networks with different goals have conflicts of interest. Such interactions could have implications in complicated fields such as economics and game theory, Barto said.

Sutton said the development of AI has much farther to go, including toward artificial general intelligence, in which a machine can understand anything that a human can. RL will play an important role in such developments, he said.

Barto’s advice to young computer scientists is to do what he and Sutton did: follow their own interests regardless of what the rest of the field thinks of them. “It’s difficult, but I think you have to have your own motivation to pursue things and stick with it if you can,” he said.

As for Sutton, “My recommendation is to write, to challenge your thoughts by writing them down.”

Neither has made any firm decisions on what to do with the $1-million prize they will share as part of the Turing Award. Sutton said he will probably donate his share to the Openmind Research Institute he co-founded, which aims to give young scientists the luxury he and Barto had to explore basic questions that interest them, without any additional responsibilities. Barto said he hopes to use his portion to fund fellowships for graduate students at UMass.

Sutton is optimistic about the future of computer science. “AI will be figured out over the next few decades,” he said. “It will be maybe the greatest intellectual advance ever. So it’s an honor to have contributed a small part to that.”
