As an undergraduate at Stanford University in the mid-1970s, Richard Sutton pored through the school’s library, trying to read everything he could about learning and machine intelligence. What he found disappointed him, because he did not think it really got to the heart of the matter. “It was mostly pattern recognition. It was mostly learning from examples. And I knew from psychology that animals do very different things,” Sutton said. “They do things to get rewards.”
The one person he found whose writing focused on rewards as part of learning was A. Harry Klopf, a scientist studying machine intelligence at the U.S. Air Force Research Laboratory, who believed that neurons were hedonists, seeking rewards. A letter to Klopf led to a lunch meeting, and after Sutton graduated in 1978 with a B.A. in psychology, to a slot in a research project at the University of Massachusetts (UMass) Amherst, designed to test Klopf’s ideas.
Sutton joined Andrew Barto, a postdoctoral fellow who had arrived a year earlier to be part of the research team. With five years of funding from the Air Force and later the National Science Foundation, and no real deliverables beyond a report, Barto and Sutton followed their interests and eventually developed the modern field of reinforcement learning (RL), a key method in artificial intelligence (AI) that trains neural networks by offering them rewards, much like the chemical boost that neurons get from doing something positive. Thanks to that work, Barto and Sutton are the recipients of the 2024 ACM A.M. Turing Award.
RL is the method that helped Google DeepMind’s AlphaGo program defeat a world-champion human player at the game Go in 2016, and it is an important component of fine-tuning the large language models that are revolutionizing generative AI. It rests on the idea that when a neural network achieves a goal—winning at chess, say—it receives a reward in the form of a numerical value. If it loses, it gets a negative value. The machine learns through trial and error, trying different moves, with each one reinforced or penalized by the reward it produces.
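In spirit, that learning loop is simple enough to sketch in a few lines of Python. The example below is an illustration only, not Barto and Sutton’s actual algorithms: the three candidate “moves” and their hidden win probabilities are invented, and the learner just averages the rewards each move earns while occasionally exploring.

```python
import random

# Hidden from the learner; invented numbers for this sketch only.
true_win_prob = {"a": 0.2, "b": 0.7, "c": 0.5}

move_values = {m: 0.0 for m in true_win_prob}  # learned reward estimates
counts = {m: 0 for m in true_win_prob}         # times each move was tried

def play_once():
    # Explore occasionally; otherwise exploit the best-looking move.
    if random.random() < 0.1:
        move = random.choice(list(move_values))
    else:
        move = max(move_values, key=move_values.get)
    # Reward of +1 for a win, -1 for a loss.
    reward = 1.0 if random.random() < true_win_prob[move] else -1.0
    counts[move] += 1
    # Incrementally average the rewards observed for this move.
    move_values[move] += (reward - move_values[move]) / counts[move]

for _ in range(10_000):
    play_once()

print(move_values)  # move "b" should end up with the highest estimate
```

Over many trials, the estimates drift toward the hidden probabilities, and the learner comes to favor the move most likely to win.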
RL differs from supervised learning, in which labeled data serve as training examples, and from unsupervised learning, in which a computer extracts features to find structure in unlabeled data. While both of those methods have proved useful in computing, neither is how living brains learn.
RL had been around for a long time before he and Sutton started working on it, Barto pointed out. It was the subject of AI pioneer Marvin Minsky’s thesis in 1954, and the method that IBM computer scientist Arthur Samuel used to train a computer to play checkers. By the 1970s, though, the idea had fallen out of fashion, with most AI researchers focused on expert systems instead. “We had the luxury of being able to be unfashionable,” Barto said, “so it’s astounding that it’s now very fashionable.”
A key development the pair came up with was temporal difference learning. If a computer is trying to learn to play chess, for example, and the reward comes only from winning the game, that signal is not very useful for evaluating individual moves. Instead, the computer predicts at each step what the ultimate reward will be, and learns from changes in those predictions. If the probability of winning goes up after a move, that counts as an immediate reward. The change in predictions over time—the temporal difference—provides the reinforcement. The next time the computer plays chess, it can try moves that look similar to the ones that increased its prediction of winning.
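Reduced to code, the update is strikingly compact. The sketch below is a minimal TD(0)-style illustration under assumed names: values holds the predicted eventual reward for each game state, and the learning rate and discount factor are placeholder choices, not anything specific to Barto and Sutton’s papers.

```python
ALPHA = 0.1    # learning rate: how far to shift each prediction
GAMMA = 0.99   # discount factor for future reward

values = {}    # state -> predicted eventual reward (e.g., chance of winning)

def td_update(state, reward, next_state):
    """Shift the prediction for `state` toward reward + GAMMA * V(next_state)."""
    v = values.get(state, 0.0)
    v_next = values.get(next_state, 0.0)
    # The temporal difference: how much the prediction changed after one move.
    td_error = reward + GAMMA * v_next - v
    values[state] = v + ALPHA * td_error
```

A move that raises the predicted chance of winning yields a positive td_error, nudging the value of the state that preceded it upward, exactly the intermediate-reward effect described above.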
Temporal difference learning turned out to have implications outside of AI. Neuroscientists studying the dopamine system in the living brain noticed that it behaved similarly to what Barto and Sutton had described. Dopamine, they discovered, acts as an intermediate reward for a predicted outcome, and not the outcome itself, with the cells that produce it sending signals to other parts of the brain to reinforce beneficial actions. “The direct analogs between these reinforcement learning algorithms and the biological dopamine system is, I think, the biggest interaction between engineering and biology in, I would say, forever,” Sutton said.
The two men developed a wide range of RL algorithms, including policy gradient methods that seek to find the best rules for moving toward a goal, and Monte Carlo methods used to estimate possible outcomes.
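To give a flavor of the Monte Carlo side, estimating a value this way amounts to averaging the discounted returns observed over complete episodes. The sketch below is an assumption-laden illustration, not a reproduction of their algorithms: episodes are taken to be plain lists of per-step rewards, and the discount factor is an arbitrary choice.

```python
GAMMA = 0.99   # discount factor, chosen arbitrarily for this sketch

def episode_return(rewards):
    """Discounted sum of the rewards from one complete episode."""
    g = 0.0
    for r in reversed(rewards):
        g = r + GAMMA * g
    return g

def monte_carlo_estimate(episodes):
    """Average return across many sampled episodes (assumes a nonempty list)."""
    return sum(episode_return(ep) for ep in episodes) / len(episodes)
```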
Barto earned his B.S. in mathematics in 1970 and his Ph.D. in computer science in 1975, both from the University of Michigan. He eventually became co-director of the Adaptive Networks Laboratory (now the Autonomous Learning Laboratory) at UMass, which since his retirement in 2012 has been run by former students of his.
Sutton was Barto’s first Ph.D. student, earning his degree in 1984. He went from UMass to GTE’s Fundamental Research Lab, where he spent nine years focusing on machine learning. After a period as a self-described “unpaid, part-time ‘Senior Research Scientist’” at UMass, during which he and Barto wrote the first edition of their textbook, he joined the AI department at AT&T Shannon Laboratory. In 2003, he became a professor at the University of Alberta, where he founded the school’s Reinforcement Learning and Artificial Intelligence Laboratory. He is chief scientific advisor at the Alberta Machine Intelligence Institute, one of three Canadian national AI institutes, which he said has been instrumental to his award-winning work, and a research scientist at AI startup Keen Technologies.
Barto and Sutton co-authored Reinforcement Learning: An Introduction, the principal textbook in the field, in 1998, with an expanded second edition released in 2018. The men say other researchers have also played important roles in the field. “It’s not just us,” Barto said. “A lot of people have contributed to aspects of this over the years. I’ve had a number of very wonderful, bright students who built pieces of this whole story.”
A society of learners
Barto expects the field eventually will move toward multi-agent RL, in which a society of neural networks and their individual reward systems interact. That could lead to cooperative arrangements, in which the networks reward each other to encourage progress toward a common goal, but it could also lead to systems where different networks with different goals have conflicts of interest. Such interactions could have implications in complicated fields such as economics and game theory, Barto said.
Sutton said the development of AI has much further to go, including toward artificial general intelligence, in which a machine can understand anything that a human can. RL will play an important role in such developments, he said.
Barto’s advice to young computer scientists is to do what he and Sutton did: follow their own interests regardless of what the rest of the field thinks of them. “It’s difficult, but I think you have to have your own motivation to pursue things and stick with it if you can,” he said.
As for Sutton, “My recommendation is to write, to challenge your thoughts by writing them down.”
Neither has made any firm decisions on what to do with the $1-million prize they will share as part of the Turing Award. Sutton said he will probably donate his share to the Openmind Research Institute he co-founded, which aims to give young scientists the same luxury he and Barto had: to explore basic questions that interest them, without additional responsibilities. Barto said he hopes to use his portion to fund fellowships for graduate students at UMass.
Sutton is optimistic about the future of computer science. “AI will be figured out over the next few decades,” he said. “It will be maybe the greatest intellectual advance ever. So it’s an honor to have contributed a small part to that.”