Rich Sutton's Talks 


First, a quick guide to the highlights, roughly in order of the talk's potential current interest:
historical interest:


Some of these are .mov files.  It is best to download them and use a Quicktime viewer.

Critterbot Project Overview (Aug 6, 2008)

Mind and Time: A View of Constructivist Reinforcement Learning

This is an invited talk I gave at the European Workshop on Reinforcement Learning in summer 2008.  The basic idea is that in order to learn fast it is necessary to learn slow, that the key to fast reinforcement learning is to prepare for it by a slow continual process of constructing a model of the world's state and dynamics.  Although I don't know exactly how to do this, I have many ideas and suggestions, and an outline of how to proceed.  I try to communicate these in this talk.

New Temporal-Difference Methods Based on Gradient Descent (USC 2/18/09)

ABSTRACT: Temporal-difference methods based on gradient descent and parameterized function approximators form a core part of the modern field of reinforcement learning and are essential to many of its large-scale applications.  However, the most popular methods, including TD(lambda), Q-learning, and Sarsa, are not true gradient-descent methods and, as a result, the conditions under which they converge are narrower and less robust than can usually be guaranteed for gradient-descent methods.  In this paper we introduce a new family of temporal-difference algorithms whose expected updates are in the direction of the gradient of a natural performance measure that we call the "mean squared projected Bellman error".  Because these are true gradient-descent methods, we are able to apply standard techniques to prove them convergent and stable under general conditions including, for the first time, off-policy training. The new methods are of the same order of complexity as TD(lambda) and, when TD(lambda) converges, they converge at a similar rate to the same fixpoints.  The new methods are similar to GTD(0) (Sutton, Szepesvari & Maei, 2009), but based on a different objective function and much more efficient, as we demonstrate in a series of computational experiments.
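
To make the idea of a gradient-based TD update concrete, here is a minimal sketch of one update in the style of what the literature calls TDC (linear TD with a gradient correction term). The function names, the step sizes, and the two-state chain are assumptions chosen for illustration; they are not taken from the talk.

```python
def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def tdc_update(theta, w, phi, reward, phi_next,
               alpha=0.05, beta=0.05, gamma=0.9):
    """One TDC-style update with linear function approximation.

    theta: value-function weights; w: auxiliary weights that estimate
    the expected TD error as a linear function of the features.
    Step sizes alpha/beta and discount gamma are illustrative choices.
    """
    delta = reward + gamma * dot(theta, phi_next) - dot(theta, phi)
    correction = dot(phi, w)
    new_theta = [t + alpha * (delta * f - gamma * correction * fn)
                 for t, f, fn in zip(theta, phi, phi_next)]
    new_w = [wi + beta * (delta - correction) * f
             for wi, f in zip(w, phi)]
    return new_theta, new_w


def run_two_state_chain(sweeps):
    """Hypothetical test problem: A -> B (reward 0), B -> end (reward 1)."""
    A, B, END = [1.0, 0.0], [0.0, 1.0], [0.0, 0.0]
    theta, w = [0.0, 0.0], [0.0, 0.0]
    for _ in range(sweeps):
        theta, w = tdc_update(theta, w, A, 0.0, B)
        theta, w = tdc_update(theta, w, B, 1.0, END)
    return theta, w


theta, w = run_two_state_chain(2000)
print(theta)  # values for A and B approach 0.9 and 1.0
```

On this on-policy toy problem the weights settle at the same values that ordinary TD(0) would find, which matches the abstract's remark about converging to the same fixpoints when TD converges.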

How simple can mind be? (International Workshop on Natural and Artificial Cognition, University of Oxford 6/26/07)

On the Role of Tracking in Stationary Environments (ICML'07 6/21/07) Associated paper.

ABSTRACT: It is often thought that learning algorithms that track the best solution, as opposed to converging to it, are important only on nonstationary problems. We present three results suggesting that this is not so. First we illustrate in a simple concrete example, the Black and White problem, that tracking can perform better than any converging algorithm on a stationary problem. Second, we show the same point on a larger, more realistic problem, an application of temporal-difference learning to computer Go. Our third result suggests that tracking in stationary problems could be important for meta-learning research (e.g., learning to learn, feature selection, transfer). We apply a meta-learning algorithm for step-size adaptation, IDBD, to the Black and White problem, showing that meta-learning has a dramatic long-term effect on performance whereas, on an analogous converging problem, meta-learning has only a small second-order effect. This small result suggests a way of eventually overcoming a major obstacle to meta-learning research: the lack of an independent methodology for task selection.


Stimulus Representation in Temporal-Difference Models of the Dopamine System (Cal Tech 6/4/07)

The neurotransmitter dopamine plays an important role in the processing of reward-related information in the brain. A prominent theory of this function is that the phasic firing of dopamine neurons encodes a reward prediction error as formalized by the temporal-difference (TD) algorithm in reinforcement learning. Most of these TD models of the dopamine system have assumed a "complete serial compound" representation in which every moment within a trial is represented distinctly with no similarity to neighboring moments. In this paper we present a more realistic temporal representation in which external stimuli spawn a series of internal microstimuli which grow weaker and more diffuse over time. We show that if these microstimuli are used as inputs to the TD model, then its match to experimental data is improved for hitherto problematic cases in which reward is omitted or received early.  We also note that the new model never produces large negative errors, suggesting that a second neurotransmitter for representing negative errors may not be necessary. Generally, we conclude that choosing a stimulus representation with a more realistic temporal profile can significantly alter the predictions of the TD model of dopamine function.
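
One way to picture microstimuli that "grow weaker and more diffuse over time" is as Gaussian basis functions applied to an exponentially decaying memory trace. The sketch below is only an illustration under that assumption: the function names, the number of microstimuli, the decay rate, and the basis width are all hypothetical choices, not values confirmed by the talk.

```python
import math


def microstimuli(t, num=4, decay=0.985, sigma=0.08):
    """Levels of `num` microstimuli t steps after stimulus onset.

    Assumed form: a memory trace decays exponentially from 1, and each
    microstimulus is a Gaussian bump over trace height (centers at
    1, (num-1)/num, ..., 1/num) scaled by the current trace height.
    """
    trace = decay ** t
    levels = []
    for i in range(num, 0, -1):
        center = i / num
        basis = math.exp(-((trace - center) ** 2) / (2 * sigma ** 2))
        basis /= math.sqrt(2 * math.pi)
        levels.append(basis * trace)
    return levels


def peak(index, horizon=300):
    """Return (time of peak, peak height) for one microstimulus."""
    series = [microstimuli(t)[index] for t in range(horizon)]
    height = max(series)
    return series.index(height), height


for i in range(4):
    print(i, peak(i))  # later microstimuli peak later and lower
```

Because each bump is scaled by the decaying trace, microstimuli tuned to lower trace heights peak later, peak lower, and are spread over more time steps.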


Experience-Oriented Artificial Intelligence (Machine Learning Seminar at the University of Toronto, 4/3/06)

If intelligence is a computation, then the temporal stream of sensations is its input, and the temporal stream of actions is its output. These two intermingled time series make up experience.  They are the basis on which all intelligent decisions are made and the basis on which those decisions are judged. A focus on experience has implications for many aspects of AI; in this talk we consider its implications for knowledge representation. I propose that it is possible and desirable for an AI agent's knowledge of the world to be expressed entirely as predictions about its low-level experience. Even abstract concepts, such as the concept of a chair, can be expressed as predictions, e.g., about what will happen if we try to sit. The predictive approach is appealing because it connects knowledge directly to data, allowing knowledge to be autonomously verified and tuned, perhaps even learned. However, there is a tremendous gap between human-level knowledge (e.g., about space, objects, people, or water) and low-level experience.  The purpose of this talk is to present some recent work suggesting how this gap might someday be bridged.  I describe a series of small experiments in which extensions of reinforcement learning methods are used to learn predictive representations of abstract commonsense knowledge in micro-worlds. These are first steps on a long journey toward understanding how a mind might make sense of the blooming, buzzing confusion of its sensori-motor experience.


Predictive Representations of State and Knowledge (ICML'05 workshop on Rich Representations for Reinforcement Learning, 8/7/05)

What is knowledge? The empiricist answer, dating back to the 19th century, is that knowledge is the ability to predict. In a modern version of this idea, reinforcement learning researchers have proposed that artificial agents should represent their knowledge as predictions of their low-level sensations and actions.  This predictive representations (PR) approach is appealing because it connects knowledge directly to data, thereby facilitating learning and clarifying semantics.  Most PR research has emphasized representing the world's _state_.  In this talk I will survey the main results and mathematical ideas of that work. A natural follow-on, just beginning to be explored, is to use PRs for all kinds of world knowledge, of dynamics as well as of state, of abstractions as well as specifics.  I will survey this work as well and attempt to make vivid the potential of PRs for artificial intelligence.


Grounding knowledge in subjective experience (provocative remarks at the 2nd Cognitive Systems Conference, 5/20/05)


Experience-Oriented Artificial Intelligence (McGill 11/30/05)

I propose that experience - the explicit sequence of actions and sensations over an agent's life - should play a central role in all aspects of artificial intelligence. In particular:

1. Knowledge representation should be in terms of experience. Recent work has shown that a surprisingly wide range of world knowledge can be expressed as predictions of experience, enabling it to be automatically verified and tuned, and grounding its meaning in data rather than in human understanding.

2. Planning/reasoning should be in terms of experience. It is natural to think of planning as comparing alternative future experiences. General methods, such as dynamic programming, can be used to plan using knowledge expressed in the aforementioned predictive form.

3. State representation should be in terms of experience. Rather than talk about objects and their metric or even topological relationships, we represent states by the predictions that can be made from them. For example, the state "John is in the coffee room" corresponds to the prediction that going to the coffee room will produce the sight of John.

Much here has yet to be worked out. Each of the "should"s above can also be read as a "could", or even a "perhaps could". I am optimistic and enthusiastic because of the potential for developing a compact and powerful theory of AI in the long run, and for many easy experimental tests in the short run.


Grounding Commonsense Knowledge in Question Networks.  (University of Michigan 9/28/04)

A long-standing challenge in artificial intelligence has been to relate the kind of commonsense knowledge that people have about the world (for example, about space, objects, people, trees and water) to the low-level stream of sensations and actions.  In this talk, we present new work that brings us a few steps closer to realizing this goal.  We introduce the idea of question networks, a way of expressing arbitrary machine-readable questions about future sensations and actions, and a temporal-difference algorithm for learning answers to the questions.  In a series of small experiments, we illustrate the learning efficiency of these methods and their ability to handle non-Markov problems.  Finally, we present their extension to temporally abstract knowledge in terms of closed-loop macro-actions known as options.  Overall, we argue that these steps bring us qualitatively closer to understanding the blooming, buzzing confusion of sensori-motor experience.


Temporal Difference Networks.  Presented at NIPS-04. Larger Version.

We introduce a generalization of temporal-difference (TD) learning to networks of interrelated predictions. Rather than relating a single prediction to itself at a later time, as in conventional TD methods, a TD network relates each prediction in a set of predictions to other predictions in the set at a later time. TD networks can represent and apply TD learning to a much wider class of predictions than has previously been possible. Using a random-walk example, we show that these networks can be used to learn to predict by a fixed interval, which is not possible with conventional TD methods. Secondly, we show that when actions are introduced, and the inter-prediction relationships made contingent on them, the usual learning-efficiency advantage of TD methods over Monte Carlo (supervised learning) methods becomes particularly pronounced. Thirdly, we demonstrate that TD networks can learn predictive state representations that enable exact solution of a non-Markov problem. A very broad range of inter-predictive temporal relationships can be expressed in these networks. Overall we argue that TD networks represent a substantial extension of the abilities of TD methods and bring us closer to the goal of representing world knowledge in entirely predictive, grounded terms.


Knowledge Representation in TD Networks (AAAI Symposium on MDPs and POMDPs: Advances and Challenges, 7/26/04) Large (1024 x 768) version

We introduce a generalization of temporal-difference (TD) learning to networks of interrelated predictions. Rather than relating a single prediction to itself at a later time, as in conventional TD methods, a TD network relates each prediction in a set of predictions to other predictions in the set at a later time. TD networks can represent and apply TD learning to a much wider class of predictions than has previously been possible. Using a random-walk example, we show that these networks can be used to learn to predict by a fixed interval, which is not possible with conventional TD methods. Secondly, we show that when actions are introduced, and the inter-prediction relationships made contingent on them, the usual learning-efficiency advantage of TD methods over Monte Carlo (supervised learning) methods becomes particularly pronounced. Thirdly, we demonstrate that TD networks can learn predictive state representations that enable exact solution of a non-Markov problem. A very broad range of inter-predictive temporal relationships can be expressed in these networks. Overall we argue that TD networks represent a substantial extension of the abilities of TD methods and bring us closer to the goal of representing world knowledge in entirely predictive, grounded terms.


Toward a Computational Theory of Intelligence -- iCORE talk on Reinforcement Learning and Artificial Intelligence (University of Calgary 2/25/04). Video here.

This talk was to a general university audience (videocast to U. Alberta and U. Lethbridge). To showcase the ideas and power of RL, I collected a bunch of videos from other people's work. It's not often you can do this appropriately, but I think it was ok this time, and certainly it was fun. The accompanying videos:
All save the last are quicktimable and will play directly in safari. The last seems to require mplayer.


Adapting bias by gradient descent: An incremental version of delta-bar-delta (University of Alberta 2/2/04)

Appropriate bias is widely viewed as the key to efficient learning and generalization. I present a new algorithm, the Incremental Delta-Bar-Delta (IDBD) algorithm, for the learning of appropriate biases based on previous learning experience. The IDBD algorithm is developed for the case of a simple, linear learning system---the LMS or delta rule with a separate learning-rate parameter for each input. The IDBD algorithm adjusts the learning-rate parameters, which are an important form of bias for this system. Because bias in this approach is adapted based on previous learning experience, the appropriate testbeds are drifting or non-stationary learning tasks. For particular tasks of this type, I show that the IDBD algorithm performs better than ordinary LMS and in fact finds the optimal learning rates. The IDBD algorithm extends and improves over prior work by Jacobs and by me in that it is fully incremental and has only a single free parameter. This paper also extends previous work by presenting a derivation of the IDBD algorithm as gradient descent in the space of learning-rate parameters. Finally, I offer a novel interpretation of the IDBD algorithm as an incremental form of hold-one-out cross validation.
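
The per-input step-size adaptation described above can be sketched in a few lines. This is a minimal illustration of an IDBD-style update for a linear (LMS) learner; the function names, the meta step size, the initial step size of 0.05, and the toy task with one relevant and one irrelevant input are assumptions made for the example.

```python
import math
import random


def idbd_update(w, beta, h, x, target, meta_rate=0.01):
    """One IDBD-style update, modifying w, beta and h in place.

    w: weights; beta: log step sizes (one per input);
    h: per-input memory traces of recent weight changes.
    Returns the prediction error before the update.
    """
    delta = target - sum(wi * xi for wi, xi in zip(w, x))
    for i, xi in enumerate(x):
        beta[i] += meta_rate * delta * xi * h[i]   # adapt log step size
        alpha = math.exp(beta[i])                  # per-input step size
        w[i] += alpha * delta * xi                 # LMS step
        h[i] = h[i] * max(0.0, 1.0 - alpha * xi * xi) + alpha * delta * xi
    return delta


def train(steps, seed=0):
    """Toy task (hypothetical): target depends only on the first input."""
    rng = random.Random(seed)
    w, h = [0.0, 0.0], [0.0, 0.0]
    beta = [math.log(0.05)] * 2
    for _ in range(steps):
        x = [rng.choice([-1.0, 1.0]), rng.choice([-1.0, 1.0])]
        idbd_update(w, beta, h, x, target=2.0 * x[0])
    return w, [math.exp(b) for b in beta]


weights, step_sizes = train(5000)
print(weights, step_sizes)
```

In this stationary, noise-free toy task the step sizes stop changing once the error vanishes; the drifting tasks mentioned in the abstract are where continued adaptation of the step sizes matters.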


From Markov Decision Processes to Artificial Intelligence (University of Alberta 5/14/03)

The path to general, human-level intelligence may go through Markov decision processes (MDPs), a discrete-time, probabilistic formulation of sequential decision problems in terms of states, actions, and rewards. Developed in the 1950s, MDPs were extensively explored and applied in operations research and engineering before coming to the attention of artificial intelligence researchers about 15 years ago. Much of the new interest has come from the field of reinforcement learning, where novel twists on classical dynamic programming methods have enabled the solution of more and vastly larger problems, such as backgammon (Tesauro, 1995) and elevator control (Crites and Barto, 1996). Despite remaining technical issues, real progress seems to have been made toward general learning and planning methods relevant to artificial intelligence. We suggest that the MDP framework can be extended further, to the threshold of human-level intelligence, by abstracting and generalizing each of its three components - actions, states, and rewards. We briefly survey recent work on temporally abstract actions (Precup, 2000; Parr, 1998), predictive representations of state (Littman et al., 2002), and non-reward subgoals (Sutton, Precup & Singh, 1998) to make this suggestion.


Reinforcement Learning's Computational Theory of Mind (Rutgers Psychology 2/14/03)

The reinforcement learning approach to understanding intelligence is now about 20 years old, which should be time enough for a mature perspective on what it is and what it has contributed. Reinforcement learning methods, particularly temporal-difference learning, have been widely used in control and robotics applications, in playing games such as chess and backgammon, in operations research, and as models of animal learning and neural reward systems. Holding these diverse applications together, and posing as a fundamental statement about cognition and decision-making, is a computational theory (in the sense of Marr) of mind. Reinforcement learning methods are centered around the interaction and simultaneous evolution of two primary functional objects, the policy, which says what to do in each situation, and the value function, which says how desirable it is to be in each situation. In this talk, I will survey several examples of reinforcement learning in the attempt to make this underlying theory vivid. Finally, I will mention some of the theory's limitations and shortcomings, and ongoing efforts to make it relevant to the extremely powerful and flexible cognition that we see in humans.


Experience-Oriented Artificial Intelligence (Nov 2002)

I propose that experience - the explicit sequence of actions and sensations over an agent's life - should play a central role in all aspects of artificial intelligence. In particular:

1. Knowledge representation should be in terms of experience. Recent work has shown that a surprisingly wide range of world knowledge can be expressed as predictions of experience, enabling it to be automatically verified and tuned, and grounding its meaning in data rather than in human understanding.

2. Planning/reasoning should be in terms of experience. It is natural to think of planning as comparing alternative future experiences. General methods, such as dynamic programming, can be used to plan using knowledge expressed in the aforementioned predictive form.

3. State representation should be in terms of experience. Rather than talk about objects and their metric or even topological relationships, we represent states by the predictions that can be made from them. For example, the state "John is in the coffee room" corresponds to the prediction that going to the coffee room will produce the sight of John.

Much here has yet to be worked out. Each of the "should"s above can also be read as a "could", or even a "perhaps could". I am optimistic and enthusiastic because of the potential for developing a compact and powerful theory of AI in the long run, and for many easy experimental tests in the short run.

[some of this is joint work with Doina Precup, Michael Littman, Satinder Singh & Peter Stone]


Artificial Intelligence Should Be About Predictions (AT&T 12/7/01)

What keeps the knowledge in an AI system correct? Usually people do, but that is a dead end; eventually the AI must do it itself. Building AIs that can maintain their own knowledge is probably the greatest single challenge facing AI today.

It would be relatively easy to self-maintain knowledge if it were expressed as predictions: you would predict something and then see what actually happened. In this talk I propose that much of our knowledge of the world can be expressed as predictions that can be verified in this way. Certainly much of our everyday decision-making is based on predictions about alternative courses of action. Even abstract concepts, such as the concept of a chair, can be expressed as predictions, e.g., about what would happen if we try to sit. Emphasizing ideas rather than technical details, I will describe some of the challenges to this predictive view and partial solutions. The main challenge is to be able to express in predictive form the wide variety of knowledge we have of the world. This can be done in large part by allowing the predictions to be conditional on action and to terminate flexibly, as in the "options" framework. A second challenge is to be fully grounded, to relate the meaning of predictions directly to data. Finally, we consider the pragmatic challenges: how to make progress with these ideas? Building a self-maintaining AI based on predictive knowledge is not difficult, but requires new ways of thinking, determination to do it right, and a willingness to proceed slowly.


We Have Not Yet Begun to Learn (19th Reinforcement Learning Workshop, AT&T 9/20/01)


Mind is About Predictions (Northeastern 7/31/01)

In this talk I will describe recent research in artificial intelligence which has given greater credence to the old idea that much of our knowledge of the world is in the form of predictions. From the blooming, buzzing confusion we extract what is predictable, and in so doing discover useful concepts and ways of behaving. Certainly, much of our everyday reasoning and decision making is based on predictions about alternative courses of action. Even abstract concepts, such as the concept of a chair, can be expressed as predictions, e.g., about what will happen if we try to sit. In this talk I will briefly cover three ideas: 1) an expanded notion of prediction capable of expressing a broad range of knowledge, 2) a kind of planning, or reasoning, as the combination of predictions to yield new predictions, and 3) a way of representing the state of the world (as well as its dynamics) as predictions. All this suggests that working with predictions is what the mind is all about---that predictions are the coin of the mental realm.

(Some of the newer bits of this are joint work with Michael Littman, Doina Precup, and Satinder Singh; also many thanks to David McAllester for constructive criticism.)


Off-policy temporal-difference learning with function approximation (ICML 7/1/01)

We introduce the first algorithm for off-policy temporal-difference learning that is stable with linear function approximation. Off-policy learning is of interest because it forms the basis for popular reinforcement learning methods such as Q-learning, which has been known to diverge with linear function approximation, and because it is critical to the practical utility of multi-scale, multi-goal, learning frameworks such as options, HAMs, and MAXQ. Our new algorithm combines TD(lambda) over state-action pairs with importance sampling ideas from our previous work. We prove that, given training under any epsilon-soft policy, the algorithm converges w.p.1 to a close approximation (as in Tsitsiklis and Van Roy, 1997; Tadic, 2001) to the action-value function for an arbitrary target policy. Variations of the algorithm designed to reduce variance introduce additional bias but are also guaranteed convergent. We also illustrate our method empirically on a small policy evaluation problem, showing reduced variance compared to the most obvious importance sampling algorithm for this problem. Our current results are limited to episodic tasks with episodes of bounded length.
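
The importance sampling idea that the algorithm builds on can be illustrated without any learning at all. The sketch below is not the algorithm from the talk; it only shows, by exact enumeration on a stateless toy problem, that weighting returns generated by a behavior policy with the product of per-step ratios pi(a)/b(a) recovers the expected return of the target policy. All names and numbers are assumptions for illustration.

```python
from itertools import product


def expected_weighted_return(target, behavior, rewards, horizon):
    """Exact expectation, under the behavior policy, of the
    importance-weighted return over all action sequences.

    target, behavior: action probabilities (behavior must give every
    action nonzero probability, as an epsilon-soft policy does).
    rewards: reward received for each action.
    """
    total = 0.0
    for actions in product(range(len(rewards)), repeat=horizon):
        prob_behavior = 1.0   # probability of this sequence under b
        rho = 1.0             # product of per-step importance ratios
        ret = 0.0             # undiscounted return of the sequence
        for a in actions:
            prob_behavior *= behavior[a]
            rho *= target[a] / behavior[a]
            ret += rewards[a]
        total += prob_behavior * rho * ret
    return total


value = expected_weighted_return(
    target=[0.1, 0.9], behavior=[0.5, 0.5], rewards=[0.0, 1.0], horizon=3)
print(value)  # matches the target policy's expected return, 3 * 0.9
```

The product of ratios grows or shrinks geometrically with episode length, which is one way to see why variance is a central concern and why the results above are stated for episodes of bounded length.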


Overcoming the Curse of Dimensionality with Reinforcement Learning (MIT ORC 4/19/01)

Technological advances in the last few decades have made computation and memory vastly cheaper and thus available in massive quantities. The field of reinforcement learning attempts to take advantage of this trend when solving large-scale stochastic optimal control problems. Dynamic programming can solve small instances of such problems, but suffers from Bellman's "curse of dimensionality," the tendency of the state space and thus computational complexity to scale exponentially with the number of state variables (and thus to quickly exceed even the "massive" computational resources now available). Reinforcement learning brings in two new techniques: 1) parametric approximation of the value function, and 2) sampling of state trajectories (rather than sweeps through the state space). These enable finding approximate solutions, improving in quality with the available computational resources, on problems too large to even be attempted with conventional dynamic programming. However, these techniques also complicate theory, and there remain substantial gaps between the reinforcement learning methods proven effective and those that appear most effective in practice. In this talk, I present results extending the convergence result of Tsitsiklis and Van Roy for on-policy evaluation with linear function approximation to the off-policy case, reviving the possibility of convergence results for value-based off-policy control methods such as Q-learning. I also present an application to RoboCup soccer illustrating the linear approach to function approximation. (This is joint work with Doina Precup, Satinder Singh, Peter Stone, and Sanjoy Dasgupta.)


The Right Way to do Reinforcement Learning with Function Approximation (NIPS'00 12/2/00)


From Reflex to Reason (Cornell 12/8/00)

How close are we to a computational understanding of the mind? Perhaps closer than is usually thought. In this talk I discuss a small set of principles drawn from reinforcement learning and other parts of artificial intelligence that cover a broad range of mental phenomena, from reflexes through various kinds of learning, planning, and reasoning. These principles include rewards, value functions, state-space search, and, as I emphasize in this talk, representing our knowledge of the world as predictions of future observations. First, I show how predictive representations provide a new theory of that simplest of learning phenomena, Pavlovian conditioning or the learning of reflexes. Second, I briefly outline how model-based reinforcement learning with mental simulation can serve as a theory of reasoning. I argue that representing knowledge as predictions, including the possibility of action-contingent and temporally indefinite predictions, solves critical problems in the semantics and grounding of classical symbolic approaches to knowledge representation.


Toward Grounding Knowledge in Prediction (CEC2000 7/18/00)

Any attempt to build intelligent machines must come to grips with the question of knowledge, of what kind of information about the world the machine stores and manipulates. Traditionally there have been two approaches, the horns of a dilemma. One uses verbal statements like "John loves Mary" or "Socrates is a man" whose meaning is clear only to people, not to machines; such knowledge is ungrounded. The other uses mathematical statements like differential equations or transition matrices which, although clear and grounded, have never seemed adequate for expressing the commonsense knowledge we all have about the world and use every day. In this talk we suggest that this dilemma can be broken by grounding knowledge in an enlarged notion of conditional prediction. In particular, if we allow predictions conditional on outcomes (as in Precup, 2000; Parr, 1999) then much more can be expressed as predictions without losing grounding and mathematical clarity. In addition, this approach suggests a radical theory of reasoning---combining knowledge to yield new knowledge---as simple composition of predictions.


A Least Common Denominator for Temporal Abstraction in Reinforcement Learning (NIPS workshop 12/5/98)


Improved Switching Among Temporally Abstract Actions (NIPS 12/2/98)

In robotics and other control applications it is commonplace to have a pre-existing set of controllers for solving subtasks, perhaps hand-crafted or previously learned or planned, and still face a difficult problem of how to choose and switch among the controllers to solve an overall task as well as possible. In this paper we present a framework based on Markov decision processes and semi-Markov decision processes for phrasing this problem, a basic theorem regarding the improvement in performance that can be obtained by switching flexibly between given controllers, and example applications of the theorem. In particular, we show how an agent can plan with these high-level controllers and then use the results of such planning to find an even better plan, by modifying the existing controllers, with negligible additional cost and no re-planning. In one of our examples, the complexity of the problem is reduced from 24 billion state-action pairs to less than a million state-controller pairs.


Reinforcement Learning: How Far Can It Go? (Past, Present, and Future) (ICML/COLT/UAI 7/25/98, Extended abstract)


Between MDPs and Semi-MDPs (Stanford 3/5/98)

A key challenge for AI is how to learn, plan, and represent knowledge at multiple levels of temporal abstraction. In this talk I develop an approach based on the mathematical framework of reinforcement learning and Markov decision processes (MDPs). The usual framework is extended to include closed-loop multi-step options---whole courses of behavior that may be temporally extended, stochastic, and contingent on events. Examples of options include picking up an object, going to lunch, and traveling to a distant city, as well as primitive actions such as muscle twitches and joint torques. Options can be used interchangeably with primitive actions in reinforcement learning and planning methods, and can be analyzed in terms of a generalized kind of MDP known as a semi-Markov decision process (SMDP) (e.g., Puterman, 1994; Bradtke and Duff, 1995; Parr, 1998; Precup and Sutton, 1997). In this talk I focus on the interplay between the MDP and SMDP levels of analysis. I show how a set of options can be improved by changing their termination conditions to improve over SMDP planning methods with no additional cost. I also present novel intra-option temporal-difference methods that substantially improve over SMDP methods. Finally, I discuss how options themselves can be learned, introducing a new notion of subgoal and subtask into reinforcement learning. Overall, I argue that options and models of options provide hitherto missing aspects of a powerful, clear, and expressive framework for representing and organizing knowledge. (Joint work with Doina Precup and Satinder Singh.)
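
As a rough illustration of what an option looks like as a data structure, the sketch below represents one as an initiation set, a policy, and a termination condition, and runs it until it terminates. The names, the corridor environment, and the use of a deterministic termination test (the framework allows stochastic termination) are simplifying assumptions for this example.

```python
from dataclasses import dataclass
from typing import Callable, Set, Tuple


@dataclass
class Option:
    initiation_set: Set[int]            # states where the option may start
    policy: Callable[[int], int]        # state -> primitive action
    terminates: Callable[[int], bool]   # simplified: deterministic


def run_option(option: Option, state: int,
               step: Callable[[int, int], Tuple[int, float]],
               gamma: float = 0.9) -> Tuple[int, float, int]:
    """Execute an option to termination.

    Returns (final state, discounted reward accumulated, steps taken),
    the quantities an SMDP-level learner would treat as one transition.
    """
    if state not in option.initiation_set:
        raise ValueError("option cannot be initiated in this state")
    total, discount, steps = 0.0, 1.0, 0
    while True:
        state, reward = step(state, option.policy(state))
        total += discount * reward
        discount *= gamma
        steps += 1
        if option.terminates(state):
            return state, total, steps


def corridor_step(state: int, action: int) -> Tuple[int, float]:
    """Hypothetical corridor of states 0..5 with a cost of 1 per step."""
    return min(state + action, 5), -1.0


go_to_end = Option(initiation_set={0, 1, 2, 3, 4},
                   policy=lambda s: 1,
                   terminates=lambda s: s == 5)

print(run_option(go_to_end, 2, corridor_step))  # ends in state 5 after 3 steps
```

A primitive action fits the same interface as an option that always terminates after one step, which is what allows options and primitive actions to be used interchangeably.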


Reinforcement Learning: A Tutorial (GP-98 7/23/98)

Reinforcement learning is learning about, from, and while interacting with an environment in order to achieve a goal. In other words, it is a relatively direct model of the learning that people and animals do in their normal lives. In the last two decades, this age-old problem has come to be much better understood by integrating ideas from psychology, optimal control, artificial neural networks, and artificial intelligence. New methods and combinations of methods have enabled much better solutions to large-scale applications than had been possible by all other means. This tutorial will provide a top-down introduction to the field, covering Markov decision processes and approximate value functions as the formulation of the problem, and dynamic programming, temporal-difference learning, and Monte Carlo methods as the principal solution methods. The role of neural networks and planning will also be covered. The emphasis will be on understanding the capabilities and appropriate role of each class of methods within an integrated system for learning and decision making.


Reinforcement Learning: Lessons for Artificial Intelligence (IJCAI 8/28/97)

The field of reinforcement learning has recently produced world-class applications and, as we survey in this talk, scientific insights that may be relevant to all of AI. In my view, the main things that we have learned from reinforcement learning are 1) the power of learning from experience as opposed to labeled training examples, 2) the central role of modifiable evaluation functions in organizing sequential behavior, and 3) that learning and planning could be radically similar.


Reinforcement Learning and Information Access (AAAI-SS 3/26/96)  


Constructive Induction Needs a Methodology based on Continuing Learning (ICML94, Workshop on Constructive Induction, panel remarks)