Uncertainty-based competition between prefrontal and striatal systems for behavioural control

 

Nathaniel D. Daw, Yael Niv and Peter Dayan

 

The responses of midbrain dopamine neurons during appetitive conditioning, together with other neural and behavioural data, have been taken to suggest that these neurons and their striatal targets subserve learned action selection using a temporal difference (TD) algorithm for reinforcement learning (RL). Despite its evidentiary foundations, this hypothesis provides a picture of action control in the brain that is psychologically, anatomically, and computationally incomplete. In particular, psychologists and behavioural neuroscientists have long appealed to the existence of multiple routes to behavioural control, and recent lesion studies have pointed to the relatively independent existence of not only a ``habit'' system associated with dopamine and the striatum, but also a ``goal-directed'' system more loosely localised to prefrontal regions. We propose a model involving separate, coexisting controllers, which we associate with model-free (such as TD) and model-based approaches to RL. Given two such RL systems, a critical issue is determining which system prevails when they disagree. We propose a Bayesian account of arbitration in which the systems compete for control on the basis of their uncertainty, with the one judged most accurate dominating. This theory offers a new and more comprehensive account of a host of confusing experimental results about the trade-off for control between the systems, and has implications well beyond animal conditioning.

Goal-directed and habitual behaviours are distinguished behaviourally in terms of their sensitivity to the value of the consequent outcome. Consider a rat pressing a lever in order to obtain food. Canonically, if the rat continues pressing even after the value of the food is reduced (for instance, by sating the animal), then the behaviour is termed habitual. If the rat instead ceases performing the action, it is identified as goal-directed. Experiments have identified several factors influencing whether a particular action will be habitual or goal-directed. For instance, extensive experience with a behaviour tends to shift its control from goal-directed to habitual; in contrast, proximity of an action to reward (in cases where a sequence of actions must be completed to attain it) favours goal-directed control. As yet, few theoretical ideas exist that would unify or explain this body of data.

Experiments employing targeted lesions suggest that these behavioural categories originate from anatomically distinct controllers -- notably, either sort of behaviour can be separately inactivated. These results, together with fMRI and unit-recording data, suggest that habits are associated with dopamine and the dorsal striatum, while goal-directed actions seem to be controlled by a cortical network centred on prefrontal regions. These latter areas have previously played a limited role in the application of RL theories to the brain.

We suggest that this psychological and neural distinction parallels a computational distinction between ``model-based'' and ``model-free'' approaches to RL. These are two approaches to the special difficulties of choice in sequential tasks (such as mazes), in which the rewarding or punishing consequences of an action may be deferred. Model-based approaches build and maintain a representation (or ``model'') of the immediate consequences of each action, and take decisions by searching iteratively through a chain of contemplated behaviours to predict their long-term consequences. Model-free algorithms such as TD use caching to avoid search: they store the predicted long-term consequences of a choice, so that a decision can be made by comparing the stored values directly.
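To make the contrast concrete, the following minimal Python sketch (with hypothetical states, actions, and values; not the implementation used in our simulations) shows how a cached controller answers a query by table lookup, whereas a model-based controller recomputes values at choice time by recursive search through a learned model:

    # Illustrative sketch only: a one-step lever-pressing world.
    # Model-free control: decisions read directly from a table of cached values.
    Q = {("s0", "press"): 0.8, ("s0", "enter_magazine"): 0.3}

    def model_free_choice(state, actions):
        # One lookup per action; outcomes are never consulted at choice time.
        return max(actions, key=lambda a: Q[(state, a)])

    # Model-based control: a model of immediate consequences, searched on demand.
    T = {("s0", "press"): "food", ("s0", "enter_magazine"): "nothing"}  # transitions
    R = {"food": 1.0, "nothing": 0.0}                                   # outcome values
    terminal = {"food", "nothing"}

    def search_value(state, depth=3):
        # Chain contemplated actions through the model to predict consequences.
        if state in terminal or depth == 0:
            return R.get(state, 0.0)
        return max(search_value(T[(state, a)], depth - 1)
                   for a in ("press", "enter_magazine") if (state, a) in T)

The cached table and the model here encode the same world; the difference lies in when the work of prediction is done.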

We suggest that the brain implements both strategies in parallel: a TD controller (similar to previous theories, involving dopamine and the basal ganglia, and associated with habits) and a newly proposed prefrontal ``goal-directed'' controller that evaluates actions by searching a model. Importantly, the profiles of devaluation sensitivity used behaviourally to characterise goal-directed and habitual responding are also hallmarks of these two computational control strategies. In response to a contingency change such as reward devaluation, a model-based controller (like the goal-directed system) will immediately encounter the changed outcome and adjust its action preferences accordingly. In contrast, without extensive relearning, the cached values used by a TD controller will continue to reflect the previous valuation, and so, like the habit system, a TD system will persist in suboptimal behaviour.
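Continuing the hypothetical sketch above, devaluation can be mimicked by setting the food's value in the model to zero (as if the animal were sated): the searched valuation changes immediately, while the cached table persists until incremental relearning overwrites it.

    # Illustrative devaluation test, reusing the sketch above.
    R["food"] = 0.0   # outcome devalued, e.g. by sating the animal

    search_value("s0")                                    # -> 0.0: pressing is immediately devalued
    model_free_choice("s0", ["press", "enter_magazine"])  # -> "press": the cached value persists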

Model-based and model-free control are suited to different situations; this observation provides us with a principled explanation of the behavioural data concerning the circumstances that promote each. For instance, model-free systems like TD learn cached values by a process of ``bootstrapping''. They therefore use data less efficiently, in a statistical sense, than model-based approaches, and are contraindicated early in training. However, planning becomes exponentially more time-consuming with search depth and, in a computationally limited system, increasingly inaccurate, so a model-based controller is most useful for actions nearer to the goal.
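For concreteness, bootstrapping refers to updating one cached estimate toward another. In the standard Q-learning form of the TD update (standard notation, stated here for exposition, with learning rate $\alpha$ and discount factor $\gamma$),

\[ Q(s,a) \;\leftarrow\; Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right], \]

the target $r + \gamma \max_{a'} Q(s',a')$ itself contains a current estimate, so information about a distal reward propagates backwards only gradually over repeated experiences; this is the sense in which cached values are learned with low statistical efficiency.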

These considerations suggest a theory of how the brain might arbitrate between the competing controllers. We consider Bayesian versions of the RL algorithms, in which each controller tracks its own uncertainty (the expected error in its action valuations), and the most reliable controller prevails. Uncertainty is commonly invoked in neuroscience and psychology for combining separate sources of data (as in polysensory integration); here we extend this approach to arbitrating between estimates derived from different computational algorithms. Our simulations show that this proposed uncertainty-based arbitration accounts for much of the behavioural data.
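As a minimal illustration of the arbitration principle (hypothetical numbers; our actual simulations use Bayesian RL algorithms that track these uncertainties explicitly), each controller can be thought of as reporting, for each action, a value estimate together with its variance, with the less uncertain controller's valuation dictating choice:

    # Illustrative arbitration: each controller reports action -> (mean, variance).
    def arbitrate(tree_estimates, cache_estimates):
        values = {}
        for action in tree_estimates:
            mu_tree, var_tree = tree_estimates[action]
            mu_cache, var_cache = cache_estimates[action]
            # Per action, the controller judged more accurate dominates.
            values[action] = mu_tree if var_tree < var_cache else mu_cache
        return max(values, key=values.get)

    # Early in training the cached values are noisy, so the model-based system controls:
    tree  = {"press": (0.9, 0.1), "enter_magazine": (0.1, 0.1)}
    cache = {"press": (0.4, 0.8), "enter_magazine": (0.5, 0.8)}
    arbitrate(tree, cache)  # -> "press"

With extended training the cached estimates sharpen, while computational noise limits the accuracy of deep search, so control passes to the habit system, as observed behaviourally.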

While the structure of planning has been studied most directly in animal conditioning, our theory is relevant to a number of other areas. Theorists have previously invoked similar two-controller accounts to explain issues such as drug abuse and self-control as resulting from a competition between ``rational'' and ``emotional'' systems. Our theory suggests a very different view, in which two ``rational'' controllers employ different computational strategies in the pursuit of optimality. The theory also has bearing on multiplayer games and economic interactions, where subjects' behaviour can vary systematically depending on whether they model or cache their opponents' countermoves.