Uncertainty-based competition between prefrontal and striatal
systems for behavioural control
The responses of midbrain dopamine neurons during appetitive
conditioning, together with other neural and behavioural data, have
been taken to suggest that these neurons and their striatal targets
subserve learned action selection using a temporal difference (TD)
algorithm for reinforcement learning (RL). Despite its empirical
foundations, this hypothesis provides a picture of action control in
the brain that is psychologically, anatomically, and computationally
incomplete. In particular, psychologists and behavioural
neuroscientists have long appealed to the existence of multiple routes
to behavioural control, and recent lesion studies have pointed to the
relatively independent existence of not only a ``habit'' system
associated with dopamine and the striatum, but also a
``goal-directed'' system more loosely localised to prefrontal
regions. We propose a model involving separate, coexisting
controllers, which we associate with model-free (such as TD) and
model-based approaches to RL. Given two such RL systems, a critical
issue is determining which system prevails when they disagree. We
propose a Bayesian account of arbitration in which the systems compete
for control on the basis of their uncertainty, with the one judged
most accurate dominating. This theory offers a new and more
comprehensive account of a host of otherwise puzzling experimental
results concerning the trade-off between the systems for control, and has
implications well beyond animal conditioning.
Goal-directed and habitual behaviours are distinguished behaviourally
in terms of their sensitivity to the value of the consequent
outcome. Consider a rat pressing a lever in order to obtain
food. Canonically, if the rat continues pressing even after the value
of the food is reduced (for instance, by sating the animal), then the
behaviour is termed habitual. If the rat instead ceases performing the
action, it is identified as goal-directed. Experiments have identified
several factors influencing whether a particular action will be
habitual or goal-directed. For instance, extensive experience with a
behaviour tends to shift its control from goal-directed to habitual;
in contrast, proximity of an action to reward (in cases where a
sequence of actions must be completed to attain it) favours
goal-directed control. As yet, few theoretical ideas exist that would
unify or explain this body of data.
Experiments employing targeted lesions suggest that these behavioural
categories originate from anatomically distinct controllers --
notably, either sort of behaviour can be selectively abolished.
These results, together with fMRI and unit-recording data, suggest that habits
are associated with dopamine and the dorsal striatum, while
goal-directed actions seem to be controlled by a cortical network,
centred on prefrontal regions. These latter areas have previously
played a limited role in the application of RL theories to the brain.
We suggest that this psychological and neural distinction parallels a
computational distinction between ``model-based'' and ``model-free''
approaches to RL. These are two approaches to the special difficulties
of choice in sequential tasks (such as mazes), in which the rewarding
or punishing consequences of an action may be deferred. Model-based
approaches build and maintain a representation (or ``model'') of the
immediate consequences of each action, and take decisions by searching
iteratively through a chain of contemplated behaviours to predict
their long term consequences. Model-free algorithms such as TD use
caching to avoid search --- they store the predicted long-term
consequences of a choice, so that a decision can be made by comparing
the stored values directly.
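To make the contrast concrete, consider the following minimal sketch
in Python (a gloss on the two strategies, not any specific published
implementation; the structures Q, T and R, and all parameter values,
are invented for exposition, and a small discrete task with a uniform
action set is assumed). The essential difference is the computation
performed at decision time: a direct cache lookup versus an iterative
search through the model.

    # Model-free (TD-style) control: cached long-run values.
    Q = {}  # cache mapping (state, action) -> predicted long-term value

    def td_update(s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
        # One TD backup: nudge the cached value of (s, a) toward the
        # observed reward plus the best cached value at the next state.
        best_next = max(Q.get((s_next, b), 0.0) for b in actions)
        target = r + gamma * best_next
        Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))

    def td_choose(s, actions):
        # Decision by direct comparison of stored values: no search.
        return max(actions, key=lambda a: Q.get((s, a), 0.0))

    # Model-based control: a learned one-step model, searched at
    # decision time (T: state -> action -> next state;
    # R: state -> action -> immediate reward; deterministic for brevity).
    def plan_value(s, T, R, actions, depth, gamma=0.9):
        # Chain one-step predictions through the model to the given depth.
        if depth == 0 or s not in T:
            return 0.0
        return max(R[s][a] + gamma * plan_value(T[s][a], T, R, actions,
                                                depth - 1, gamma)
                   for a in actions)

    def mb_choose(s, T, R, actions, depth=5, gamma=0.9):
        # Evaluate each candidate action by searching from its successor.
        return max(actions, key=lambda a: R[s][a]
                   + gamma * plan_value(T[s][a], T, R, actions,
                                        depth - 1, gamma))

Note the asymmetry: td_choose costs a single lookup however distant
the rewards, whereas mb_choose must re-run the search on every
decision.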
We suggest that the brain implements both strategies in parallel: a TD
controller (similar to previous theories, involving dopamine and the
basal ganglia, and associated with habits) and a newly proposed
prefrontal ``goal-directed'' controller that evaluates actions by
searching a model. Importantly, the profiles of devaluation
sensitivity used behaviourally to characterise goal-directed and
habitual responding are also hallmarks of these two computational
control strategies. In response to a contingency change such as reward
devaluation, a model-based controller (like the goal-directed system)
will immediately encounter the changed outcome and adjust its action
preferences accordingly. In contrast, without extensive relearning,
the cached values used by a TD controller will continue to reflect the
previous valuation, and so, like the habit system, a TD system will
persist in suboptimal behaviour.
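The devaluation logic can be replayed in a toy simulation (a
deliberately minimal sketch; the one-step lever-pressing task, the
variable outcome_value and all parameter values are invented for
exposition):

    # One-step task: pressing a lever yields food whose current worth
    # is held in outcome_value.
    alpha, n_training_trials = 0.1, 200
    outcome_value = 1.0            # value of the food before devaluation

    # Model-free: cache the long-run value of pressing via TD-style updates.
    cached = 0.0
    for _ in range(n_training_trials):
        cached += alpha * (outcome_value - cached)   # converges toward 1.0

    # Model-based: store the outcome identity and look up its current
    # value afresh at every decision.
    def model_based_value():
        return outcome_value

    # Devalue the food (for instance, the animal is sated).
    outcome_value = 0.0

    print(round(cached, 3))      # ~1.0: the cache still reflects the old
                                 # value, so a TD controller keeps
                                 # pressing (habitual persistence)
    print(model_based_value())   # 0.0: the search registers the change
                                 # at once, so responding ceases
                                 # (goal-directed sensitivity)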
Model-based and model-free control are suited to different situations;
this observation provides us with a principled explanation of the
behavioural data concerning the circumstances that promote each. For
instance, model-free systems like TD learn cached values by a process
of ``bootstrapping'': predictions are trained towards other, initially
unreliable, predictions. They therefore use data less efficiently, in
a statistical sense, than model-based approaches and are
contraindicated early in training. However, planning becomes
exponentially more time-consuming, and in a computationally limited
system less accurate, as searches deepen; a model-based controller is
therefore most useful for actions nearer the goal.
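A back-of-envelope calculation makes this asymmetry concrete (the
branching factor and depths below are arbitrary illustrative values):

    # A full search of depth d with branching factor b must examine on
    # the order of b**d action sequences; a cached lookup is constant-time.
    b = 3                          # actions available per state
    for d in (1, 2, 4, 8):
        print(d, b ** d)           # 3, 9, 81, 6561 sequences to evaluate
    # Shallow searches (actions near the goal) stay cheap and accurate;
    # deep searches quickly exceed any realistic computational budget.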
These considerations suggest a theory of how the brain might arbitrate
between the competing controllers. We consider Bayesian versions of
the RL algorithms, in which each controller keeps track of its
uncertainty, that is, of the expected error in its action valuations,
with the most reliable controller prevailing. Uncertainty is commonly
invoked in neuroscience and psychology as a basis for combining
separate sources of data (as in multisensory integration); here we
extend this approach to arbitrating between estimates derived from
different computational algorithms. Our simulations show that this
proposed uncertainty-based
arbitration accounts for much of the behavioural data.
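One way to render the arbitration rule concrete is the following
sketch (a simplification of the idea, not the full Bayesian estimators
used in the simulations; the function name arbitrate and all numerical
values are invented for exposition):

    # Each controller reports a value estimate plus its uncertainty
    # (posterior variance) for a candidate action; control is granted
    # to the controller judged more accurate, i.e. less uncertain.
    def arbitrate(mf_estimate, mf_variance, mb_estimate, mb_variance):
        if mf_variance < mb_variance:
            return 'habitual', mf_estimate
        return 'goal-directed', mb_estimate

    # Early in training the cache is poorly trained (high model-free
    # variance), so the goal-directed system dominates; with experience
    # the cached values sharpen and control passes to the habit system.
    print(arbitrate(0.8, 0.50, 1.0, 0.10))   # -> ('goal-directed', 1.0)
    print(arbitrate(0.9, 0.02, 1.0, 0.10))   # -> ('habitual', 0.9)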
While the structure of planning has been studied most directly in
animal conditioning, our theory is relevant to a number of other
areas. Theorists have previously invoked similar two-controller
accounts to explain issues such as drug abuse and self-control as
resulting from a competition between ``rational'' and ``emotional''
systems. Our theory suggests a very different view, in which two
``rational'' controllers employ different computational strategies in
the pursuit of optimality. The theory also has bearing on multiplayer
games and economic interactions, where subjects' behaviour can vary
systematically depending on whether they model or cache their
opponents' countermoves.