The Effects of Uncertainty on TD Learning


Yael Niv, Michael O. Duff and Peter Dayan


Substantial evidence suggests that the phasic activity of dopamine (DA) neurons in the primate midbrain represents a temporal difference (TD) error in predictions of future reward. TD offers a computationally compelling account of the role of DA in appetitive classical and instrumental conditioning, and a precise, parsimonious theory of the generation of DA firing patterns. However, recent DA recordings from an experiment involving inherently stochastic reward delivery (Fiorillo et al., Science, 2003) show activity ramping up towards the time of the reward, with ramp height related to the degree of reward uncertainty. Prima facie, these data present a crucial challenge, as both the classical and instrumental facets of the TD account would require such reliable ramping activity to be predicted away by earlier stimuli.
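For concreteness, the TD error that the phasic DA response is taken to report can be written in the standard form (standard TD notation, not specific to this study):

```latex
\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)
```

where $r_t$ is the reward delivered at time $t$, $V(s)$ is the learned prediction of future reward from state $s$, and $\gamma \le 1$ is a discount factor. A fully predicted reward yields $\delta_t = 0$, which is why reliable pre-reward activity would be expected to be predicted away.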

We use analysis and simulations to show that the apparently anomalous ramps are in fact to be expected under a standard TD account if, as suggested by the low baseline firing rates of DA cells, positive and negative prediction errors are differentially scaled. Using a simple tapped-delay-line representation of the time between stimulus and reward (as commonly adopted in TD models), together with a fixed learning rate, a ramp in the DA activity emerges just as in the experimental data.
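A minimal sketch of this simulation, under our own illustrative assumptions: a tapped-delay-line state representation, a fixed learning rate, a 50%-rewarded trial type (maximal uncertainty), and negative errors compressed by a factor D < 1 in the represented, DA-like signal (reflecting the low baseline firing rate) while learning itself uses the raw error. The particular parameter values are hypothetical, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)

T = 10         # delay-line steps from stimulus onset to reward time
P = 0.5        # reward probability (maximal uncertainty)
ALPHA = 0.1    # fixed learning rate (assumed value)
D = 1.0 / 6.0  # assumed scaling of negative errors in the DA-like signal

V = np.zeros(T + 1)            # values per delay-line state; V[T] is terminal
n_trials = 6000
da = np.zeros((n_trials, T))   # asymmetrically scaled TD errors per trial

for trial in range(n_trials):
    rewarded = rng.random() < P
    for t in range(T):
        r = float(rewarded) if t == T - 1 else 0.0
        delta = r + V[t + 1] - V[t]             # standard TD error
        da[trial, t] = delta if delta >= 0 else D * delta
        V[t] += ALPHA * delta                   # learning uses the raw error

# Averaging the represented errors across trials, as a PSTH does,
# yields activity that ramps up towards the reward time, peaking
# near (1 - D) * P * (1 - P).
psth = da[n_trials // 2:].mean(axis=0)          # discard a burn-in period
```

The ramp here is purely an averaging artifact: on any single trial the errors fluctuate around zero, but rectifying negatives by D < 1 before averaging leaves a positive residue that grows towards the reward, where the fluctuations are largest.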

Analytically deriving the average response at the time of the reward from the TD learning rule, we show that the height of the modelled ramps is indeed proportional to the variance of the rewards, in accordance with the data. There is, however, a key difference between the uncertainty and TD accounts of the ramps. According to the former, ramps are within-trial phenomena that code uncertainty; by contrast, the latter suggests they arise only through averaging across multiple trials. Under the TD account, the non-stationarity engendered by constant learning from errors makes peri-stimulus time histogram (PSTH) traces potentially misleading, as they average over different trial histories.
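A sketch of the calculation at the time of a Bernoulli reward of probability $p$, assuming the learned value has converged on average to $V \approx p$ and negative errors are represented scaled by $d < 1$:

```latex
\mathbb{E}\!\left[\delta^{\mathrm{rep}}\right]
  = p\,(1 - p) \;+\; (1 - p)\,d\,(0 - p)
  = (1 - d)\,p\,(1 - p)
```

This is proportional to $p(1-p)$, the variance of the Bernoulli reward, and is therefore maximal at $p = 1/2$, matching the dependence of ramp height on reward uncertainty in the data.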

Our study suggests that uncertainty need not play any explicit part in determining these aspects of DA activity, a conclusion also consistent with various other suggestive results from Fiorillo et al. and others. We also study the effects on TD learning of other sources of representational and learning noise.