Distributional Code in Dopamine-based Reinforcement Learning

 

Nature 577, pages 671–675 (2020)

A Distributional Code for Value in Dopamine-based Reinforcement Learning

Will Dabney et al.

DeepMind, London, UK

Max Planck UCL Centre for Computational Psychiatry and Ageing Research, University College London, London, UK

Center for Brain Science, Department of Molecular and Cellular Biology, Harvard University, Cambridge, MA, USA

Gatsby Computational Neuroscience Unit, University College London, London, UK

[paraphrase]

Since its introduction, the reward prediction error theory of dopamine has explained a wealth of empirical phenomena, providing a unifying framework for understanding the representation of reward and value in the brain. According to the now canonical theory, reward predictions are represented as a single scalar quantity, which supports learning about the expectation, or mean, of stochastic outcomes. Here we propose an account of dopamine-based reinforcement learning inspired by recent artificial intelligence research on distributional reinforcement learning. We hypothesized that the brain represents possible future rewards not as a single mean, but instead as a probability distribution, effectively representing multiple future outcomes simultaneously and in parallel. This idea implies a set of empirical predictions, which we tested using single-unit recordings from mouse ventral tegmental area. Our findings provide strong evidence for a neural realization of distributional reinforcement learning.

The reward prediction error (RPE) theory of dopamine derives from work in the artificial intelligence (AI) field of reinforcement learning (RL). Since the link to neuroscience was first made, however, RL has made substantial advances, revealing factors that greatly enhance the effectiveness of RL algorithms. In some cases, the relevant mechanisms invite comparison with neural function, suggesting hypotheses concerning reward-based learning in the brain. Here we examine a promising recent development in AI research and investigate its potential neural correlates. Specifically, we consider a computational framework referred to as distributional reinforcement learning.

Similar to the traditional form of temporal-difference RL—on which the dopamine theory was based—distributional RL assumes that reward-based learning is driven by an RPE, which signals the difference between received and anticipated reward. (For simplicity, we introduce the theory in terms of a single-step transition model, but the same principles hold for the general multi-step (discounted return) case.) The key difference in distributional RL lies in how ‘anticipated reward’ is defined. In traditional RL, the reward prediction is represented as a single quantity: the average over all potential reward outcomes, weighted by their respective probabilities. By contrast, distributional RL uses a multiplicity of predictions. These predictions vary in their degree of optimism about upcoming reward. More optimistic predictions anticipate obtaining greater future rewards; less optimistic predictions anticipate more meager outcomes. Together, the entire range of predictions captures the full probability distribution over future rewards.
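To make this concrete, the following is a minimal sketch of one common formulation of distributional TD learning, in which a population of value predictors acquires different levels of optimism through asymmetric scaling of positive versus negative prediction errors. All names and parameter values (n_channels, taus, base_lr, the example reward magnitudes) are illustrative choices, not taken from the paper.

    import numpy as np

    # Minimal sketch of distributional TD learning with expectile-style
    # asymmetric updates; all names and parameters are illustrative.
    rng = np.random.default_rng(0)

    n_channels = 20
    # Asymmetry tau_i in (0, 1): the relative weight a channel gives to positive RPEs.
    taus = (np.arange(n_channels) + 0.5) / n_channels
    values = np.zeros(n_channels)      # one reward prediction per channel
    base_lr = 0.05

    def update(values, reward):
        """One single-step update: each channel scales its RPE asymmetrically."""
        rpe = reward - values                        # per-channel prediction error
        gain = np.where(rpe > 0, taus, 1.0 - taus)   # optimistic channels weight gains more
        return values + base_lr * gain * rpe

    # Learn about a cue that predicts one of several reward magnitudes at random.
    reward_magnitudes = np.array([0.1, 0.3, 1.2, 2.5, 5.0])
    for _ in range(20000):
        values = update(values, rng.choice(reward_magnitudes))

    # Channels with tau near 1 settle on optimistic predictions, those with tau near 0
    # on pessimistic ones; jointly the learned values span the reward distribution.
    print(np.round(values, 2))

Under these assumptions, each channel's steady-state prediction is an expectile-like statistic of the reward distribution, which is why the population as a whole encodes the distribution rather than only its mean.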

Compared with traditional RL procedures, distributional RL can increase performance in deep learning systems by a factor of two or more, an effect that stems in part from an enhancement of representation learning. This prompts the question of whether RL in the brain might leverage the benefits of distributional coding. This question is encouraged both by the fact that the brain utilizes distributional codes in numerous other domains, and by the fact that the mechanism of distributional RL is biologically plausible. Here we tested several predictions of distributional RL using single-unit recordings in the ventral tegmental area (VTA) of mice performing tasks with probabilistic rewards.

In contrast to classical temporal-difference (TD) learning, distributional RL posits a diverse set of RPE channels, each of which carries a different value prediction, with varying degrees of optimism across channels. (Value is formally defined in RL as the mean of future outcomes, but here we relax this definition to include predictions about future outcomes that are not necessarily the mean.) These value predictions in turn provide the reference points for different RPE signals, causing the latter to also differ in terms of optimism. As a surprising consequence, a single reward outcome can simultaneously elicit positive RPEs (within relatively pessimistic channels) and negative RPEs (within more optimistic ones).
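A toy numerical illustration of this consequence, using made-up prediction values rather than any data from the paper:

    import numpy as np

    # Three hypothetical RPE channels (pessimistic, intermediate, optimistic)
    # that have learned different reward predictions, all observing the same outcome.
    predictions = np.array([0.5, 2.0, 4.0])
    reward = 1.5
    rpes = reward - predictions
    print(rpes)   # [ 1.  -0.5 -2.5]: positive in the pessimistic channel, negative in the others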

This translates immediately into a neuroscientific prediction, which is that dopamine neurons should display such diversity in ‘optimism’. Suppose an agent has learned that a cue predicts a reward whose magnitude will be drawn from a probability distribution. In the standard RL theory, receiving a reward with magnitude below the mean of this distribution will elicit a negative RPE, whereas larger magnitudes will elicit positive RPEs. The reversal point—the magnitude at which prediction errors transition from negative to positive—in standard RL is the expectation of the magnitude’s distribution. By contrast, in distributional RL, the reversal point differs across dopamine neurons according to their degree of optimism.
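The sketch below illustrates how such reversal points could be estimated from simulated channels. The asymmetries, learned predictions, and helper functions (simulated_response, reversal_point) are purely illustrative assumptions, not the paper's analysis pipeline.

    import numpy as np

    # Illustrative reversal-point simulation (assumed asymmetries and predictions).
    magnitudes = np.linspace(0.0, 12.0, 121)     # candidate reward magnitudes
    taus = np.array([0.2, 0.5, 0.8])             # per-channel asymmetry (optimism)
    predictions = np.array([0.8, 2.5, 8.0])      # assumed learned value for each channel

    def simulated_response(value, tau, m):
        """RPE-like response: positive errors scaled by tau, negative by (1 - tau)."""
        err = m - value
        return np.where(err > 0, tau, 1.0 - tau) * err

    def reversal_point(responses, magnitudes):
        """Magnitude at which the (monotonically increasing) response crosses zero."""
        return float(np.interp(0.0, responses, magnitudes))

    for tau, value in zip(taus, predictions):
        r = simulated_response(value, tau, magnitudes)
        print(f"tau={tau:.1f}  reversal point ~ {reversal_point(r, magnitudes):.2f}")

    # A classical TD neuron would reverse at the mean of the reward distribution;
    # here each channel reverses at its own prediction, so reversal points fan out
    # across the population in order of optimism.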

Distributional RL offers a range of untested predictions. Dopamine neurons should maintain their ordering of relative optimism across task contexts, even as the specific distribution of rewards changes. If RPE channels with particular levels of optimism are selectively activated with optogenetics, this should sculpt the learned distribution, which should in turn be detectable with behavioural measures of sensitivity to moments of the distribution.

Distributional RL also gives rise to a number of broader questions. What are the circuit- or cellular-level mechanisms that give rise to a diversity of asymmetry in positive versus negative RPE scaling? It is also worth considering whether other mechanisms, aside from asymmetric scaling of RPEs, might contribute to distributional coding. It is well established, for example, that positive and negative RPEs differentially engage striatal D1 and D2 dopamine receptors, and that the balance of these receptors varies anatomically. This suggests a second potential mechanism for differential learning from positive versus negative RPEs. Moreover, how do different RPE channels anatomically couple with their corresponding reward predictions? Finally, what effects might distributional coding have downstream, at the level of action learning and selection? With this question in mind, it is notable that some current theories in behavioural economics centre on risk measures that can be easily read out from the kind of distributional codes that the present work has considered.

Finally, we speculate on the implications of the distributional hypothesis of dopamine for the mechanisms of mental disorders such as addiction and depression. Mood has been linked with predictions of future reward, and it has been proposed that both depression and bipolar disorder may involve biased forecasts concerning value-laden outcomes. It has recently been proposed that such biases may arise from asymmetries in RPE coding. There are clear potential connections between these ideas and the phenomena we have reported here, presenting opportunities for further research.

[paraphrase]