Spike-based reinforcement learning in continuous state and action space: when policy gradient methods fail

Bibliographic Details
Published in	PLoS computational biology Vol. 5; no. 12; p. e1000586
Main Authors	Vasilaki, Eleni; Frémaux, Nicolas; Urbanczik, Robert; Senn, Walter; Gerstner, Wulfram
Format	Journal Article
Language	English
Published	United States: Public Library of Science (PLoS), 01.12.2009
Subjects	Action potentials (Electrophysiology); Algorithms; Animals; Behavior; Computational Biology; Computational Biology - methods; Computer Simulation; Decomposition; Experiments; Maze Learning; Maze Learning - physiology; Methods; Models, Neurological; Neurons; Neurons - physiology; Neuroscience/Theoretical Neuroscience; Physiological aspects; Probability; Random variables; Rats; Reward; Signal Transduction; Studies; Synaptic Potentials; Synaptic Potentials - physiology
Summary:	Changes of synaptic connections between neurons are thought to be the physiological basis of learning. These changes can be gated by neuromodulators that encode the presence of reward. We study a family of reward-modulated synaptic learning rules for spiking neurons on a learning task in continuous space inspired by the Morris water maze. The synaptic update rule modifies the release probability of synaptic transmission and depends on the timing of presynaptic spike arrival, postsynaptic action potentials, as well as the membrane potential of the postsynaptic neuron. The family of learning rules includes an optimal rule derived from policy gradient methods as well as reward-modulated Hebbian learning. The synaptic update rule is implemented in a population of spiking neurons using a network architecture that combines feedforward input with lateral connections. Actions are represented by a population of hypothetical action cells with strong Mexican-hat connectivity and are read out at theta frequency. We show that in this architecture, a standard policy gradient rule fails to solve the Morris water maze task, whereas a variant with a Hebbian bias can learn the task within 20 trials, consistent with experiments. This result does not depend on implementation details such as the size of the neuronal populations. Our theoretical approach shows how learning new behaviors can be linked to reward-modulated plasticity at the level of single synapses and makes predictions about the voltage and spike-timing dependence of synaptic plasticity and the influence of neuromodulators such as dopamine. It is an important step towards connecting formal theories of reinforcement learning with neuronal and synaptic properties.
Bibliography:	Conceived and designed the experiments: EV WG. Performed the experiments: EV. Analyzed the data: EV. Wrote the paper: WG. Partially wrote the paper: EV. Participated in discussions: NF RU WS. Partially wrote the Methods section: RU WS.
ISSN:	1553-734X 1553-7358
DOI:	10.1371/journal.pcbi.1000586
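
Illustrative note: the Summary mentions a population of action cells with Mexican-hat lateral connectivity whose activity is read out as an action. The following is a minimal sketch of that idea, not the authors' implementation. The function names, the Gaussian-minus-constant weight profile, all parameter values, and the population-vector decoding are assumptions introduced here for illustration only.

```python
import math


def preferred_directions(n):
    """Evenly spaced preferred movement directions (radians) for n action cells."""
    return [2.0 * math.pi * k / n for k in range(n)]


def mexican_hat_weights(n, w_exc=1.0, w_inh=0.5, sigma=0.5):
    """Lateral weights between action cells (hypothetical profile).

    Cells with similar preferred directions excite each other (Gaussian in
    angular distance); a constant inhibition makes distant cells inhibit
    each other. Self-connections are set to zero.
    """
    theta = preferred_directions(n)
    weights = []
    for i in range(n):
        row = []
        for j in range(n):
            if i == j:
                row.append(0.0)
                continue
            # Signed angular difference wrapped into (-pi, pi].
            diff = theta[i] - theta[j]
            d = math.atan2(math.sin(diff), math.cos(diff))
            row.append(w_exc * math.exp(-d * d / (2.0 * sigma * sigma)) - w_inh)
        weights.append(row)
    return weights


def population_vector(rates):
    """Decode a movement direction as the angle of the rate-weighted
    sum of unit vectors along each cell's preferred direction."""
    theta = preferred_directions(len(rates))
    x = sum(r * math.cos(t) for r, t in zip(rates, theta))
    y = sum(r * math.sin(t) for r, t in zip(rates, theta))
    return math.atan2(y, x)


if __name__ == "__main__":
    n_cells = 36  # assumed population size, chosen only for the example
    w = mexican_hat_weights(n_cells)
    print("neighbour weight:", w[0][1])
    print("opposite weight:", w[0][n_cells // 2])

    # A single active cell at index 9 has preferred direction pi/2.
    rates = [0.0] * n_cells
    rates[9] = 1.0
    print("decoded direction:", population_vector(rates))
```

With these assumed parameters, neighbouring cells are coupled by positive weights and cells with opposite preferred directions by negative weights, which is the qualitative shape the term "Mexican-hat" refers to. The learning rule itself (policy gradient versus Hebbian bias) is not modelled in this sketch.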