Analysis of Thompson Sampling for Partially Observable Contextual Multi-Armed Bandits

Bibliographic Details
Published in	IEEE Control Systems Letters, Vol. 6, pp. 2150-2155
Main Authors	Park, Hongju, Shirani Faradonbeh, Mohamad Kazem
Format	Journal Article
Language	English
Published	IEEE 2022
Subjects	Adaptive control; Analytical models; Approximation algorithms; Context modeling; Covariance matrices; Iterative learning control; Noise measurement; Reinforcement learning; Statistical learning; Uncertainty

Summary:	Contextual multi-armed bandits are classical models in reinforcement learning for sequential decision-making subject to individual information. A widely used policy for bandits is Thompson Sampling, where samples from a data-driven probabilistic belief about unknown parameters are used to select the control actions. For this computationally fast algorithm, performance analyses are available under full context-observations. However, little is known about problems in which contexts are not fully observed. We propose a Thompson Sampling algorithm for partially observable contextual multi-armed bandits, and establish theoretical performance guarantees. Technically, we show that the regret of the presented policy scales logarithmically with time and the number of arms, and linearly with the dimension. Further, we establish rates of learning unknown parameters, and provide illustrative numerical analyses.
ISSN:	2475-1456
DOI:	10.1109/LCSYS.2021.3137269
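
Illustrative sketch: the summary describes Thompson Sampling as drawing samples from a data-driven belief about unknown parameters and acting on those samples while contexts are only partially observed. The Python below is a minimal sketch of that idea, not the authors' algorithm. It assumes scalar contexts, one unknown scalar parameter per arm, a Gaussian belief, and a noisy linear sensor; the function name `thompson_sampling_scalar` and the values `sensing_gain`, `obs_noise_sd`, and `reward_noise_sd` are hypothetical and do not come from the record.

```python
import random


def thompson_sampling_scalar(num_arms=3, horizon=2000, seed=0):
    """Toy Thompson Sampling with noisy (partially observed) scalar contexts."""
    rng = random.Random(seed)

    # Hypothetical ground truth; the learner never reads these directly.
    true_mu = [rng.uniform(-1.0, 1.0) for _ in range(num_arms)]
    sensing_gain = 0.8      # assumed: observation = gain * context + noise
    obs_noise_sd = 0.5      # assumed observation noise level
    reward_noise_sd = 0.1   # assumed reward noise level

    # Gaussian belief per arm over the coefficient linking the noisy
    # observation to the reward (prior precision 1, prior mean 0).
    precision = [1.0] * num_arms
    weighted_sum = [0.0] * num_arms
    pulls = [0] * num_arms

    for _ in range(horizon):
        # True contexts are hidden; only noisy observations are available.
        contexts = [rng.gauss(0.0, 1.0) for _ in range(num_arms)]
        observations = [
            sensing_gain * x + rng.gauss(0.0, obs_noise_sd) for x in contexts
        ]

        # Draw one sample per arm from the current belief.
        sampled = [
            rng.gauss(weighted_sum[i] / precision[i], precision[i] ** -0.5)
            for i in range(num_arms)
        ]

        # Act greedily with respect to the sampled parameters.
        arm = max(range(num_arms), key=lambda i: observations[i] * sampled[i])

        # The reward depends on the hidden context of the chosen arm.
        reward = true_mu[arm] * contexts[arm] + rng.gauss(0.0, reward_noise_sd)

        # Update the belief of the chosen arm only (regularized least squares).
        precision[arm] += observations[arm] ** 2
        weighted_sum[arm] += observations[arm] * reward
        pulls[arm] += 1

    estimates = [weighted_sum[i] / precision[i] for i in range(num_arms)]
    return true_mu, estimates, pulls
```

In this sketch the learner regresses rewards on the noisy observations, so each estimate targets a rescaled version of the arm's parameter rather than the parameter itself. Calling `thompson_sampling_scalar(horizon=500, seed=1)` returns the hidden parameters, the learned estimates, and the pull counts for inspection.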