Cutting to the chase with warm-start contextual bandits

Bibliographic Details
Published in: Knowledge and Information Systems, Vol. 65, No. 9, pp. 3533–3565
Main Authors: Oetomo, Bastian; Perera, R. Malinga; Borovica-Gajic, Renata; Rubinstein, Benjamin I. P.
Format: Journal Article
Language: English
Published: London: Springer London, 01.09.2023 (Springer Nature B.V.)

Summary: Multi-armed bandits achieve excellent long-term performance in practice and sublinear cumulative regret in theory. However, a real-world limitation of bandit learning is poor performance in early rounds due to the need for exploration—a phenomenon known as the cold-start problem. While this limitation may be necessary in the general classical stochastic setting, in practice where “pre-training” data or knowledge is available, it is natural to attempt to “warm-start” bandit learners. This paper provides a theoretical treatment of warm-start contextual bandit learning, adopting Linear Thompson Sampling as a principled framework for flexibly transferring domain knowledge as might be captured by bandit learning in a prior related task, a supervised pre-trained Bayesian posterior, or domain expert knowledge. Under standard conditions, we prove a general regret bound. We then apply our warm-start algorithmic technique to other common bandit learners—the ϵ-greedy and upper-confidence bound contextual learners. An upper regret bound is then provided for LinUCB. Our suite of warm-start learners is evaluated in experiments with both artificial and real-world datasets, including a motivating task of tuning a commercial database. A comprehensive range of experimental results is presented, highlighting the effect of different hyperparameters and quantities of pre-training data.
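
The summary describes warm-starting Linear Thompson Sampling by transferring domain knowledge through the prior, for example a pre-trained Bayesian posterior. The sketch below is not taken from the paper; it is a minimal illustration, under standard Bayesian linear-regression assumptions with unit noise variance, of how a prior mean and covariance could seed a LinTS-style learner. The class name WarmStartLinTS, the exploration scale v, and the regulariser lam are hypothetical names introduced here for illustration.

    import numpy as np

    class WarmStartLinTS:
        """Linear Thompson Sampling with an optional warm-start prior.

        Cold start uses the standard N(0, I/lam) prior; a warm start replaces it
        with a pre-trained Gaussian posterior N(mu0, Sigma0), e.g. carried over
        from supervised pre-training or an earlier bandit run on a related task.
        """

        def __init__(self, dim, v=1.0, mu0=None, Sigma0=None, lam=1.0):
            self.v = v  # scales posterior covariance when sampling (exploration level)
            if Sigma0 is None:
                self.B = lam * np.eye(dim)       # cold start: isotropic prior precision
            else:
                self.B = np.linalg.inv(Sigma0)   # warm start: precision of transferred posterior
            mu0 = np.zeros(dim) if mu0 is None else np.asarray(mu0, dtype=float)
            self.b = self.B @ mu0                # so the initial mean B^{-1} b equals mu0

        def choose(self, arm_features):
            """arm_features: (n_arms, dim) array of contexts; returns the chosen arm index."""
            Sigma = np.linalg.inv(self.B)
            mu = Sigma @ self.b
            theta = np.random.multivariate_normal(mu, self.v ** 2 * Sigma)
            return int(np.argmax(np.asarray(arm_features) @ theta))

        def update(self, x, reward):
            """Rank-one Bayesian update with the observed context/reward pair."""
            x = np.asarray(x, dtype=float)
            self.B = self.B + np.outer(x, x)
            self.b = self.b + reward * x

    # Usage: cold start versus a (hypothetical) transferred posterior.
    cold = WarmStartLinTS(dim=3)
    warm = WarmStartLinTS(dim=3, mu0=np.array([0.2, -0.1, 0.5]), Sigma0=0.1 * np.eye(3))
    arms = np.random.randn(5, 3)
    a = warm.choose(arms)
    warm.update(arms[a], reward=1.0)

If the transferred prior concentrates near the new task's true parameter, the sampled models vary less in early rounds and fewer exploratory pulls are wasted, which is the mechanism by which a warm start can mitigate cold-start regret; the paper's bounds and experiments quantify this for LinTS, ϵ-greedy, and LinUCB variants.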
ISSN: 0219-1377
eISSN: 0219-3116
DOI: 10.1007/s10115-023-01861-2