MULTI-ARMED BANDITS UNDER GENERAL DEPRECIATION AND COMMITMENT

Bibliographic Details
Published in: Probability in the Engineering and Informational Sciences, Vol. 29, No. 1, pp. 51–76
Main Authors: Cowan, Wesley; Katehakis, Michael N.
Format: Journal Article
Language: English
Published: New York, USA: Cambridge University Press, 01.01.2015

Summary: Generally, the multi-armed bandit problem has been studied under the setting that, at each time step over an infinite horizon, a controller chooses to activate a single process or bandit out of a finite collection of independent processes (statistical experiments, populations, etc.) for a single period, receiving a reward that is a function of the activated process and thereby advancing the chosen process. Classically, rewards are discounted by a constant factor β ∈ (0, 1) per round. In this paper, we present a solution to the problem, with potentially non-Markovian, uncountable state space reward processes, under a framework in which, first, the discount factors may be non-uniform and vary over time, and second, the periods of activation of each bandit may not be fixed or uniform, being subject instead to a possibly stochastic duration of activation before a change to a different bandit is allowed. The solution is based on generalized restart-in-state indices, and it utilizes a view of the problem not as "decisions over state space" but rather as "decisions over time".
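As an orientation to the index approach the abstract refers to, the following is a minimal sketch of the classical restart-in-state index for a single finite-state Markov bandit under a constant discount factor β, i.e., the uniform-discount special case that the paper generalizes; the function name, the transition matrix P, the reward vector r, and the example data are illustrative assumptions, not the authors' notation or their general (non-uniform discount, stochastic activation-duration) construction.

```python
# Sketch: restart-in-state index for one finite Markov bandit, constant discount.
# The index of state x is obtained from the MDP in which, at every step, one may
# either continue from the current state or restart the chain in state x.

import numpy as np

def restart_in_state_index(P, r, beta, x, iters=10_000, tol=1e-10):
    """Index of state x via value iteration on the restart-in-x MDP.

    P    : (n, n) transition matrix of the bandit's Markov chain
    r    : (n,) one-step rewards
    beta : constant discount factor in (0, 1)
    x    : state whose index is computed
    """
    n = len(r)
    V = np.zeros(n)
    for _ in range(iters):
        continue_val = r + beta * P @ V        # keep playing from each state
        restart_val = r[x] + beta * P[x] @ V   # restart the chain in state x
        V_new = np.maximum(continue_val, restart_val)
        if np.max(np.abs(V_new - V)) < tol:
            V = V_new
            break
        V = V_new
    # Normalizing by (1 - beta) expresses the index as an equivalent
    # constant per-period reward rate.
    return (1 - beta) * V[x]

# Hypothetical two-state example with beta = 0.9.
P = np.array([[0.7, 0.3],
              [0.4, 0.6]])
r = np.array([1.0, 0.2])
print(restart_in_state_index(P, r, beta=0.9, x=0))
```

In the classical setting, activating at each step the bandit whose current state has the largest such index is optimal; the paper's contribution is extending index-based optimality to time-varying discount factors and stochastic commitment periods by treating the problem as decisions over time rather than over state space.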
ISSN: 0269-9648, 1469-8951
DOI: 10.1017/S0269964814000217