MULTI-ARMED BANDITS UNDER GENERAL DEPRECIATION AND COMMITMENT
| Published in | Probability in the Engineering and Informational Sciences, Vol. 29, No. 1, pp. 51–76 |
| --- | --- |
| Main authors | Cowan, Wesley; Katehakis, Michael N. |
| Format | Journal article |
| Language | English |
| Published | New York, USA: Cambridge University Press, 01.01.2015 |
Summary: Generally, the multi-armed bandit problem has been studied under the setting in which, at each time step over an infinite horizon, a controller chooses to activate a single process, or bandit, out of a finite collection of independent processes (statistical experiments, populations, etc.) for a single period, receiving a reward that is a function of the activated process and, in doing so, advancing the chosen process. Classically, rewards are discounted by a constant factor β ∈ (0, 1) per round. In this paper, we present a solution to the problem, with potentially non-Markovian, uncountable-state-space reward processes, under a framework in which, first, the discount factors may be non-uniform and vary over time, and second, the periods of activation of each bandit may not be fixed or uniform, being subject instead to a possibly stochastic duration of activation before a change to a different bandit is allowed. The solution is based on generalized restart-in-state indices, and it utilizes a view of the problem not as "decisions over state space" but rather as "decisions over time."
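The summary's central objects can be made concrete in the classical constant-discount setting that the paper generalizes. The following is a minimal sketch, not taken from the paper itself: it states the standard discounted objective and the well-known restart-in-state characterization of the Gittins index due to Katehakis and Veinott, with illustrative notation (P, r, β, ν assumed here, not drawn from the article).

```latex
% Classical discounted objective: at each round t the policy \pi activates one
% bandit \pi(t); the activated process advances while all others stay frozen.
V(\pi) = \mathbb{E}\left[\sum_{t=0}^{\infty} \beta^{t}\, R_{\pi(t)}\!\left(X_{\pi(t)}(t)\right)\right],
\qquad \beta \in (0,1).

% Restart-in-state characterization: the index of state x is, up to the factor
% (1-\beta), the value of an auxiliary problem in which the controller may, at
% every step, either continue from the current state y or restart at x.
V_{x}(y) = \max\left\{ r(y) + \beta \sum_{z} P(y,z)\, V_{x}(z),\;
                       r(x) + \beta \sum_{z} P(x,z)\, V_{x}(z) \right\},
\qquad \nu(x) = (1-\beta)\, V_{x}(x).
```

A restart index of this form can be computed by value iteration; below is a small, self-contained sketch for a finite Markov bandit (the function name and toy chain are illustrative, not from the paper). The paper's contribution, by contrast, is to extend such indices to non-uniform, time-varying discounting and stochastic commitment periods.

```python
import numpy as np

def restart_in_state_index(P, r, beta, x, tol=1e-10, max_iter=100_000):
    """Gittins index of state x for a finite Markov bandit (P, r) with constant
    discount beta, computed by value iteration on the restart-in-x problem.
    Illustrative only: the paper generalizes beyond this classical setting."""
    V = np.zeros(len(r))
    for _ in range(max_iter):
        cont = r + beta * (P @ V)            # keep advancing from each state y
        restart = r[x] + beta * (P[x] @ V)   # or restart the process at state x
        V_new = np.maximum(cont, restart)
        if np.max(np.abs(V_new - V)) < tol:
            V = V_new
            break
        V = V_new
    return (1.0 - beta) * V[x]

# Toy two-state chain: state 1 pays 1 per activation but drifts back to state 0.
P = np.array([[0.7, 0.3],
              [0.4, 0.6]])
r = np.array([0.0, 1.0])
print([restart_in_state_index(P, r, beta=0.9, x=s) for s in range(2)])
```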
| ISSN | 0269-9648; 1469-8951 |
| --- | --- |
| DOI | 10.1017/S0269964814000217 |