Integrating Online Learning and Adaptive Control in Queueing Systems with Uncertain Payoffs

Bibliographic Details
Published in: 2018 Information Theory and Applications Workshop (ITA), pp. 1-9
Main Authors: Wei-Kang Hsu, Jiaming Xu, Xiaojun Lin, Mark R. Bell
Format: Conference Proceeding
Language: English
Published: IEEE, 01.02.2018
Summary: We study task assignment in online service platforms where unlabeled clients arrive according to a stochastic process and each client brings a random number of tasks. As tasks are assigned to servers, they produce client/server-dependent random payoffs. The goal of the system operator is to maximize the expected payoff per unit time subject to the servers' capacity constraints. However, both the statistics of the dynamic client population and the client-specific payoff vectors are unknown to the operator. Thus, the operator must design task-assignment policies that integrate adaptive control (of the queueing system) with online learning (of the clients' payoff vectors). A key challenge in such integration is how to account for the nontrivial closed-loop interactions between the queueing process and the learning process, which may significantly degrade system performance. We propose a new utility-guided online learning and task assignment algorithm that seamlessly integrates learning with control to address this difficulty. Our analysis shows that, compared to an oracle that knows all client dynamics and payoff vectors beforehand, the gap in the expected payoff per unit time of our proposed algorithm over a finite horizon $T$ is bounded by $\beta_1/V + \beta_2\sqrt{\log N/N} + \beta_3 N(V+1)/T$, where $V$ is a tuning parameter of the algorithm and $\beta_1, \beta_2, \beta_3$ depend only on the arrival/service rates and the number of client classes/servers. Through simulations, we show that our proposed algorithm significantly outperforms a myopic matching policy and a standard queue-length-based policy that does not explicitly address the closed-loop interactions between queueing and learning.
DOI: 10.1109/ITA.2018.8503124
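
Note on the bound: the three terms trade off against one another. The control penalty $\beta_1/V$ shrinks as $V$ grows, while the transient term $\beta_3 N(V+1)/T$ grows with $V$, so for a fixed horizon the best $V$ balances the two (minimizing $\beta_1/V + \beta_3 NV/T$ over $V$ gives $V = \sqrt{\beta_1 T/(\beta_3 N)}$). To make the abstract's ingredients concrete — payoff estimates learned online, a queueing component that tracks server capacity, and the payoff-vs.-congestion weight $V$ — the sketch below is a minimal, hypothetical Python rendering of a drift-plus-penalty-style assignment rule with UCB payoff estimates for a single client class. It is not the authors' algorithm; all names, rates, and the specific UCB/virtual-queue combination are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical instance: one client class, three servers (all values illustrative).
num_servers = 3
mu = np.array([0.4, 0.4, 0.4])               # per-slot service capacity of each server
true_payoff = rng.uniform(size=num_servers)  # unknown mean payoff per server, to be learned
V = 20.0                                     # tuning parameter: payoff weight vs. congestion

Q = np.zeros(num_servers)    # virtual queues tracking cumulative capacity violation
est = np.zeros(num_servers)  # empirical mean payoff observed at each server
cnt = np.zeros(num_servers)  # number of tasks assigned to each server

T = 10_000
total = 0.0
for t in range(1, T + 1):
    # Optimistic (UCB-style) payoff estimates; untried servers get priority.
    bonus = np.sqrt(2.0 * np.log(t) / np.maximum(cnt, 1.0))
    ucb = np.where(cnt == 0, np.inf, np.minimum(est + bonus, 1.0))

    # Drift-plus-penalty rule: weigh learned payoff (scaled by V) against congestion Q.
    s = int(np.argmax(V * ucb - Q))

    reward = float(rng.random() < true_payoff[s])  # Bernoulli payoff realization
    total += reward
    cnt[s] += 1
    est[s] += (reward - est[s]) / cnt[s]           # running-mean update

    # Virtual-queue update: one task arrives at the chosen server, capacity mu drains.
    arrivals = np.zeros(num_servers)
    arrivals[s] = 1.0
    Q = np.maximum(Q + arrivals - mu, 0.0)

print(f"average payoff: {total / T:.3f} (best single-server mean: {true_payoff.max():.3f})")

With one task arriving per slot and each server able to serve only 0.4 tasks per slot on average, routing everything to the apparently best server would make its virtual queue grow without bound; the Q term then pushes assignments toward the next-best servers. This coupling between the learned estimates and the resulting queue state is exactly the kind of closed-loop interaction between queueing and learning that the paper studies.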