TY - JOUR
T1 - Cost-Aware Cascading Bandits
AU - Gan, Chao
AU - Zhou, Ruida
AU - Yang, Jing
AU - Shen, Cong
N1 - Funding Information:
Manuscript received July 18, 2019; revised March 31, 2020 and May 19, 2020; accepted May 29, 2020. Date of publication June 10, 2020; date of current version June 26, 2020. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Caroline Chaux. This work was supported in part by the US National Science Foundation under Grant ECCS-1650299. It was presented in part at the 27th International Joint Conference on Artificial Intelligence, Stockholm, Sweden, 2018 [1]. (Chao Gan and Ruida Zhou contributed equally to this work.) (Corresponding author: Chao Gan.) Chao Gan and Jing Yang are with the School of Electrical Engineering and Computer Science, The Pennsylvania State University, University Park, PA 16801 USA (e-mail: cug203@psu.edu; yangjing@psu.edu).
Publisher Copyright:
© 1991-2012 IEEE.
PY - 2020
Y1 - 2020
N2 - In this paper, we propose a cost-aware cascading bandits model, a new variant of multi-armed bandits with cascading feedback that accounts for the random cost of pulling arms. In each step, the learning agent chooses an ordered list of items and examines them sequentially until a certain stopping condition is satisfied. Our objective is to maximize the expected net reward in each step, i.e., the reward obtained in each step minus the total cost incurred in examining the items, by deciding the ordered list of items as well as when to stop the examination. We first consider the setting where the instantaneous cost of pulling an arm is unknown to the learner until it has been pulled. We study both the offline and online settings, depending on whether the state and cost statistics of the items are known beforehand. For the offline setting, we show that the Unit Cost Ranking with Threshold 1 (UCR-T1) policy is optimal. For the online setting, we propose a Cost-aware Cascading Upper Confidence Bound (CC-UCB) algorithm and show that the cumulative regret scales as O(\log T). We also provide a lower bound for all \alpha-consistent policies, which scales as \Omega(\log T) and matches our upper bound. We then investigate the setting where the instantaneous cost of pulling each arm is available to the learner for its decision-making, and show that a slight modification of the CC-UCB algorithm, termed CC-UCB2, is order-optimal. The performance of the algorithms is evaluated with both synthetic and real-world data.
AB - In this paper, we propose a cost-aware cascading bandits model, a new variant of multi-armed bandits with cascading feedback that accounts for the random cost of pulling arms. In each step, the learning agent chooses an ordered list of items and examines them sequentially until a certain stopping condition is satisfied. Our objective is to maximize the expected net reward in each step, i.e., the reward obtained in each step minus the total cost incurred in examining the items, by deciding the ordered list of items as well as when to stop the examination. We first consider the setting where the instantaneous cost of pulling an arm is unknown to the learner until it has been pulled. We study both the offline and online settings, depending on whether the state and cost statistics of the items are known beforehand. For the offline setting, we show that the Unit Cost Ranking with Threshold 1 (UCR-T1) policy is optimal. For the online setting, we propose a Cost-aware Cascading Upper Confidence Bound (CC-UCB) algorithm and show that the cumulative regret scales as O(\log T). We also provide a lower bound for all \alpha-consistent policies, which scales as \Omega(\log T) and matches our upper bound. We then investigate the setting where the instantaneous cost of pulling each arm is available to the learner for its decision-making, and show that a slight modification of the CC-UCB algorithm, termed CC-UCB2, is order-optimal. The performance of the algorithms is evaluated with both synthetic and real-world data.
UR - http://www.scopus.com/inward/record.url?scp=85086717037&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85086717037&partnerID=8YFLogxK
U2 - 10.1109/TSP.2020.3001388
DO - 10.1109/TSP.2020.3001388
M3 - Article
AN - SCOPUS:85086717037
SN - 1053-587X
VL - 68
SP - 3692
EP - 3706
JO - IEEE Transactions on Signal Processing
JF - IEEE Transactions on Signal Processing
M1 - 9113431
ER -