TY - JOUR

T1 - Closing the gap

T2 - A learning algorithm for lost-sales inventory systems with lead times

AU - Zhang, Huanan

AU - Chao, Xiuli

AU - Shi, Cong

N1 - Funding Information:
History: Accepted by Yinyu Ye, optimization. Funding: The research of H. Zhang is partially supported by the National Science Foundation (NSF) Civil, Mechanical and Manufacturing Innovation (CMMI) [Grants 1362619, 1634505, and 1634676]. The research of X. Chao is partially supported by NSF CMMI [Grants 1362619 and 1634676]. The research of C. Shi is partially supported by NSF CMMI [Grants 1362619 and 1634505]. Supplemental Material: The online appendix is available at https://doi.org/10.1287/mnsc.2019.3288.

PY - 2020/5

Y1 - 2020/5

AB - We consider a periodic-review, single-product inventory system with lost sales and positive lead times under censored demand. In contrast to the classical inventory literature, we assume the firm does not know the demand distribution a priori and makes an adaptive inventory-ordering decision in each period based only on the past sales (censored demand) data. The standard performance measure is regret, which is the cost difference between a learning algorithm and the clairvoyant (full-information) benchmark. When the benchmark is chosen to be the (full-information) optimal base-stock policy, Huh et al. [Huh WT, Janakiraman G, Muckstadt JA, Rusmevichientong P (2009a) An adaptive algorithm for finding the optimal base-stock policy in lost sales inventory systems with censored demand. Math. Oper. Res. 34(2):397-416.] developed a nonparametric learning algorithm with a cubic-root convergence rate on regret. An important open question is whether there exists a nonparametric learning algorithm whose regret rate matches the theoretical lower bound of any learning algorithm. In this work, we provide an affirmative answer to this question. More precisely, we propose a new nonparametric algorithm termed the simulated cycle-update policy and establish a square-root convergence rate on regret, which is proven to be the lower bound of any learning algorithm. Our algorithm uses a random cycle-updating rule based on an auxiliary simulated system running in parallel and also involves two new concepts, namely the withheld on-hand inventory and the double-phase cycle gradient estimation. The techniques developed are effective for learning a stochastic system with complex system dynamics and lasting impact of decisions.

UR - http://www.scopus.com/inward/record.url?scp=85084915767&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85084915767&partnerID=8YFLogxK

U2 - 10.1287/mnsc.2019.3288

DO - 10.1287/mnsc.2019.3288

M3 - Article

AN - SCOPUS:85084915767

VL - 66

SP - 1962

EP - 1980

JO - Management Science

JF - Management Science

SN - 0025-1909

IS - 5

ER -