### Abstract

We consider the solution of a finite-state, infinite-horizon Markov Decision Process (MDP) in which both the transition matrix and the cost function are misspecified, the latter in a parametric sense. We adopt a data-driven regime in which the misspecification is resolved by a learning problem posed as a stochastic convex optimization problem. Within this framework, we make the following contributions: (1) we first show that a misspecified value iteration scheme converges almost surely to its true counterpart and that the mean-squared error after K iterations is O(1/K^{1/2-α}) with 0 < α < 1/2; (2) we provide an analogous asymptotic almost-sure convergence statement for misspecified policy iteration; and (3) we present a constant-steplength misspecified Q-learning scheme and show that, after K iterations, a suitable error metric is O(1/K^{1/2-α}) + O(√δ) with 0 < α < 1/2, where δ is a bound on the steplength.
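The abstract describes value iteration driven by a model that is learned at the same time. The following is a minimal sketch of that coupling under assumptions that are not taken from the paper: a hypothetical 2-state, 2-action MDP (`P_TRUE`, `PHI`, `THETA_TRUE` and `BETA` are illustrative names and values), a cost assumed linear in a scalar parameter θ, and learning problems chosen as the simplest strongly convex stochastic programs (least-squares fits to noisy cost observations and to sampled next states), each advanced by one stochastic gradient step with steplength 1/k per iteration. It is not the authors' scheme and does not reproduce their rate analysis.

```python
import random

# Hypothetical toy MDP, for illustration only.
N_S, N_A = 2, 2
# P_TRUE[s][a][t]: true (unknown to the learner) probability of moving s -> t under action a.
P_TRUE = [
    [[0.8, 0.2], [0.3, 0.7]],
    [[0.5, 0.5], [0.1, 0.9]],
]
# Assumed linear cost model: c(s, a; theta) = theta * PHI[s][a].
PHI = [[1.0, 2.0], [0.5, 1.5]]
THETA_TRUE = 1.2
BETA = 0.5  # discount factor


def bellman(V, P, theta):
    """One min-cost Bellman update under the model (P, theta)."""
    return [
        min(theta * PHI[s][a] + BETA * sum(P[s][a][t] * V[t] for t in range(N_S))
            for a in range(N_A))
        for s in range(N_S)
    ]


def exact_value(P, theta, iters=200):
    """Reference solution: plain value iteration with a fully specified model."""
    V = [0.0] * N_S
    for _ in range(iters):
        V = bellman(V, P, theta)
    return V


def misspecified_value_iteration(K, seed=0, noise=0.5):
    """Interleave one learning step for (theta, P) with one value iteration step."""
    rng = random.Random(seed)
    theta = 0.0  # misspecified initial cost parameter
    P = [[[1.0 / N_S] * N_S for _ in range(N_A)] for _ in range(N_S)]  # uniform initial model
    V = [0.0] * N_S
    for k in range(1, K + 1):
        gamma = 1.0 / k  # diminishing steplength
        # SGD step on E[(theta - xi)^2] / 2, where xi is a noisy observation of THETA_TRUE.
        xi = THETA_TRUE + rng.gauss(0.0, noise)
        theta -= gamma * (theta - xi)
        # SGD step on E[||p - e_next||^2] / 2 for each (s, a), using one sampled transition.
        for s in range(N_S):
            for a in range(N_A):
                nxt = rng.choices(range(N_S), weights=P_TRUE[s][a])[0]
                P[s][a] = [p - gamma * (p - (1.0 if t == nxt else 0.0))
                           for t, p in enumerate(P[s][a])]
        # Value iteration step with the current (still misspecified) model.
        V = bellman(V, P, theta)
    return V, theta, P
```

For example, `misspecified_value_iteration(5000)` can be compared against `exact_value(P_TRUE, THETA_TRUE)`. With the 1/k steplength each learning step reduces to a running sample average, so θ and the rows of P converge almost surely by the strong law of large numbers, and because the Bellman update is a β-contraction the iterates V follow the fixed point of the true operator.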

Original language | English (US) |
---|---|
Title of host publication | 2015 Winter Simulation Conference, WSC 2015 |
Publisher | Institute of Electrical and Electronics Engineers Inc. |
Pages | 3801-3812 |
Number of pages | 12 |
ISBN (Electronic) | 9781467397438 |
DOIs | |
State | Published - Feb 16 2016 |
Event | Winter Simulation Conference, WSC 2015 - Huntington Beach, United States; Duration: Dec 6 2015 → Dec 9 2015 |

### Publication series

Name | Proceedings - Winter Simulation Conference |
---|---|
Volume | 2016-February |
ISSN (Print) | 0891-7736 |

### Other

Other | Winter Simulation Conference, WSC 2015 |
---|---|
Country | United States |
City | Huntington Beach |
Period | 12/6/15 → 12/9/15 |

### All Science Journal Classification (ASJC) codes

- Software
- Modeling and Simulation
- Computer Science Applications

