Safe Policy Improvement with Baseline Bootstrapping
- Romain Laroche
- Paul Trichelair
European Workshop on Reinforcement Learning (EWRL)
In this paper, we consider the Batch Reinforcement Learning task and adopt the safe policy improvement (SPI) approach: we compute a target policy guaranteed to perform at least as well as a given baseline policy, approximately and with high probability. Our SPI strategy, inspired by the knows-what-it-knows paradigm, consists in bootstrapping the target policy with the baseline policy when the target does not know. We develop a policy-based, computationally efficient bootstrapping algorithm, accompanied by theoretical SPI bounds for the tabular case. We empirically show the limits of existing algorithms on a small stochastic gridworld problem, and then demonstrate that our algorithm improves not only the worst-case performance but also the mean performance.
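To make the bootstrapping idea concrete, below is a minimal Python sketch of a greedy policy-improvement step in the tabular setting: state-action pairs whose batch counts fall below a threshold are treated as "unknown" and keep the baseline's probability mass, while the remaining mass goes to the best well-estimated action. The function name, array layout, and the threshold argument `n_wedge` are illustrative assumptions, not the paper's exact implementation (see the repositories below for that).

```python
import numpy as np

def spibb_greedy_step(pi_b, q, counts, n_wedge):
    """Sketch of one greedy bootstrapped policy-improvement step.

    pi_b:    (S, A) baseline policy probabilities
    q:       (S, A) estimated action values under the current target policy
    counts:  (S, A) state-action visit counts in the batch
    n_wedge: count threshold below which a pair is considered unknown
    """
    S, A = pi_b.shape
    pi = np.zeros_like(pi_b)
    bootstrapped = counts < n_wedge  # pairs the batch does not support
    for s in range(S):
        # Keep the baseline's probability mass on bootstrapped pairs.
        pi[s, bootstrapped[s]] = pi_b[s, bootstrapped[s]]
        free_mass = 1.0 - pi[s].sum()
        trusted = np.where(~bootstrapped[s])[0]
        if trusted.size > 0:
            # Assign the remaining mass to the best well-estimated action.
            best = trusted[np.argmax(q[s, trusted])]
            pi[s, best] += free_mass
        # If every action is bootstrapped, pi[s] already equals pi_b[s].
    return pi
```

Iterating this step inside policy iteration keeps the target policy identical to the baseline wherever the data is too scarce to justify a change, which is what yields the high-probability improvement guarantee described in the abstract.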
Publication Downloads
Implementation of Safe Policy Improvement with Baseline Bootstrapping
May 13, 2019
This project can be used to reproduce the finite MDP experiments presented in the ICML 2019 paper: Safe Policy Improvement with Baseline Bootstrapping, by Romain Laroche, Paul Trichelair, and Rémi Tachet des Combes. For the DQN implementation, please refer to the SPIBB-DQN repository below.
Implementation of SPIBB-DQN
May 13, 2019
This project provides the DQN implementation presented in the ICML 2019 paper: Safe Policy Improvement with Baseline Bootstrapping, by Romain Laroche, Paul Trichelair, and Rémi Tachet des Combes. For the finite MDP experiments, please refer to the repository above.