Learning good policies from suboptimal demonstrations

Imitating an expert policy is one way to accelerate reinforcement learning algorithms, which otherwise rely on random exploration and large amounts of data. While some major drawbacks of imitation learning, such as compounding errors, have been addressed, most results rest on the implicit assumption that a good (or near-optimal) expert is available, which is often hard to obtain in practice. To overcome this obstacle, this paper focuses on learning good policies from suboptimal demonstration data. We systematically investigate the performance of a recent approach under varying assumptions on demonstration quality and show that it performs poorly when the demonstrations are suboptimal. We then demonstrate that this issue can potentially be overcome by comparing the performance of the learner against that of the demonstrator.