Theoretical foundations for Offline Reinforcement Learning

MSR contributions to the theoretical foundations of Offline RL

Overall, MSR has made recent advances in the statistical foundations of Offline RL, where a central question is to understand which representational conditions (involving the function approximator) and coverage conditions (involving the data distribution) enable sample-efficient Offline RL in large state spaces. Other theoretical questions about specific algorithms have also been addressed:

  • A natural starting point is to consider linear function approximation, where we have developed a precise understanding of how these conditions affect the performance of traditional linear methods such as least-squares temporal difference (LSTD) learning (a minimal LSTD sketch appears after this list).
  • Beyond this, we have established a trade-off between these conditions in a strict sense: Offline RL is possible with (comparatively) strong representation conditions and (comparatively) weak coverage conditions, or vice versa, but it is not possible under weak versions of both conditions. These results raise intriguing questions about Offline RL frameworks or settings that allow for significant weakening of these conditions.
  • Theoretical and empirical analysis of the reward-conditioned supervised learning approach, for which we provide tight performance bounds exposing its failure modes (in review).
  • Any non-Markovian policy admits an occupancy measure (i.e., a distribution of transition samples) that can be reproduced with a Markovian policy. This theoretical result is significant for the Offline RL field because many algorithms (such as the SPIBB family) rely on estimating the baseline under the assumption that it is Markovian, which is generally not the case in practice. Our result proves that this approach remains well-founded even when the behavioral policy is not Markovian (in review); a small numerical illustration appears after this list.
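
To make the first bullet concrete, below is a minimal sketch of LSTD for policy evaluation with linear function approximation. The feature map, discount factor, dataset format, and ridge regularizer are illustrative assumptions for this sketch, not the precise setup analyzed in our work.

```python
import numpy as np

def lstd(transitions, phi, gamma=0.99, ridge=1e-3):
    """Least-squares temporal difference (LSTD) for policy evaluation.

    Estimates weights w such that V(s) ~ phi(s) @ w from an offline batch of
    transitions (s, r, s') collected under the policy being evaluated.
    `phi` maps a state to a d-dimensional feature vector; `ridge` keeps the
    matrix A invertible when data coverage is poor.
    """
    d = phi(transitions[0][0]).shape[0]
    A = ridge * np.eye(d)
    b = np.zeros(d)
    for s, r, s_next in transitions:
        f, f_next = phi(s), phi(s_next)
        A += np.outer(f, f - gamma * f_next)  # A = sum_i phi_i (phi_i - gamma phi'_i)^T
        b += r * f                            # b = sum_i r_i phi_i
    return np.linalg.solve(A, b)              # w = A^{-1} b

# Illustrative usage: a two-state chain with one-hot features.
phi = lambda s: np.eye(2)[s]
batch = [(0, 1.0, 1), (1, 0.0, 0), (0, 1.0, 1), (1, 0.0, 0)]
print("estimated values:", lstd(batch, phi, gamma=0.9))
```

How accurate the resulting estimate is depends on exactly the two ingredients discussed above: how well the features can represent the value function, and how well the batch covers the states the evaluated policy visits.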
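The last bullet can be illustrated numerically: estimate the discounted state-action occupancy of a non-Markovian behavior policy in a small tabular MDP, turn it into a Markovian policy by conditioning on the state, and check that the two occupancies match. The toy environment, the history-dependent policy, and the Monte-Carlo estimator below are assumptions made for the sketch, not the construction from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 3, 2, 0.95
n_episodes, horizon = 5000, 40

# Toy MDP: a fixed random transition kernel P[s, a] over next states.
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))

def behavior_action(t, s):
    # Non-Markovian behavior policy: the action depends on the time step,
    # i.e. on the length of the history, not only on the current state.
    return (t + s) % n_actions

def occupancy(markov_pi=None):
    """Monte-Carlo estimate of the (normalized) discounted occupancy d(s, a)."""
    d = np.zeros((n_states, n_actions))
    for _ in range(n_episodes):
        s = 0
        for t in range(horizon):
            if markov_pi is None:
                a = behavior_action(t, s)
            else:
                a = rng.choice(n_actions, p=markov_pi[s])
            d[s, a] += gamma ** t
            s = rng.choice(n_states, p=P[s, a])
    return d / d.sum()

d_beta = occupancy()  # occupancy of the non-Markovian behavior policy

# Markovian policy obtained by conditioning the occupancy on the state;
# unvisited states (zero rows) fall back to a uniform distribution.
row = d_beta.sum(axis=1, keepdims=True)
pi_markov = np.where(row > 0, d_beta / np.maximum(row, 1e-12), 1.0 / n_actions)

d_markov = occupancy(markov_pi=pi_markov)
print("max occupancy gap:", np.abs(d_beta - d_markov).max())  # small, up to MC noise
```

Because the Markovian policy reproduces the behavior policy's occupancy, methods that only consume that occupancy (or transition samples drawn from it) are not misled by treating the baseline as Markovian, which is the practical consequence highlighted in the bullet above.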