Offline Reinforcement Learning

This page introduces the research area of Offline Reinforcement Learning (also called Batch Reinforcement Learning). It consists of training a target policy from a fixed dataset of trajectories collected with a behavioral policy. In contrast to classic Reinforcement Learning (RL), the learning agent cannot interact with the environment, which prevents it from using the usual trial-and-error feedback loop.

More precisely, in standard RL the agent interacts directly with the environment while learning. In Offline RL, by contrast, the learner has no control over data collection: a separate agent, called the behavioral agent, interacts with the environment following a behavioral policy, and the resulting trajectories are collected and stored in a dataset. The Offline RL algorithm is then asked to produce a new policy from this trajectory dataset alone, without direct access to the environment. This process is depicted in the figure below.

Figure: Offline RL diagram.
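To make the two phases concrete, here is a minimal sketch in Python. It assumes a hypothetical environment interface with `reset()` and `step(action)` and hashable states; fitted Q-iteration is used only as one classic example of an offline learner, not as the specific method of the work linked below.

```python
# --- Phase 1: the behavioral agent logs a fixed dataset of transitions. ---
# The offline learner is not involved at this stage.
def collect_dataset(env, behavioral_policy, num_steps):
    dataset = []  # list of (state, action, reward, next_state, done) tuples
    state = env.reset()
    for _ in range(num_steps):
        action = behavioral_policy(state)
        next_state, reward, done = env.step(action)
        dataset.append((state, action, reward, next_state, done))
        state = env.reset() if done else next_state
    return dataset

# --- Phase 2: the offline learner only ever sees the dataset, never the environment. ---
def fitted_q_iteration(dataset, num_actions, gamma=0.99, iterations=50):
    """Repeatedly regress Q-values on the fixed data (tabular stand-in for a function approximator)."""
    q_table = {}  # (state, action) -> estimated return
    for _ in range(iterations):
        targets = {}
        for state, action, reward, next_state, done in dataset:
            best_next = max(q_table.get((next_state, a), 0.0) for a in range(num_actions))
            targets[(state, action)] = reward + (0.0 if done else gamma * best_next)
        q_table.update(targets)
    # The new policy is derived greedily from Q, with no further interaction.
    return lambda s: max(range(num_actions), key=lambda a: q_table.get((s, a), 0.0))
```

Note that nothing in Phase 2 calls `env`: the quality of the learned policy depends entirely on which states and actions the behavioral policy happened to cover in the dataset, which is the central difficulty of Offline RL.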

At MSR, we have recorded a tutorial lecture on Offline RL and have contributed to algorithmic development and theoretical foundations for Offline RL.