The next challenges for reinforcement learning

Recent years have seen great progress in AI. In particular, artificial agents have learned to classify images and recognize speech at near-human levels. However, for artificial agents to reach their full potential, they should not only observe, but also act and learn from the consequences of their actions. Learning how to behave is especially important when an agent interacts with humans through natural language, because of the complexity of language and because each person has a different communication style.

Reinforcement learning (RL) is the area of research concerned with learning effective behavior in a data-driven way. While RL has been around for at least 30 years, in the last two years it has experienced a big boost in popularity, building on recent advances in deep learning. For example, RL played a crucial role in DeepMind’s AlphaGo program that beat a top-level Go player in 2016. And a year earlier, the Deep Q-Network (DQN) method, which combined a classical RL method with deep convolutional networks, learned to play a large number of Atari 2600 games at above-human level. These successes led MIT Technology Review to include RL in its list of top 10 technologies of 2017.
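
To make the DQN idea more concrete, here is a minimal sketch of the one-step temporal-difference target it optimizes, with a small convolutional Q-network standing in for the Atari-scale architecture. This is an illustrative PyTorch sketch, not DeepMind’s implementation; the layer sizes, the mean-squared-error loss and the helper names are assumptions.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a stack of game frames to one Q-value per action."""
    def __init__(self, num_actions, in_channels=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
            nn.LazyLinear(256), nn.ReLU(),
            nn.Linear(256, num_actions),
        )

    def forward(self, frames):
        return self.net(frames)

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    """One-step TD loss: pull Q(s, a) toward r + gamma * max_a' Q_target(s', a')."""
    obs, actions, rewards, next_obs, done = batch
    q_taken = q_net(obs).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        next_q = target_net(next_obs).max(dim=1).values
        target = rewards + gamma * (1.0 - done) * next_q
    return nn.functional.mse_loss(q_taken, target)
```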

Despite the recent successes of RL, a lot of work remains before it becomes a mainstream technique. In this blog post, we look at some of the remaining challenges that are currently being studied. These challenges are reflected in the domains researchers focus on; the current trend is to study (simulated) 3D environments. For example, Microsoft’s Project Malmo has made the world of Minecraft available to researchers, and DeepMind has open-sourced their own in-house developed 3D environment.

These 3D environments focus RL research on challenges such as multi-task learning, learning to remember, and safe and effective exploration. Below, we discuss these challenges in more detail.

Multi-task learning

To achieve general AI, an agent should be able to perform many different types of tasks, rather than specializing in just one. Rich Caruana describes multi-task learning as follows in his 1997 paper:

“An approach to inductive transfer that improves generalization by using the domain information contained in the training signals of related tasks as an inductive bias.”

Achieving multi-task learning is currently one of the big challenges of AI, and of RL in particular. The core of this challenge is scalability: it should not take 1,000 times as many samples or hours of computation to learn 1,000 different tasks as it takes to learn a single task. Instead, an AI agent should build up a library of general knowledge and learn general skills that can be used across a variety of tasks. This ability is not present in, for example, DQN. While DQN can play a large number of Atari games, there is no learning across tasks; each game is learned from scratch, which is not a scalable approach. Besides being scalable, an agent that learns general skills can also quickly adapt to new tasks, enabling consistent performance in dynamic environments.
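
One common way to pursue this kind of sharing, sketched below purely as an illustration, is a network with a shared trunk that is updated by every task and a small head per task. The layer sizes and the dictionary-of-tasks interface are assumptions, not a specific published architecture.

```python
import torch.nn as nn

class MultiTaskPolicy(nn.Module):
    """Shared trunk learns general features; each task keeps its own small head."""
    def __init__(self, obs_dim, action_dims):
        # action_dims: e.g. {"breakout": 4, "pong": 6} -- number of actions per task
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(obs_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
        )
        self.heads = nn.ModuleDict(
            {task: nn.Linear(256, n) for task, n in action_dims.items()}
        )

    def forward(self, obs, task):
        # The trunk is trained on experience from all tasks, so knowledge is shared;
        # only the per-task head has to be learned from scratch for a new task.
        return self.heads[task](self.trunk(obs))
```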

Learning to remember

For many real-world tasks, an observation only captures a small part of the full environment state that determines the best action. In such partially observable environments, an agent has to take into account not just the current observation, but also past observations in order to determine the best action.

For example, consider an intelligent agent in the workplace that helps a customer-support employee carry out actions to address a customer issue. The employee may ask a customer about a billing issue. That customer might have a home phone, a mobile, and an internet account. If the employee asks “What’s the outstanding balance on the account?”, the agent must remember the course of the conversation to understand which account the employee refers to.

Remembering everything in a conversation, however, makes learning a good policy intractable. As humans speak, we move from topic to topic, changing the subject and looping back again. Some information is very important, whereas other information is more tangential. Hence, the challenge is to learn a compact representation that stores only the most salient information.
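
A common way to learn such a compact representation, shown here only as a minimal sketch under assumed dimensions, is to fold the observation history into the fixed-size hidden state of a recurrent network; whatever the agent “remembers” has to fit into that vector.

```python
import torch
import torch.nn as nn

class RecurrentPolicy(nn.Module):
    """Compresses the history of observations into a fixed-size memory vector."""
    def __init__(self, obs_dim, num_actions, memory_dim=128):
        super().__init__()
        self.encoder = nn.Linear(obs_dim, memory_dim)
        self.memory = nn.GRUCell(memory_dim, memory_dim)
        self.policy_head = nn.Linear(memory_dim, num_actions)

    def step(self, obs, hidden):
        # `hidden` is the agent's entire memory of the conversation so far;
        # the network must learn to keep only the salient details in it.
        x = torch.relu(self.encoder(obs))
        hidden = self.memory(x, hidden)
        return self.policy_head(hidden), hidden
```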

Safe and effective exploration

Another challenge is exploration. RL is based on a trial-and-error process in which different (combinations of) actions are tried in order to find the action sequence that yields the highest total reward. However, this brings up two fundamental challenges. The first is related to safety. Consider an AI agent that has to learn how to drive a car in the real world. In this case, the agent cannot simply try actions at random, because this would have disastrous results. To learn how to drive it should try things it hasn’t done before, but at the same time it should be very careful about what it tries, because actions have real consequences.

But even in a simulator, where actions can be explored safely, effective exploration remains a challenge. In complex, sparse-reward tasks, it can be nearly impossible for an RL agent that learns from scratch to stumble upon a reward. Consider an assembly robot that has access to all the different parts that make up a car and has to learn how to assemble one. The chance that, behaving randomly, it places all the parts in exactly the right places and sees a positive reward is negligible.

To deal with such problems, researchers have been looking into concepts like imitation learning, intrinsic motivation, and hierarchical learning. With imitation learning, a human demonstrates good behavior and the agent tries to mimic it and potentially improve on it further. Intrinsic motivation is based on the notion that behavior is not just the result of external reward, but is also driven by internal desires; for example, people might try something simply out of curiosity. The challenge is to find internal drivers that eventually move the agent toward external reward. Hierarchical learning decomposes a task into smaller, easier-to-learn subtasks.
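
As a concrete illustration of intrinsic motivation, the sketch below adds a simple count-based novelty bonus on top of the external reward. The bonus scale and the assumption that states can be hashed are illustrative choices, not a specific published method.

```python
from collections import defaultdict
import math

class CountBasedBonus:
    """Gives extra reward for visiting states the agent has rarely seen before."""
    def __init__(self, scale=0.1):
        self.counts = defaultdict(int)
        self.scale = scale

    def reward(self, state, external_reward):
        key = hash(state)  # assumes states are hashable, e.g. tuples of features
        self.counts[key] += 1
        bonus = self.scale / math.sqrt(self.counts[key])
        # Even when external_reward is zero, as in sparse-reward tasks,
        # the bonus keeps nudging the agent toward unfamiliar states.
        return external_reward + bonus
```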

Reinforcement learning at the Montreal lab

At Microsoft Research Montreal, we are working on these grand RL challenges, as well as additional challenges that are unique to dealing with language. For example, in our research into memory within dialogue systems, we propose the concept of frames, which collect preferences into sets during a conversation so that the full conversation can be taken into account when making decisions. In other research, we generalized the hierarchical framework to allow for a larger variety of decompositions.

For more details, take a look at some of our published papers and recent talks. We invite you to view our career opportunities.
