Teaching a robot to see and navigate with simulation


Editor’s note: This post and its research are the collaborative efforts of our team, which includes Wenshan Wang, Delong Zhu of the Chinese University of Hong Kong; Xiangwei Wang of Tongji University; Yaoyu Hu, Yuheng Qiu, Chen Wang of the Robotics Institute of Carnegie Mellon University; Yafei Hu, Ashish Kapoor of Microsoft Research, and Sebastien Scherer. 

The ability to see and navigate is a critical operational requirement for robots and autonomous systems. Consider, for example, autonomous rescue robots that must maneuver and navigate in challenging physical environments that humans cannot safely access. Similarly, building AI agents that can efficiently and safely couple perception and action requires thoughtful engineering of a robot's sensing and perceptual abilities. However, building a real-world autonomous system that can operate safely at scale is a very difficult task. The partnership between Microsoft Research and Carnegie Mellon University, announced in April 2019, continues to advance the state of the art in autonomous systems through research focused on solving real-world challenges such as autonomous mapping, navigation, and inspection of underground urban and industrial environments.


Simultaneous Localization and Mapping (SLAM) is one of the most fundamental capabilities necessary for robots. SLAM has made impressive progress with both geometric-based and learning-based methods; however, robust and reliable SLAM systems for real-world scenarios remain elusive. Real-life environments are full of difficult cases such as lighting changes, a lack of illumination, dynamic objects, and texture-less scenes.

Recent advances in deep reinforcement learning, data-driven control, and deep perception models are fundamentally changing how we build and engineer autonomous systems. Much of SLAM's past success has come from geometric approaches. The availability of large training datasets, collected in a wide variety of conditions, helps push the envelope of data-driven techniques and algorithms.

SLAM is fundamentally different from static image recognition, object recognition, or activity recognition: it must sequentially recognize landmarks (such as buildings and trees) in a dynamic physical environment while driving or flying through it. In addition, many SLAM systems use multiple sensing modalities, such as RGB cameras, depth cameras, and LiDAR, which makes data collection a considerable challenge. Finally, we believe that a key to solving SLAM robustly is curating data instances with ground truth in a wide variety of conditions, with varying lighting, weather, and scenes, a task that is daunting and expensive to accomplish in the real world with real robots.

We present TartanAir, a comprehensive dataset for robot navigation tasks and more. The dataset was collected in photorealistic simulation environments based on AirSim, with varied lighting and weather conditions and moving objects. Our paper on the dataset has been accepted and will appear at the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2020). By collecting data in simulation, we can obtain multi-modal sensor data and precise ground truth labels, including stereo RGB images, depth images, segmentation, optical flow, camera poses, and LiDAR point clouds. TartanAir contains a large number of environments with varied styles and scenes, covering challenging viewpoints and diverse motion patterns that are difficult to achieve with physical data collection platforms. Based on the TartanAir dataset, we are hosting a visual SLAM challenge, which kicked off at a Computer Vision and Pattern Recognition (CVPR) 2020 workshop. It consists of a monocular track and a stereo track, each with 16 trajectories containing challenging features that aim to push the limits of visual SLAM algorithms. The goal is to localize the robot and map the environment from a sequence of monocular or stereo images. The deadline to submit entries to the challenge is August 15, 2020. To learn more about participating, visit the challenge site.

Read the paper | Download the code | Watch the video
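
For readers who want a feel for what one time step of this multi-modal data looks like, here is a minimal loading sketch. The directory layout, file names, and pose format below are illustrative assumptions rather than the exact conventions of the released dataset.

```python
# Minimal sketch of loading one multi-modal TartanAir frame with NumPy and
# OpenCV. Directory names, file names, and the pose format below are
# illustrative assumptions, not the exact conventions of the released dataset.
import numpy as np
import cv2

def load_frame(traj_dir: str, idx: int) -> dict:
    """Gather the modalities described above for a single time step."""
    name = f"{idx:06d}"
    return {
        # Stereo RGB pair
        "rgb_left": cv2.imread(f"{traj_dir}/image_left/{name}.png"),
        "rgb_right": cv2.imread(f"{traj_dir}/image_right/{name}.png"),
        # Dense depth and semantic segmentation, assumed stored as NumPy arrays
        "depth": np.load(f"{traj_dir}/depth_left/{name}.npy"),
        "segmentation": np.load(f"{traj_dir}/seg_left/{name}.npy"),
        # Optical flow from this frame to the next one
        "flow": np.load(f"{traj_dir}/flow/{name}.npy"),
    }

# Camera poses, assumed here as one line per frame: x y z qx qy qz qw
poses = np.loadtxt("trajectory_000/pose_left.txt")  # shape (N, 7)
frame0 = load_frame("trajectory_000", 0)
```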

We minimize the sim-to-real gap by utilizing a large number of environments with varied styles and diverse scenes. A unique goal of our dataset is to focus on challenging environments with changing light conditions, adverse weather, and dynamic objects. State-of-the-art SLAM algorithms struggle to track the camera pose in our dataset and repeatedly get lost on the more challenging sequences. We propose a metric to evaluate the robustness of an algorithm. We also developed an automatic data collection pipeline for the TartanAir dataset, which allows us to process more environments with minimal human intervention.
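
To make the evaluation idea concrete, below is a minimal sketch of Absolute Trajectory Error (ATE), a standard measure of how far an estimated camera trajectory drifts from ground truth. The robustness metric we actually use is defined in the paper; this sketch only illustrates the kind of quantity such an evaluation builds on, assuming the two trajectories are already associated frame by frame.

```python
# A minimal sketch of Absolute Trajectory Error (ATE): the RMSE of camera
# positions after rigidly aligning an estimated trajectory to ground truth.
# Assumes both trajectories are frame-by-frame associated N x 3 arrays; the
# challenge's own robustness metric is defined in the paper.
import numpy as np

def absolute_trajectory_error(gt_xyz: np.ndarray, est_xyz: np.ndarray) -> float:
    # Center both trajectories
    gt_mean, est_mean = gt_xyz.mean(axis=0), est_xyz.mean(axis=0)
    gt_c, est_c = gt_xyz - gt_mean, est_xyz - est_mean
    # Best rotation mapping the estimate onto the ground truth (Kabsch algorithm)
    H = est_c.T @ gt_c
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T
    aligned = est_c @ R.T + gt_mean
    # Root-mean-square translational error over all frames
    return float(np.sqrt(np.mean(np.sum((aligned - gt_xyz) ** 2, axis=1))))
```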


Figure 1: A glance at the simulated environments. The TartanAir dataset covers a wide range of scenes, categorized into urban, rural, nature, domestic, public, and science fiction. Environments within the same category also have broad diversity.

Dataset features

We have adopted 30 photorealistic simulation environments. The environments provide a wide range of scenarios that cover many challenging situations. The simulation scenes consist of:

  • Indoor and outdoor scenes with detailed 3D objects. The indoor environments are multi-room and richly decorated. The outdoor environments include various kinds of buildings, trees, terrains, and landscapes.
  • Special purpose facilities and ordinary household scenes.
  • Rural and urban scenes.
  • Real-life and sci-fi scenes.

In some environments, we simulate multiple types of challenging visual effects:

  • Difficult lighting conditions: day-night alternation, low lighting, and rapidly changing illumination.
  • Weather effects: clear, rain, snow, wind, and fog.
  • Seasonal changes.
Figure 2: Challenging scenarios, including day-night changes, weather effects, seasonal changes, and dynamic objects.

Using AirSim, we can extract various types of ground truth labels, including depth, semantic segmentation tags, and camera poses. From the extracted raw data, we further compute other ground truth labels, such as optical flow, stereo disparity, simulated multi-line LiDAR points, and simulated IMU readings.
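
As an example of how a derived label follows from the raw data, stereo disparity can be computed directly from ground-truth depth once the focal length and stereo baseline are known (disparity = focal length × baseline / depth). The calibration numbers in this sketch are illustrative assumptions, not the dataset's published values.

```python
# Stereo disparity from ground-truth depth: disparity = fx * baseline / depth.
# The focal length and baseline below are illustrative assumptions, not the
# dataset's published calibration.
import numpy as np

def depth_to_disparity(depth_m: np.ndarray, fx_pixels: float, baseline_m: float) -> np.ndarray:
    """Convert a metric depth map to a disparity map, guarding against zero depth."""
    depth = np.clip(depth_m, 1e-6, None)
    return fx_pixels * baseline_m / depth

# Example with a flat 10 m depth map and assumed calibration values
disparity = depth_to_disparity(np.full((480, 640), 10.0), fx_pixels=320.0, baseline_m=0.25)
```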

In each simulated environment, we gather data by following multiple routes and moving with different levels of aggressiveness. The virtual camera can move slowly and smoothly without sudden jitter, or it can move aggressively, with significant rolling and yaw motions.
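
One way such motion levels could be parameterized is to perturb camera orientations along a sampled route with a random walk whose step size grows with the aggressiveness level. The level names and scales below are illustrative assumptions, not the actual parameters of our pipeline.

```python
# Illustrative sketch: vary motion "aggressiveness" by scaling the step size of
# a random walk over camera orientation offsets. Levels and scales are assumed
# for illustration only.
import numpy as np

def sample_orientation_offsets(n_frames: int, level: str, seed: int = 0) -> np.ndarray:
    """Return per-frame (roll, pitch, yaw) offsets in radians."""
    scale = {"easy": 0.02, "medium": 0.10, "hard": 0.35}[level]
    rng = np.random.default_rng(seed)
    # A random walk keeps consecutive frames correlated, so the viewpoint
    # changes continuously rather than jumping between frames.
    steps = rng.normal(scale=scale, size=(n_frames, 3))
    return np.cumsum(steps, axis=0)

smooth_motion = sample_orientation_offsets(100, "easy")
aggressive_motion = sample_orientation_offsets(100, "hard")
```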

Figure 3: Examples of sensor data and ground truth labels: the stereo image sequence, depth and segmentation, camera pose and IMU, occupancy map, optical flow, and LiDAR. a) Data provided by the AirSim interface. b) Ground truth labels we compute from that data.

We developed a highly automated pipeline to facilitate data acquisition. For each environment, we build an occupancy map by incremental mapping. Based on the map, we sample trajectories for the virtual camera to follow. A set of virtual cameras follows these trajectories to capture raw data from AirSim. The raw data are then processed to generate labels such as optical flow, stereo disparity, simulated LiDAR points, and simulated IMU readings. Finally, we verify the data by warping images across time and viewpoints to make sure they are valid and synchronized.
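
The verification step can be illustrated with a small photometric-consistency check: warp the pixels of one frame into the next using the ground-truth depth, relative pose, and camera intrinsics, and measure how many reprojected intensities roughly agree. The function below is a simplified sketch with assumed variable names, not the pipeline's actual implementation.

```python
# A simplified photometric-consistency check for verification, assuming
# grayscale images as 2D arrays, a 3x3 intrinsics matrix K, and a 4x4 relative
# pose T_t1_t taking points from the frame-t camera to the frame-(t+1) camera.
# Names and thresholds are illustrative, not the pipeline's actual code.
import numpy as np

def warped_agreement(img_t, img_t1, depth_t, K, T_t1_t, thresh=10.0):
    """Fraction of pixels whose intensity roughly matches after reprojection."""
    h, w = depth_t.shape
    v, u = np.mgrid[0:h, 0:w]
    # Back-project every pixel of frame t into 3D using the ground-truth depth
    pix = np.stack([u * depth_t, v * depth_t, depth_t]).reshape(3, -1)
    pts_t = np.linalg.inv(K) @ pix
    # Move the points into the frame-(t+1) camera and project them back
    pts_t1 = (T_t1_t @ np.vstack([pts_t, np.ones((1, pts_t.shape[1]))]))[:3]
    uv = (K @ pts_t1) / np.clip(pts_t1[2:], 1e-6, None)
    u1, v1 = np.round(uv[0]).astype(int), np.round(uv[1]).astype(int)
    # Keep points that land inside the image and in front of the camera
    valid = (u1 >= 0) & (u1 < w) & (v1 >= 0) & (v1 < h) & (pts_t1[2] > 0)
    diff = np.abs(img_t1[v1[valid], u1[valid]].astype(float)
                  - img_t.reshape(-1)[valid].astype(float))
    # Occluded or dynamic pixels will fail the check; most static ones should pass
    return float(np.mean(diff < thresh))
```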

Figure 4: Data acquisition pipeline: a) incremental mapping, b) trajectory sampling, c) data processing, and d) data verification.

Conclusions

Although our data is synthetic, we aim to push SLAM algorithms toward real-world applications. We hope that the proposed dataset and benchmarks will complement existing ones, help reduce overfitting to datasets with limited training or testing examples, and contribute to the development of algorithms that work well in practice. As our preliminary experiments on learning-based visual odometry (VO) show, a simple VO network trained on the diverse TartanAir dataset generalizes directly to real-world datasets such as KITTI and EuRoC and outperforms classical geometric methods on challenging trajectories. We hope TartanAir can push the limits of current visual SLAM algorithms toward real-world applications.

Learn more about AirSim in the upcoming webinar, where Sai Vemprala will demonstrate how it can help train robust and generalizable algorithms for autonomy. Register by July 8, 2020.
