We extended the capabilities of ChatGPT to robotics, and controlled multiple platforms such as robot arms, drones, and home assistant robots intuitively with language.
Have you ever wanted to tell a robot what to do using your own words, like you would to a human? Wouldn’t it be amazing to just tell your home assistant robot: “Please warm up my lunch“, and have it find the microwave by itself? Even though language is the most intuitive way for us to express our intentions, we still rely heavily on hand-written code to control robots. Our team has been exploring how we can change this reality and make natural human-robot interactions possible using OpenAI (opens in new tab)‘s new AI language model, ChatGPT (opens in new tab).
ChatGPT is a language model trained on a massive corpus of text and human interactions, allowing it to generate coherent and grammatically correct responses to a wide range of prompts and questions. Our goal with this research is to see if ChatGPT can think beyond text, and reason about the physical world to help with robotics tasks. We want to help people interact with robots more easily, without needing to learn complex programming languages or details about robotic systems. The key challenge here is teaching ChatGPT how to solve problems considering the laws of physics, the context of the operating environment, and how the robot’s physical actions can change the state of the world.
It turns out that ChatGPT can do a lot by itself, but it still needs some help. Our technical paper describes a series of design principles that can be used to guide language models towards solving robotics tasks. These include, and are not limited to, special prompting structures, high-level APIs, and human feedback via text. We believe that our work is just the start of a shift in how we develop robotics systems, and we hope to inspire other researchers to jump into this exciting field. Continue reading for more technical details on our methods and ideas.
Challenges in robotics today, and how ChatGPT can help
Current robotics pipelines begin with an engineer or technical user that needs to translate the task’s requirements into code for the system. The engineer sits in the loop, meaning that they need to write new code and specifications to correct the robot’s behavior. Overall, this process is slow (user needs to write low-level code), expensive (requires highly skilled users with deep knowledge of robotics), and inefficient (requires multiple interactions to get things working properly).
ChatGPT unlocks a new robotics paradigm, and allows a (potentially non-technical) user to sit on the loop, providing high-level feedback to the large language model (LLM) while monitoring the robot’s performance. By following our set of design principles, ChatGPT can generate code for robotics scenarios. Without any fine-tuning we leverage the LLM’s knowledge to control different robots form factors for a variety of tasks. In our work we show multiple examples of ChatGPT solving robotics puzzles, along with complex robot deployments in the manipulation, aerial, and navigation domains.
Robotics with ChatGPT: design principles
Prompting LLMs is a highly empirical science. Through trial and error, we built a methodology and a set of design principles for writing prompts for robotics tasks:
- First, we define a set of high-level robot APIs or function library. This library can be specific to a particular robot, and should map to existing low-level implementations from the robot’s control stack or a perception library. It’s very important to use descriptive names for the high-level APIs so ChatGPT can reason about their behaviors;
- Next, we write a text prompt for ChatGPT which describes the task goal while also explicitly stating which functions from the high-level library are available. The prompt can also contain information about task constraints,
or how ChatGPT should form its answers (specific coding language, using auxiliary parsing elements); - The user stays on the loop to evaluate ChatGPT’s code output, either through direct inspection or using a simulator. If needed, the user uses natural language to provide feedback to ChatGPT on the answer’s quality and safety.
- When the user is happy with the solution, the final code can be deployed onto the robot.
Enough theory… What exactly can ChatGPT do?
Let’s take a look at a few examples… You can find even more case studies in our code repository (opens in new tab).
Zero-shot task planning
We gave ChatGPT access to functions that control a real drone, and it proved to be an extremely intuitive language-based interface between the non-technical user and the robot. ChatGPT asked clarification questions when the user’s instructions were ambiguous, and wrote complex code structures for the drone such as a zig-zag pattern to visually inspect shelves. It even figured out how to take a selfie! 📷 😎
We also used ChatGPT in a simulated industrial inspection scenario with the Microsoft AirSim simulator (opens in new tab). The model was able to effectively parse the user’s high-level intent and geometrical cues to control the drone accurately.
User on the loop: when a conversation is needed for a complex tasks
Next, we used ChatGPT in a manipulation scenario with a robot arm. We used conversational feedback to teach the model how to compose the originally provided APIs into more complex high-level functions: that ChatGPT coded by itself. Using a curriculum-based strategy, the model was able to chain these learned skills together logically to perform operations such as stacking blocks.
In addition, the model displayed a fascinating example of bridging the textual and physical domains when tasked with building the Microsoft logo out of wooden blocks. Not only was it able to recall the logo from its internal knowledge base, it was able to ‘draw’ the logo (as SVG code), and then use the skills learned above to figure out which existing robot actions can compose its physical form.
Next, we tasked ChatGPT to write an algorithm for a drone to reach a goal in space while not crashing into obstacles. We told the model that this drone has a forward facing distance sensor, and ChatGPT coded most of the key building blocks for the algorithm right away. This task required some conversation with the human, and we were impressed by ChatGPT’s ability to make localized code improvements using only language feedback.
Perception-action loops: robots that sense the world before they act
The ability to sense the world (perception) before doing something (action) is fundamental to any robotics system. Therefore, we decided to test ChatGPT’s understanding of this concept and asked it to explore an environment until finding a user-specified object. We gave the model access to functions such as object detection and object distance APIs, and verified that the code it generated successfully implemented a perception-action loop.
In experimental character, we ran additional experiments to evaluate if ChatGPT is able to decide where the robot should go based on sensor feedback in real time (as opposed to having ChatGPT generate a code loop that makes these decisions). Interestingly, we verified that we can feed a textual description of the camera image at each step into the chat, and the model was able to figure out how to control the robot until it reaches a particular object.
PromptCraft, a collaborative open-sourced tool for LLM+Robotics research
Good prompt engineering is crucial for the success of LLMs such as ChatGPT for robotics tasks. Unfortunately, prompting is an empirical science, and there is a lack of comprehensive and accessible resources with good (and bad) examples to help researchers and enthusiasts in the field. To address this gap, we introduce PromptCraft (opens in new tab), a collaborative open-source platform where anyone can share examples of prompting strategies for different robotics categories. We release all of the prompts and conversations used in this study. We invite the readers to contribute with more!
Besides prompt design, we hope to also include multiple robotics simulators and interfaces to allow users to test their ChatGPT-generated algorithms. As a start, we also release an AirSim environment with ChatGPT integration that anyone can use to get started with these ideas. We welcome contributions of new simulators and interfaces as well.
Bringing robotics out of labs, and into the world
We are excited to release these technologies with the aim of bringing robotics to the reach of a wider audience. We believe that language-based robotics control will be fundamental to bring robotics out of science labs, and into the hands of everyday users.
That said, we do emphasize that the outputs from ChatGPT are not meant to be deployed directly on robots without careful analysis. We encourage users to harness the power of simulations in order to evaluate these algorithms before potential real life deployments, and to always take the necessary safety precautions. Our work represents only a small fraction of what is possible within the intersection of large language models operating in the robotics space, and we hope to inspire much of the work to come.
Citation
If you find this work useful in your research, please cite us as
@techreport{vemprala2023chatgpt,
author = {Vemprala, Sai and Bonatti, Rogerio and Bucker, Arthur and Kapoor, Ashish},
title = {ChatGPT for Robotics: Design Principles and Model Abilities},
institution = {Microsoft},
year = {2023},
month = {February},
url = {https://www.microsoft.com/en-us/research/publication/chatgpt-for-robotics-design-principles-and-model-abilities/},
number = {MSR-TR-2023-8},
}
This work is being undertaken by members of the Microsoft Autonomous Systems and Robotics Research Group. The researchers included in this project are: Sai Vemprala, Rogerio Bonatti, Arthur Bucker, and Ashish Kapoor.