
Applied Robotics Research

GPT Models Meet Robotic Applications: Long-Step Robot Control in Various Environments


We have released practical prompts for ChatGPT to generate executable robot action sequences from multi-step human instructions in various environments.

Introduction

Imagine having a humanoid robot in your household that can be taught household chores through instruction and demonstration, without any coding. Our team has been developing such a system, which we call Learning-from-Observation.

As part of this effort, we recently released a paper, “ChatGPT Empowered Long-Step Robot Control in Various Environments: A Case Application,” which provides a specific example of how OpenAI’s ChatGPT can be used in a few-shot setting to convert natural language instructions into a sequence of executable robot actions. Our prompts and the source code for using them are open source and publicly available in this GitHub repository.

Generating robot programs from language is an attractive goal that has drawn considerable interest in the robotics research community; some recent systems are built on top of large language models such as ChatGPT. However, most of them were developed within a limited scope, are hardware-dependent, or lack human-in-the-loop functionality. Additionally, most of these studies rely on a specific dataset, which requires data re-collection and model retraining when transferring or extending them to other robotic scenes. From a practical application standpoint, an ideal robotic solution is one that can be applied to new applications or operational settings without extensive data collection or model retraining.

In this paper, we provide a specific example of how ChatGPT can be used in a few-shot setting to convert natural language instructions into a sequence of actions that a robot can execute. In designing the prompts, we tried to ensure that they meet the requirements common to many practical applications while remaining easy to customize. The requirements we defined for this paper are:

  • Easy integration with robot execution systems or visual recognition programs.
  • Applicability to various home environments.
  • The ability to provide an arbitrary number of natural language instructions while minimizing the impact of ChatGPT’s token limit.

To meet these requirements, we designed input prompts to encourage ChatGPT to:

  • Output a sequence of predefined robot actions with explanations in a readable JSON format.
  • Represent the operating environment in a formalized style.
  • Infer and output the updated state of the operating environment, which can be reused as the next input, allowing ChatGPT to operate based solely on the memory of the latest operations.
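As an illustration, a response that follows these design principles might look like the sketch below. The field names here ("task_sequence", "environment_after", and so on) are our own illustrative placeholders, not necessarily the keys used in the released prompts:

```python
import json

# Hypothetical ChatGPT response following the design above: a sequence of
# predefined robot actions with explanations, plus the inferred state of the
# environment after execution, all in readable JSON.
response_text = """
{
  "task_sequence": [
    {"action": "move_hand", "explanation": "Reach toward the cup on the table."},
    {"action": "grasp_object", "explanation": "Grasp the cup."},
    {"action": "move_object", "explanation": "Carry the cup to the shelf."},
    {"action": "release_object", "explanation": "Place the cup on the shelf."}
  ],
  "environment_after": {
    "objects": ["cup", "table", "shelf"],
    "object_states": {"cup": "on_shelf"}
  }
}
"""

plan = json.loads(response_text)
for step in plan["task_sequence"]:
    print(step["action"], "-", step["explanation"])

# The updated environment can be fed back into the next prompt, so the model
# only needs the latest state rather than the full interaction history.
next_environment = plan["environment_after"]
```

Because the output is machine-readable JSON, a robot execution system or visual recognition program can consume the action sequence directly, which addresses the first requirement above.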

We provide a set of prompt templates that structure the entire conversation for input into ChatGPT, enabling it to generate a response. The user’s instructions, as well as a specific explanation of the working environment, are incorporated into the template and used to generate ChatGPT’s response. For the second and subsequent instructions, ChatGPT’s next response is created based on all previous turns of the conversation, allowing ChatGPT to make corrections based on its own previous output and user feedback, if requested. If the number of input tokens exceeds the allowable limit for ChatGPT, we adjust the token size by truncating the prompt while retaining the most recent information about the updated environment.
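The truncation step can be sketched as follows. This is a minimal illustration of the assumed logic, not the released implementation: when the conversation exceeds the token budget, the oldest turns are dropped first, while the system prompt and the most recent environment description always survive (a word count stands in for a real tokenizer):

```python
# Minimal sketch of history truncation under a token limit (assumed logic):
# keep the newest conversation turns that fit the budget, but always retain
# the system prompt and the latest environment description.

def truncate_history(system_prompt, turns, latest_environment, max_tokens,
                     count_tokens=lambda s: len(s.split())):
    """Return a prompt list that fits max_tokens, newest turns preferred."""
    budget = (max_tokens
              - count_tokens(system_prompt)
              - count_tokens(latest_environment))
    kept = []
    for turn in reversed(turns):  # walk from newest to oldest
        cost = count_tokens(turn)
        if cost > budget:
            break                 # older turns are discarded
        kept.append(turn)
        budget -= cost
    return [system_prompt] + list(reversed(kept)) + [latest_environment]
```

Because the latest environment state summarizes the effect of all earlier operations, dropping old turns loses little information that the model still needs.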

Prompt flow
The overall structure of the conversation that is input into ChatGPT to generate a response.

In our paper, we demonstrated the effectiveness of our proposed prompts in inferring appropriate robot actions for multi-stage language instructions in various environments. Additionally, we observed that ChatGPT’s conversational ability allows users to adjust its output with natural language feedback, which is crucial for developing an application that is both safe and robust while providing a user-friendly interface.

Integration with vision systems and robot controllers

Among recent experimental attempts to generate robot manipulation from natural language using ChatGPT, our work is unique in its focus on the generation of robot action sequences (i.e., “what-to-do”), while avoiding redundant language instructions to obtain visual and physical parameters (i.e., “how-to-do”), such as how to grab, how high to lift, and what posture to adopt. Although both types of information are essential for operating a robot in reality, the latter is often better presented visually than explained verbally. Therefore, we have focused on designing prompts for ChatGPT to recognize what-to-do, while obtaining the how-to-do information from human visual demonstrations and a vision system during robot execution.
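This division of labor can be sketched in a few lines. The names below are illustrative, not part of the released system: the language model supplies the symbolic action sequence (what-to-do), and a vision pipeline fills in the physical parameters (how-to-do) from a human demonstration:

```python
# Sketch of the what-to-do / how-to-do separation (all names are illustrative).
# The symbolic plan comes from the language model; the physical parameters
# come from a vision system observing a human demonstration.

what_to_do = ["grasp_object", "move_object", "release_object"]

def parameters_from_vision(action):
    # Placeholder for a vision pipeline that would measure grasp pose,
    # lift height, placement pose, etc. from the demonstration.
    return {
        "grasp_object": {"grasp_pose": "top", "force_n": 5.0},
        "move_object": {"lift_height_m": 0.12},
        "release_object": {"place_pose": "shelf_front"},
    }[action]

# Pair each symbolic action with its visually obtained parameters.
executable_plan = [(a, parameters_from_vision(a)) for a in what_to_do]
```

Keeping the two channels separate means the language prompts stay short and environment-agnostic, while the hard-to-verbalize details are captured where they are easiest to observe.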

As part of our efforts to develop a realistic robotic operation system, we have integrated the proposed system with a learning-from-observation system that includes a speech interface [1], [2], a visual teaching interface [3], a reusable library of robot actions [4], and a simulator for testing robot execution [5]. If you are interested, please refer to the respective papers for the results of robot execution. The code for the teaching interface is available in another GitHub repository.

An example of integrating the proposed ChatGPT-empowered task planner into a robot teaching system (the task planner is indicated by the dashed box). The system breaks natural language instructions down into a sequence of robot actions, then asks the user to visually demonstrate the tasks step by step; the parameters needed for robot execution (i.e., how to perform the actions) are extracted from this visual demonstration.

Human demonstration and robot execution
(Top) The step-by-step demonstration corresponding to the planned tasks. (Middle and Bottom) Execution of the tasks by two different types of robot hardware. We have been developing a reusable library of robot skills (e.g., grab, pick up, bring, etc.) for several types of robot hardware. To learn more about the skill library, refer to our paper.
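One way to picture such a skill library is as a registry that maps symbolic skill names, the same names that appear in the planned action sequence, to hardware-specific implementations. The sketch below is a hypothetical illustration of this idea, not the published library:

```python
# Hypothetical sketch of a reusable skill library: symbolic skill names
# (as produced by the task planner) dispatch to hardware-specific code.

class SkillLibrary:
    def __init__(self, robot_name):
        self.robot_name = robot_name
        self._skills = {}

    def register(self, name):
        """Decorator that registers a function under a symbolic skill name."""
        def decorator(fn):
            self._skills[name] = fn
            return fn
        return decorator

    def execute(self, name, **params):
        return self._skills[name](**params)

robot_a = SkillLibrary("robot A")

@robot_a.register("grab")
def grab(object_name):
    # Hardware-specific motion code would go here.
    return f"{robot_a.robot_name} grabs {object_name}"
```

A second robot would register its own implementations under the same skill names, so the planner's output stays hardware-independent.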

Conclusion

The main contribution of this paper is the provision and publication of generic prompts for ChatGPT that can be easily adapted to meet the specific needs of individual experimenters. The impressive progress of large language models is expected to further expand their use in robotics. We hope that this paper provides practical knowledge to the robotics research community, and we have made our prompts and source code available as open-source material in this GitHub repository.

Bibliography

@ARTICLE{10235949,
  author={Wake, Naoki and Kanehira, Atsushi and Sasabuchi, Kazuhiro and Takamatsu, Jun and Ikeuchi, Katsushi},
  journal={IEEE Access},
  title={ChatGPT Empowered Long-Step Robot Control in Various Environments: A Case Application},
  year={2023},
  volume={11},
  pages={95060-95078},
  doi={10.1109/ACCESS.2023.3310935}}

About our research group

Visit our homepage: Applied Robotics Research

Learn more about this project