Although cooking is increasingly expected of robots, a robot executing a series of cooking behaviours from new recipe descriptions in the real world has not yet been realised. In this study, we propose a robot system that integrates real-world executable cooking behaviour planning using a Large Language Model (LLM) with classical planning over PDDL descriptions, together with food-state recognition learned from a small amount of data using a Vision-Language Model (VLM). In experiments, PR2, a dual-armed wheeled robot, successfully cooked newly arranged recipes in a real-world environment, confirming the effectiveness of the proposed system.
Real-world cooking robot system that accounts for food state changes from recipe descriptions, using foundation models and classical PDDL planning. The input recipe description is converted into a cooking function sequence by a Large Language Model (LLM), and an executable action procedure is planned from that sequence by classical planning over a PDDL description. The robot performs the cooking actions while recognizing state changes of the ingredients with food-state recognition learned from a small amount of data using a Vision-Language Model (VLM). Motion execution uses predefined action trajectories.
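As a rough sketch, the stages of this pipeline could be wired together as below; `recipe_to_function_sequence`, `plan_with_pddl`, and `execute_plan` are placeholder names of ours (each stage is sketched in more detail further down), not functions from the paper's implementation.

```python
# Hypothetical end-to-end skeleton of the pipeline; each stage is only a
# placeholder here and is sketched further down the page.

def cook_from_recipe(recipe_text: str) -> None:
    # 1. Recipe description -> cooking function sequence (LLM, few-shot prompting)
    function_sequence = recipe_to_function_sequence(recipe_text)

    # 2. Function sequence -> executable action procedure
    #    (rule-based goal generation + classical planning over a PDDL description)
    action_plan = plan_with_pddl(function_sequence)

    # 3. Execute predefined motion trajectories, gating heating steps on the
    #    food-state recognizer (CLIP linear probe)
    execute_plan(action_plan)
```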
First, the input recipe description in natural language is converted into a cooking function sequence that the robot can interpret, using few-shot prompting of a Large Language Model (LLM). The black text in the figure shows the actual prompt; its final recipe section is replaced with the natural language description of the recipe to be converted. The blue part is the conversion result output by the LLM. Next, rule-based processing transforms the cooking function sequence into the corresponding goal conditions for each step within the PDDL description. Finally, classical symbolic planning over the PDDL description fills in the complementary action steps so that the sequence can be executed in the real environment.
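A minimal sketch of this conversion step is shown below, assuming a generic `call_llm` helper; the few-shot examples, cooking function names, and PDDL predicates are illustrative placeholders of ours, not the paper's actual prompt or domain.

```python
# Sketch: recipe text -> cooking function sequence (few-shot LLM prompting),
# then each cooking function -> PDDL goal facts (rule-based).
# call_llm() stands in for whichever LLM API is used; examples, function
# names, and predicates below are illustrative only.
import re

FEW_SHOT_EXAMPLES = """\
recipe: Cut the carrot and put it in the pot.
functions: cut(carrot), put-into(carrot, pot)

recipe: Heat the pan with oil and fry the egg.
functions: pour(oil, pan), heat(pan), fry(egg, pan)
"""

def recipe_to_function_sequence(recipe_text: str) -> list[str]:
    prompt = (
        "Convert the recipe into a sequence of cooking functions.\n\n"
        + FEW_SHOT_EXAMPLES
        + f"\nrecipe: {recipe_text}\nfunctions:"
    )
    reply = call_llm(prompt)  # e.g. "melt(butter, pan), crack(egg, pan)"
    return re.findall(r"[\w-]+\([^)]*\)", reply)

# Rule-based mapping from a cooking function to PDDL goal facts
# (illustrative subset).
GOAL_RULES = {
    "melt": lambda a: f"(melted {a[0]}) (in {a[0]} {a[1]})",
    "crack": lambda a: f"(cracked {a[0]}) (in {a[0]} {a[1]})",
}

def function_to_goal(func: str) -> str:
    name, arg_str = func.rstrip(")").split("(", 1)
    args = [a.strip() for a in arg_str.split(",")]
    return GOAL_RULES[name](args)
```

The goal facts generated for each step can then be written into a PDDL problem file, over which a standard classical planner fills in the intermediate actions.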
We learn to recognize food states from a small amount of data by applying CLIP's linear probe to image crops of the gazing area. During inference, the learned model infers the state of the food in real time, and the moment at which the inferred state first matches the post-change label is taken as the timing of the state change.
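A minimal sketch of such a linear probe, using the open-source CLIP package together with scikit-learn; the label set, cropping of the gazing area, and frame-stream handling are simplified placeholders of ours.

```python
# Sketch: linear-probe CLIP image features for food-state classification,
# then take the first frame classified as the post-change state as the
# timing of the state change. Labels and cropping are illustrative.
import clip
import torch
from PIL import Image
from sklearn.linear_model import LogisticRegression

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def encode(images: list[Image.Image]) -> torch.Tensor:
    batch = torch.stack([preprocess(im) for im in images]).to(device)
    with torch.no_grad():
        return model.encode_image(batch).float().cpu()

# Training: a small number of labelled crops of the gazing area,
# e.g. 0 = "raw egg", 1 = "cooked egg".
def train_probe(images, labels) -> LogisticRegression:
    probe = LogisticRegression(max_iter=1000)
    probe.fit(encode(images).numpy(), labels)
    return probe

# Inference: return the first time at which the predicted state matches
# the post-change label.
def detect_state_change(probe, frame_stream, target_label):
    for t, frame in frame_stream:  # (timestamp, cropped PIL image) pairs
        pred = probe.predict(encode([frame]).numpy())[0]
        if pred == target_label:
            return t               # timing of the state change
    return None
```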
The proposed system plans cooking actions that can be executed in the real world from natural language recipe descriptions and executes them in sequence while recognizing changes in the state of the ingredients during heating with the Vision-Language Model. In this experiment, the robot executed motions that had been created in advance by a human, e.g. through direct teaching.
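As a hedged sketch of the execution side, the planned steps could be run in sequence as below; `play_trajectory` stands in for replaying a motion created in advance by a human (e.g. through direct teaching), and the action structure, names, and heating-step handling are placeholders of ours.

```python
# Sketch: execute planned actions with predefined (human-taught) trajectories,
# waiting on the food-state recognizer during heating steps.
# 'probe' and 'frame_stream' are used with detect_state_change() from the
# state-recognition sketch above; action fields are illustrative.
def execute_plan(action_plan, probe, frame_stream):
    for action in action_plan:
        if action.name == "heat-until":
            # Keep heating until the recognizer first reports the target state.
            play_trajectory("start-heating")
            t_change = detect_state_change(probe, frame_stream, action.target_label)
            play_trajectory("stop-heating")
            print(f"state change detected at t={t_change}")
        else:
            # Ordinary steps replay the predefined motion trajectory.
            play_trajectory(action.name)
```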
A new sunny-side-up recipe, arranged to use butter instead of oil, was cooked by the robot in the real world using the proposed system.
Cooking was executed in the same way for an unknown recipe of boiled and sautéed broccoli.
@article{kanazawa2024recipe,
  author    = {Naoaki Kanazawa and Kento Kawaharazuka and Yoshiki Obinata and Kei Okada and Masayuki Inaba},
  title     = {{Real-world cooking robot system from recipes based on food state recognition using foundation models and PDDL}},
  journal   = {Advanced Robotics},
  volume    = {38},
  number    = {18},
  pages     = {1318--1334},
  year      = {2024},
  publisher = {Taylor \& Francis},
  doi       = {10.1080/01691864.2024.2407136},
}
If you have any questions, please feel free to contact Naoaki Kanazawa.