Open-vocabulary generalization requires robotic systems to perform tasks involving complex and diverse environments and task goals. While recent advances in vision language models (VLMs) present unprecedented opportunities to solve unseen problems, how to utilize their emergent capabilities to control robots in the physical world remains an open question. In this paper, we present Marking Open-vocabulary Keypoint Affordances (MOKA), an approach that employs VLMs to solve robotic manipulation tasks specified by free-form language descriptions. At the heart of our approach is a compact point-based representation of affordance and motion that bridges the VLM’s predictions on RGB images and the robot’s motions in the physical world. By prompting a VLM pre-trained on Internet-scale data, our approach predicts the affordances and generates the corresponding motions by leveraging the concept understanding and commonsense knowledge from broad sources. To scaffold the VLM’s zero-shot reasoning, we propose a visual prompting technique that annotates marks on the images, converting the prediction of keypoints and waypoints into a series of visual question answering problems that are feasible for the VLM to solve. Using the robot experiences collected in this way, we further investigate ways to bootstrap the performance through in-context learning and policy distillation. We evaluate and analyze MOKA’s performance on a variety of manipulation tasks specified by free-form language descriptions, such as tool use, deformable body manipulation, and object rearrangement.
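To make the idea of mark-based visual prompting concrete, the following is a minimal sketch and not MOKA's actual implementation. It assumes that candidate points have already been proposed in pixel coordinates, that each point is drawn on the image with a short text label, and that the VLM replies with a small JSON object naming one label per role. The names `Mark`, `make_marks`, `build_question`, and `parse_answer`, the role names `grasp` and `target`, the pixel coordinates, and the mock reply are all hypothetical and chosen only for illustration; no real VLM is called.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Mark:
    """A labeled candidate point (hypothetical structure for illustration)."""
    label: str  # text drawn next to the point on the image, e.g. "P1"
    x: int      # pixel column
    y: int      # pixel row


def make_marks(points, prefix="P"):
    """Assign a short label to each candidate pixel location."""
    return [Mark(f"{prefix}{i + 1}", x, y) for i, (x, y) in enumerate(points)]


def build_question(instruction, marks):
    """Turn keypoint prediction into a multiple-choice style question."""
    labels = ", ".join(m.label for m in marks)
    return (
        f"Task: {instruction}\n"
        f"The image is annotated with candidate points: {labels}.\n"
        'Answer in JSON with keys "grasp" and "target", each set to one label.'
    )


def parse_answer(answer_text, marks):
    """Map the labels chosen in the reply back to pixel coordinates."""
    by_label = {m.label: (m.x, m.y) for m in marks}
    chosen = json.loads(answer_text)
    return {role: by_label[label] for role, label in chosen.items()}


# Assumed candidate points; in practice they would come from the image.
marks = make_marks([(120, 340), (200, 310), (415, 222)])
question = build_question("Close the drawer.", marks)

# Stand-in for a VLM reply; a real system would send the marked image
# together with `question` to the model.
mock_answer = '{"grasp": "P2", "target": "P3"}'
keypoints = parse_answer(mock_answer, marks)
print(keypoints)
```

The point of the sketch is the design choice it illustrates: the model selects among a small set of labeled marks instead of regressing raw coordinates, and the selected labels are mapped back to image points afterwards. Lifting those image points to robot motions in the physical world is outside the scope of this example.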
Given free-form descriptions of the tasks, MOKA can effectively predict the point-based affordance representations and generate the desired motions.
[Figure panels: example task instructions. Two-stage tasks appear as two panels, Subtask 1 and Subtask 2.]
"Move the eyeglasses onto the yellow cloth and use the brush to sweep the snack package to the right side of the table." (Subtasks 1 and 2)
"Use the ultrasound cleaner to clean the metal watch. The ultrasound cleaner has no lid and can be turned on by pressing the red button." (Subtasks 1 and 2)
"Close the drawer."
"Insert the pink roses into the vase."
"Make a gift box containing the perfume bottle. Put some golden filler beneath the perfume." (Subtasks 1 and 2)
"Unplug the charge cable and close the lid of the laptop." (Subtasks 1 and 2)
"Use the fur remover to remove the white fur ball on the sweater."
"Put the scissors in the hand."