Large language models (LLMs) can provide rich physical descriptions of most worldly objects, allowing robots to achieve more informed and capable grasping. We leverage LLMs' common sense physical reasoning and code-writing abilities to infer an object's physical characteristics—mass m, friction coefficient µ, and spring constant k—from a semantic description, and then translate those characteristics into an executable adaptive grasp policy. Using a current-controllable, two-finger gripper with a built-in depth camera, we demonstrate that LLM-generated, physically-grounded grasp policies outperform traditional grasp policies on a custom benchmark of 12 delicate and deformable items including food, produce, toys, and other everyday items, spanning two orders of magnitude in mass and required pick-up force. We also demonstrate how compliance feedback from DeliGrasp policies can aid in downstream tasks such as measuring produce ripeness.
Large language models (LLMs) are able to supervise robot
control in manipulation across high-level step-by-step
task planning, low-level motion planning, and determining
grasp positions conditioned on a given object's
semantic properties. These methods inherently assume that the acts of “picking”
and “placing” are straightforward tasks, and cannot account for
contact-rich manipulation tasks, such as
grasping a paper airplane, deformable plastic
bags containing dry noodles, or ripe produce.
In this work, we propose DeliGrasp, which leverages LLMs'
common-sense physical reasoning
and code-writing abilities to infer the physical characteristics of gripper-object interactions,
including mass, spring constant, and friction, to obtain grasp policies
for these kinds of delicate and deformable objects. We formulate an adaptive grasp controller
with slip detection derived from the inferred characteristics,
endowing any LLM embodied with a current-controllable
gripper with adaptive, open-world grasp skills for objects
spanning a range of weight, size, fragility, and compliance.
Given an object's mass, m, and friction coefficient, µ, the minimum grasp force required to pick the object up is bounded below by the object's slip force and above by the force needed to match the gripper's upward acceleration (2.5 m/s² for the UR5 robot arm).
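These bounds can be sketched in a few lines. This is a minimal illustration, not the paper's code: the single-coefficient friction model (µ·F balancing gravity) and the function/variable names are assumptions, chosen so that m = 200 g and µ = 0.5 reproduce a 3.92 N slip threshold.

```python
G = 9.8  # gravitational acceleration, m/s^2

def grasp_force_bounds(m, mu, a_gripper=2.5):
    """Grasp-force bounds for an object of mass m (kg) and friction
    coefficient mu, assuming the friction force mu*F must balance
    gravity (lower bound, the slip threshold) and additionally supply
    the gripper's upward acceleration a_gripper (upper bound)."""
    f_slip = m * G / mu                  # slip threshold: mu*F = m*g
    f_accel = m * (G + a_gripper) / mu   # also supports upward acceleration
    return f_slip, f_accel

# e.g. m = 200 g, mu = 0.5 -> slip threshold of 3.92 N
f_slip, f_accel = grasp_force_bounds(0.2, 0.5)
```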
We task an LLM (GPT-4) with predicting these quantities for an arbitrary object. To generate grasp policies, we leverage a dual-prompt structure similar to that of Language to Rewards, with an initial grasp “descriptor” prompt which estimates object characteristics and special accommodations, if needed, from the input object description. The “descriptor” prompt produces a structured description, which the subsequent “coder” prompt translates into an executable Python grasp policy that modulates gripper compliance, force, and aperture according to the controller described above.
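The descriptor-then-coder flow can be sketched as below. This is a hypothetical illustration: the prompt wording, the `query_llm()` stub, and the direct computation of policy parameters stand in for the actual DeliGrasp prompts and the Python policy that GPT-4 generates.

```python
import json

def query_llm(prompt):
    # Stand-in for a GPT-4 API call; returns a canned descriptor reply.
    return json.dumps({"m": 0.2, "mu": 0.5, "k": 500.0,
                       "notes": "delicate; start with a low contact force"})

def grasp_descriptor(object_description):
    """'Descriptor' stage: estimate mass, friction, and compliance as a
    structured description of the object."""
    prompt = (f"Estimate the mass m (kg), friction coefficient mu, and "
              f"spring constant k (N/m) of: {object_description}. "
              f"Reply as JSON with keys m, mu, k, notes.")
    return json.loads(query_llm(prompt))

def grasp_coder(desc, g=9.8):
    """'Coder' stage: translate the description into grasp-policy
    parameters (computed directly here, rather than emitted as
    LLM-generated Python code)."""
    return {"f_min": desc["m"] * g / desc["mu"], "k": desc["k"]}

policy = grasp_coder(grasp_descriptor("a ripe avocado"))
```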
We test DeliGrasp on a UR5 robot
arm and a MAGPIE gripper with a palm-integrated Intel RealSense D405
camera, looking top-down at a tabletop, against
a dataset of 12 delicate and deformable objects.
We compare our method against 3 baselines:
1) a direct estimation baseline, in which DeliGrasp directly estimates the parameters of the adaptive grasping algorithm (contact force, force gain, and aperture gain);
2) a perception-informed baseline, where the gripper closes to a visually determined object width;
3) a traditional force-limited baseline, where the gripper closes until it cannot output any more force (set to 4 N).
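The adaptive grasping loop these parameters feed can be sketched as a simple force-feedback routine. This is an illustrative simulation under assumed values: the linear-spring object model, fixed aperture step, and omission of the force gain are simplifications, not the paper's controller.

```python
class SpringObject:
    """Simulated linear-spring object standing in for real gripper I/O."""
    def __init__(self, k, width, aperture):
        self.k, self.width, self.aperture = k, width, aperture
    def read_force(self):
        # Hooke's-law contact force once the fingers compress the object
        return self.k * max(0.0, self.width - self.aperture)
    def step_close(self, d):
        self.aperture -= d

def adaptive_grasp(f_contact, f_min, aperture_step, read_force, step_close):
    # approach phase: close until first contact is detected
    while read_force() < f_contact:
        step_close(aperture_step)
    # squeeze phase: keep tightening until the applied force
    # reaches the slip setpoint f_min
    f = read_force()
    while f < f_min:
        step_close(aperture_step)
        f = read_force()
    return f

# 50 mm wide object with k = 500 N/m, gripper starting at first touch
obj = SpringObject(k=500.0, width=0.05, aperture=0.05)
final = adaptive_grasp(f_contact=0.1, f_min=3.92, aperture_step=0.0005,
                       read_force=obj.read_force, step_close=obj.step_close)
```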
As shown below, DeliGrasp outperforms the baselines on 8/12 objects.
Where force-limited grasps deform objects and visually-informed grasps slip, DeliGrasp
successfully picks up objects with minimal deformation. While the direct estimation baseline is closer in performance,
we observe higher volatility in estimated parameters, due to a lack
of common-sense physical grounding, and thus, lower reliability and interpretability.
DeliGrasp failures are primarily slip failures, as the simple controller is not robust to nonlinearly and/or highly compliant objects such
as the bag of noodles, bag of rice, spray bottle, and stuffed animal. Deformation failures did not occur, despite occasional mass overestimations,
because the applied force setpoint, F_min, is set to the object's estimated slip force rather than its maximum force. We show some of these failures below.
''' Estimated characteristics:
m: 200g
µ: 0.5
k: 500 N/m
F_min: 3.92 N
This grasp sets a lower initial force of 0.5 N because gentle pressing is required to assess ripeness without causing indentations.
'''
''' User: Given this image of avocados and their corresponding spring constants, pick out the best avocado for guacamole today.'''
''' GPT-4V: Unfortunately, the image has a resolution that does not allow for a detailed inspection of the avocado's color. Avocado with spring constant k1 would be the best choice for guacamole today as it would be the softest and most ripe of the three, which is desirable for making guacamole.'''
How does this differ from Language to Rewards and Code as Policies?
DeliGrasp's primary function is estimating object mass, friction, and compliance, and modulating the computed grasp force depending on the high-level task description. We pair it with an adaptive grasping algorithm derived from traditional adaptive grasp controllers and tailored to these estimated characteristics. DeliGrasp parameterizes low-level, contact-rich manipulation, rather than directly designing low-level motion with reward functions or pre-defined robot primitive skills.
How does the controller compare to classic adaptive grasping methods?
What else can DeliGrasp pick up?
But can DeliGrasp pick up deli meats?
DeliGrasp: Grasp Descriptor | Grasp Coder
The authors would like to thank Eric Xue, Enora Rice, Shivendra Agrawal, and James Watson for their feedback and support.
The website template is adapted from Language to Rewards and ProgPrompt.