Large language models (LLMs) can provide rich physical descriptions of most worldly objects, allowing robots to achieve more informed and capable grasping. We leverage LLMs' common sense physical reasoning and code-writing abilities to infer an object's physical characteristics—mass m, friction coefficient µ, and spring constant k—from a semantic description, and then translate those characteristics into an executable adaptive grasp policy. Using a two-finger gripper with a built-in depth camera that can control its torque by limiting motor current, we demonstrate that LLM-parameterized but first-principles grasp policies outperform both traditional adaptive grasp policies and direct LLM-as-code policies on a custom benchmark of 12 delicate and deformable items including food, produce, toys, and other everyday items, spanning two orders of magnitude in mass and required pick-up force. We then improve property estimation and grasp performance on variable size objects with model finetuning on property-based comparisons and eliciting such comparisons via chain-of-thought prompting. We also demonstrate how compliance feedback from DeliGrasp policies can aid in downstream tasks such as measuring produce ripeness.
Large language models (LLMs) are able to supervise robot
control in manipulation across high-level step-by-step
task planning, low-level motion planning, and determining
grasp positions conditioned on a given object's
semantic properties. These methods inherently assume that the acts of “picking”
and “placing” are straightforward, and cannot account for
contact-rich manipulation tasks such as
grasping a paper airplane, a deformable plastic
bag containing dry noodles, or ripe produce.
In this work, we propose DeliGrasp, which leverages LLMs'
common-sense physical reasoning
and code-writing abilities to infer the physical characteristics of gripper-object interactions,
including mass, spring constant, and friction, to obtain grasp policies
for these kinds of delicate and deformable objects. We formulate an adaptive grasp controller
with slip detection derived from the inferred characteristics,
endowing LLMs, embodied with any current-controllable
gripper, with adaptive, open-world grasp skills for objects
spanning a range of weight, size, fragility, and compliance.
Given an object's mass, m, and friction coefficient, µ, the minimum grasp force required to pick it up is bounded below by the force at which the object slips under gravity and above by the force needed to hold it through the gripper's upward acceleration (2.5 m/s² for the UR5 robot arm).
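As a sketch, these bounds follow from balancing friction against gravity, with the upper bound adding the arm's upward acceleration. The function name and constants below are illustrative, not from the released code:

```python
# Sketch of the grasp-force bounds described above.
# F = m*g/mu is the slip threshold (friction balances gravity);
# adding the arm's upward acceleration a raises the requirement.
G = 9.8        # gravitational acceleration, m/s^2
A_ARM = 2.5    # UR5 upward acceleration, m/s^2

def grasp_force_bounds(m: float, mu: float) -> tuple[float, float]:
    """Return (F_slip, F_accel) in Newtons for mass m (kg) and friction mu."""
    f_slip = m * G / mu              # just enough friction to hold the object
    f_accel = m * (G + A_ARM) / mu   # enough to hold during upward motion
    return f_slip, f_accel

# Example: a 200 g object with mu = 0.5
lo, hi = grasp_force_bounds(0.2, 0.5)   # lo = 3.92 N
```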
We task an LLM (GPT-4) with predicting these quantities for an arbitrary object. To generate grasp policies, we leverage a dual-prompt structure similar to that of Language to Rewards, with an initial grasp “descriptor” prompt which estimates object characteristics and special accommodations, if needed, from the input object description. The “descriptor” prompt produces a structured description, which the subsequent “coder” prompt translates into an executable Python grasp policy that modulates gripper compliance, force, and aperture according to the controller described above.
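The handoff between the two prompts can be pictured as parsing the descriptor's structured output into policy parameters for the coder stage. Everything here (field names, regexes, the g-to-kg convention) is an illustrative reconstruction of that interface, not the released prompt format:

```python
import re

# Hypothetical parser for the descriptor prompt's structured output,
# mirroring the fields in the example trace (m, µ, k).
def parse_descriptor(text: str) -> dict[str, float]:
    fields = {
        "m_kg": r"m:\s*([\d.]+)\s*g",       # mass, reported in grams
        "mu": r"µ:\s*([\d.]+)",             # friction coefficient
        "k_N_per_m": r"k:\s*([\d.]+)\s*N/m",  # spring constant
    }
    out = {}
    for name, pattern in fields.items():
        match = re.search(pattern, text)
        assert match, f"descriptor missing field: {name}"
        out[name] = float(match.group(1))
    out["m_kg"] /= 1000.0                    # grams -> kilograms
    out["F_min_N"] = out["m_kg"] * 9.8 / out["mu"]  # slip-force bound
    return out

desc = "m: 200g\nµ: 0.5\nk: 500 N/m"
params = parse_descriptor(desc)              # F_min ≈ 3.92 N
```

In the real pipeline the coder prompt receives this structured description as text and emits the policy code itself; the parse above only illustrates how the estimated characteristics determine the policy's force target.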
We test DeliGrasp on a UR5 robot
arm and a MAGPIE gripper looking top-down on a table
with a palm-integrated Intel RealSense D-405 camera against
a dataset of 12 delicate and deformable objects.
We compare our method against 5 baselines:
1) in-place adaptive grasping with 2N and 10N grasp force thresholds,
2) in-motion adaptive grasping with 0.5N and 1.5N initial force,
3) a direct estimation baseline
in which DeliGrasp directly estimates the parameters of the adaptive grasping algorithm, contact force, force gain, and aperture gain,
4) a perception-informed baseline, where the gripper closes to a visually-determined object width, and
5) a traditional force-limited baseline, where the gripper closes until it cannot output any more force (set to 4 N).
We also include two additional DeliGrasp configurations: 1) with
PhysObjects finetuning and 2) with PhysObjects finetuning and chain-of-thought prompting.
As shown below, DeliGrasp performs equivalently or better than the baselines
on 10/12 objects and outperforms the baselines outright on 5/12 objects.
Where force-limited grasps deform objects and visually-informed grasps slip, DeliGrasp
successfully picks up objects with minimal deformation. While the direct estimation baseline is closer in performance,
we observe higher volatility in its estimated parameters, due to a lack
of common-sense physical grounding, and thus lower reliability and interpretability.
While the traditional adaptive grasping methods are more successful, particularly the "In Motion"
strategy, traditional methods are not as versatile as DeliGrasp
and are further dependent on their expert-tuned parameters for performance.
"In Motion" and "In Place" strategies with lower force settings are more successful
on lighter objects and the strategies with higher force settings are more successful
on the heavier objects. All traditional methods, however, crush highly delicate and light
objects like the raspberry. We conclude that the LLM's common-sense
reasoning is what enables such a wide grasping range in object mass and stiffness.
DeliGrasp failures are primarily slip failures, as the simple controller is not robust to nonlinear and/or highly compliant objects such
as the bag of noodles, bag of rice, spray bottle, and stuffed animal. Deformation failures did not occur, despite occasional mass overestimations,
because the applied force goalpoint, F_min, is set to the estimated object's slip force rather than its maximum force. We show some of these failures below.
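The adaptive loop implied by this failure analysis can be sketched as: tighten the grip in small increments until the measured force reaches the slip-force target F_min, giving up if a highly compliant object never produces enough resistance. The gripper interface and gain below are hypothetical stand-ins, not the MAGPIE API:

```python
# Minimal sketch of an adaptive grasp loop: tighten until the applied
# force reaches the estimated slip force F_min, or give up.
def adaptive_grasp(read_force, step_aperture, f_min: float,
                   force_gain: float = 0.1, max_steps: int = 200) -> bool:
    """Close in small increments; return True once f_min is reached."""
    for _ in range(max_steps):
        if read_force() >= f_min:
            return True               # enough force to lift without slip
        step_aperture(force_gain)     # tighten slightly and re-measure
    return False                      # never built up force: too compliant

# Toy stand-in: contact force rises linearly with closure (spring-like)
state = {"f": 0.0}
ok = adaptive_grasp(lambda: state["f"],
                    lambda gain: state.__setitem__("f", state["f"] + gain),
                    f_min=3.92)
```

A real object with nonlinear compliance breaks the linear-rise assumption in the toy sensor above, which is exactly the failure mode described for the bag of noodles and stuffed animal.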
''' Estimated characteristics:
m: 200g
µ: 0.5
k: 500 N/m
F_min: 3.92 N
This grasp sets the initial force to 0.5 N because the task requires gentle pressing to assess ripeness without causing indentations.
'''
''' User: Given this image of avocados and their corresponding spring constants, pick out the best avocado for guacamole today.'''
''' GPT-4V: Unfortunately, the image has a resolution that does not allow for a detailed inspection of the avocado's color. Avocado with spring constant k1 would be the best choice for guacamole today as it would be the softest and most ripe of the three, which is desirable for making guacamole.'''
How does this differ from Language to Rewards and Code as Policies?
DeliGrasp's primary function is the estimation of object mass, friction, and compliance, and modulating the computed grasp force depending on the high-level task description. We pair it with an adaptive grasping algorithm derived from traditional adaptive grasp controllers and tailored for these estimated characteristics. DeliGrasp parameterizes low-level contact rich manipulation, rather than directly design low-level motion with reward functions or pre-defined robot primitive skills.
How does the controller compare to classic adaptive grasping methods?
What else can DeliGrasp pick up?
But can DeliGrasp pick up deli meats?
DeliGrasp: Grasp Descriptor | Grasp Descriptor with CoT Prompting | Grasp Coder
The authors would like to thank Eric Xue, Enora Rice, Shivendra Agrawal, James Watson, and Max Conway for their feedback and support.
The website template is adapted from Language to Rewards and ProgPrompt.