This paper enables robots to learn physical skills without expert demonstrations by using an LLM to propose a parameterized reward structure and then tuning those parameters by aligning the reward function's ranking of trajectories with the LLM's preferences.
Core Problem
Designing reward functions for physical skills is manual and tedious, while LLM-generated rewards often fail due to poor numerical reasoning and inability to incorporate environmental feedback.
Why it matters:
Manual reward engineering requires extensive domain knowledge and trial-and-error, bottlenecking the scaling of robotic skills
Inverse Reinforcement Learning (IRL) relies on expert demonstrations, which are costly and difficult to collect for high-precision physical tasks
Directly using LLMs to write reward code often results in unstable or physically infeasible rewards because LLMs lack grounding in physical dynamics
Concrete Example: In a 'pushing' task, a standard LLM might assign a low reward weight (e.g., 2.0) to the pushing term and a high weight to reaching. The robot exploits this by merely touching the object without pushing it. The proposed method detects this sub-optimality via ranking and automatically increases the push weight to 21.05, forcing the correct behavior.
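The exploit can be reproduced with a toy version of such a weighted reward. The feature names, functional form, and distance values below are illustrative assumptions, not the paper's actual generated code; only the weight values 2.0 and 21.05 come from the example above.

```python
# Toy parameterized reward for the pushing task (illustrative sketch,
# not the paper's generated code). Distances are in arbitrary units.

def push_reward(dist_ee_to_obj, dist_obj_to_goal, w_reach=1.0, w_push=2.0):
    """Weighted sum of a reaching term and a pushing term."""
    reach_term = -w_reach * dist_ee_to_obj   # encourages touching the object
    push_term = -w_push * dist_obj_to_goal   # encourages moving it to the goal
    return reach_term + push_term

# With the low initial w_push, a touch-only trajectory can outscore one
# that actually pushes the object to the goal (reward hacking):
touch_only = push_reward(0.0, 0.4)                  # perfect reach, no push
real_push = push_reward(0.9, 0.0)                   # object at goal, hand far
exploit = touch_only > real_push                    # True: touching wins

# Raising w_push (the method lands on 21.05) flips the ordering:
touch_only_hi = push_reward(0.0, 0.4, w_push=21.05)
real_push_hi = push_reward(0.9, 0.0, w_push=21.05)
fixed = real_push_hi > touch_only_hi                # True: pushing wins
```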
Key Novelty
Iterative Self-Alignment for Reward Parameterization
Decomposes reward learning into two steps: (1) the LLM generates a Python reward template with tunable hyperparameters (feature selection), and (2) an iterative loop tunes those parameters by asking the LLM to rank robot trajectories
Treats the LLM not just as a code generator, but as a 'discriminator' or pseudo-expert that provides preference feedback to guide the numerical optimization of the reward function
Uses a 'failure analysis' prompt when rankings agree but the task fails, explicitly asking the LLM to identify blocking factors and suggest parameter updates
Architecture
The Self-Alignment Reward Update process (Algorithm 1) showing the interaction between the policy, replay buffer, and LLM.
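A pseudocode sketch of this loop as described in the text (the function names are illustrative stand-ins for the LLM calls and DrQ-v2 training, not the paper's actual code):

```
# Sketch of the Self-Alignment Reward Update loop (Algorithm 1).
template, theta = LLM.propose_reward(task_description)    # structure + initial params
repeat:
    policy, buffer = DrQv2.train(reward=template(theta))  # inner loop: policy learning
    trajectories = sample(buffer)
    llm_ranking = LLM.rank(task_description, trajectories)
    if ranking_under(theta, trajectories) != llm_ranking:
        theta = metropolis_hastings_update(theta, trajectories, llm_ranking)
    elif not task_success(policy):                        # rankings agree but task fails
        theta = LLM.failure_analysis(template, theta, trajectories)
until task_success(policy)
```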
Evaluation Highlights
Touching task: Reaches 100% success in ~24,600 steps using Self-Alignment, compared to 93,200 steps for the fixed LLM-generated reward (approx. 3.8x faster)
Grasping task: Achieves 100% success in 19,000 steps, whereas the sparse reward baseline fails to learn the task entirely within the compute budget
Pushing task: Automatically corrects a sub-optimal 'touch-only' policy by identifying the need to increase pushing weight from 2.0 to 21.05 via preference alignment
Breakthrough Assessment
7/10
A clever integration of LLMs as both architects (code generation) and critics (ranking), addressing the specific weakness of LLM numerical reasoning in robotics. Limited by simulation-only validation.
⚙️ Technical Details
Problem Definition
Setting: Finite-horizon Markov Decision Process (MDP) without a ground-truth reward function or expert demonstrations
Inputs: Natural language task description (e.g., 'touch the block') and access to an LLM
Outputs: A parameterized reward function R_theta that induces an optimal policy
Self-Alignment (Outer Loop: Update reward params via ranking)
System Modules
Reward Proposer
Generate reward function structure and initial parameters
Model or implementation: Large Language Model (specific version not reported)
Policy Learner
Learn the optimal policy given the current reward function
Model or implementation: DrQ-v2 (Off-policy RL)
Alignment Optimizer
Update reward parameters to match LLM trajectory preferences
Model or implementation: Metropolis-Hastings sampler (Bayesian inference)
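A minimal sketch of how a random-walk Metropolis-Hastings step could tune a scalar reward weight toward preference feedback. The one-parameter toy likelihood (return = theta × feature), proposal scale, step count, and preference pairs are illustrative assumptions, not the paper's settings; only beta = 0.9 matches the reported hyperparameter.

```python
import math
import random

def log_likelihood(theta, pairs, beta=0.9):
    """Boltzmann-rational log-likelihood of preference pairs.

    pairs: (feature of preferred traj, feature of the other traj);
    in this toy model a trajectory's return is theta * feature.
    """
    total = 0.0
    for f_pref, f_other in pairs:
        d = beta * theta * (f_pref - f_other)
        total += -math.log1p(math.exp(-d))  # log sigmoid(d)
    return total

def mh_update(pairs, theta0=2.0, n_steps=2000, step=0.5, seed=0):
    """Random-walk Metropolis-Hastings over the reward weight theta."""
    rng = random.Random(seed)
    theta, logp = theta0, log_likelihood(theta0, pairs)
    for _ in range(n_steps):
        prop = theta + rng.gauss(0.0, step)            # symmetric proposal
        logp_prop = log_likelihood(prop, pairs)
        if math.log(rng.random()) < logp_prop - logp:  # accept w.p. min(1, ratio)
            theta, logp = prop, logp_prop
    return theta

# Preferences where the preferred trajectory usually has the larger feature
# pull theta toward larger values; one opposing pair keeps the posterior
# peaked rather than unbounded.
pairs = [(1.0, 0.0)] * 3 + [(0.0, 1.0)]
theta_hat = mh_update(pairs)
```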
Novel Architectural Elements
Double-loop optimization where the outer loop updates reward parameters based on LLM-generated rankings rather than expert demonstrations
Use of a 'Failure Analysis' prompt loop: if rankings align but the task fails, the LLM explicitly analyzes the blocking factor and suggests parameter updates
Modeling
Base Model: Large Language Model (specific version not reported in paper)
Training Method: Iterative Self-Alignment (Bayesian Inference via Metropolis-Hastings)
Objective Functions:
Purpose: Minimize discrepancy between learned reward ranking and LLM ranking.
Formally: L(theta) = - sum_{(tau_i, tau_j) in D} log P[tau_i > tau_j], where P is the Boltzmann-rational probability based on R_theta.
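A concrete reading of this objective, with toy trajectory returns (beta = 0.9 as reported; this is a sketch of the Boltzmann-rational form, not the paper's code):

```python
import math

def pref_prob(ret_i, ret_j, beta=0.9):
    """Boltzmann-rational P[tau_i > tau_j] given trajectory returns under R_theta."""
    # Numerically stable two-way softmax over the scaled returns.
    m = max(beta * ret_i, beta * ret_j)
    ei = math.exp(beta * ret_i - m)
    ej = math.exp(beta * ret_j - m)
    return ei / (ei + ej)

def ranking_loss(pairs, beta=0.9):
    """L(theta): negative log-likelihood of the LLM's preferences.

    pairs: (return of the LLM-preferred trajectory, return of the other one),
    both computed under the current R_theta.
    """
    return -sum(math.log(pref_prob(ri, rj, beta)) for ri, rj in pairs)

# A parameterization whose returns agree with the LLM ranking scores lower:
aligned = ranking_loss([(5.0, 1.0), (3.0, 0.5)])
misaligned = ranking_loss([(1.0, 5.0), (0.5, 3.0)])
```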
Key Hyperparameters:
beta: 0.9 (Boltzmann temperature)
reward_update_frequency: Every 10,000 steps
Compute: Not reported in the paper
Comparison to Prior Work
vs. IRL: Does not require expert demonstrations; uses LLM internal knowledge as the 'expert'
vs. Zero-shot LLM (e.g., Eureka/Text2Reward): Includes a feedback loop where the LLM critiques physical rollouts to tune numerical parameters, rather than just generating code once
vs. Pebble [cited in paper]: Uses LLM for ranking feedback instead of human feedback
Limitations
Relies on the LLM correctly understanding the task semantics from text descriptions
Requires one-time human intervention if the initial LLM-generated code has compilation errors
Evaluated only in simulation; real-world transfer not tested
Specific LLM model version not disclosed, affecting reproducibility
Reproducibility
No code repository provided. Full prompts are available in the Appendix. The specific LLM version (e.g., GPT-3.5 or GPT-4) is not explicitly stated in the text.
📊 Experiments & Results
Evaluation Setup
Robotic manipulation skills in PyBullet simulation
Benchmarks:
Touching (Reach and maintain contact) [New]
Grasping and Lifting (Pick and lift object) [New]
Pushing to Goal (Push object to target coordinates) [New]
Metrics:
Success Rate (%)
Training Steps to Convergence
Statistical methodology: Experiments run over 5 different seeds
Key Results
Benchmark | Metric | Baseline | This Paper | Δ
Touching | Steps to 100% Success | 93,200 | 24,600 | -68,600
Grasping and Lifting | Steps to 100% Success | 45,000 | 19,000 | -26,000
Pushing | Weight Value (Push vs. Reach) | 2.0 | 21.05 | +19.05
Experiment Figures
Training curves (Success Rate vs. Environment Steps) for Touching, Grasping, and Pushing tasks.
Main Takeaways
Self-alignment significantly accelerates training compared to fixed LLM rewards (up to ~3.8x faster on Touching).
The method enables learning complex tasks (Grasping) where sparse rewards fail completely due to exploration challenges.
The feedback loop effectively corrects numerical hallucinations in LLM code (e.g., adjusting a force threshold from 1 N to 0.127 N).
LLMs can serve as effective discriminators (rankers) even if they struggle as direct generators of numerical reward values.
📚 Prerequisite Knowledge
Prerequisites
Reinforcement Learning (MDPs, Policies)
Inverse Reinforcement Learning (IRL)
Large Language Models (Prompting, CoT)
Key Terms
IRL: Inverse Reinforcement Learning—learning a reward function by observing expert behavior
CoT: Chain-of-Thought—a prompting technique where the model explains its reasoning step-by-step before giving a final answer
DrQ-v2: Data-Regularized Q-learning v2—an off-policy reinforcement learning algorithm designed for visual continuous control tasks
Metropolis-Hastings: A Markov Chain Monte Carlo (MCMC) algorithm used to sample from a probability distribution; used here to update reward parameters
Boltzmann-rational model: A probabilistic model assuming an agent (or LLM) is more likely to prefer a trajectory with higher total reward, used to interpret noisy rankings
Self-Alignment: The paper's method of using the LLM's own semantic understanding (via ranking) to correct the numerical parameters of the code it generated