A Large Language Model-Driven Reward Design Framework via Dynamic Feedback for Reinforcement Learning

📝 Paper Summary

LLM-driven Reward Design Automated Reinforcement Learning Reward Engineering

CARD is an automated framework that uses a Coder-Evaluator architecture to iteratively design and refine RL reward functions using dynamic feedback and trajectory preferences without human intervention.

Core Problem

Manually designing reward functions for RL is difficult and expensive, while existing LLM-based methods suffer from hallucinations or require extensive human feedback and repetitive, costly RL training loops.

Why it matters:

Real-world tasks often lack well-defined reward functions, blocking RL adoption.
Inverse RL requires expensive expert demonstrations which are hard to obtain.
Existing LLM methods waste tokens on parallel sampling or require humans to manually correct code or analyze trajectories.

Concrete Example: In a robotic manipulation task, an LLM might generate a reward function that runs without errors but fails to encourage the correct movement. Previous methods would discover this only by re-running full RL training or by asking a human to debug. CARD instead checks trajectory preferences before committing to a training run, and analyzes trajectories after training to guide the next revision.

Key Novelty

Coder-Evaluator Reward Design (CARD)

Splits the design process into a Coder (generates/refines code) and an Evaluator (analyzes performance), simulating a developer-tester loop.
Introduces Trajectory Preference Evaluation (TPE) to filter poor reward functions by checking if successful trajectories actually get higher rewards than failed ones, skipping unnecessary RL training.

Architecture

The overall CARD framework pipeline, illustrating the interaction between the Coder and Evaluator.

Evaluation Highlights

Surpasses human oracle performance on 3 tasks (e.g., Push-Wall, Pick-Place) in Meta-World and ManiSkill2.
Achieves better or comparable performance to expert-designed rewards on 10 out of 12 tested tasks.
Reduces token consumption and training time compared to Eureka (SOTA baseline) by avoiding parallel sampling and unnecessary training runs; average Meta-World token usage falls from 39,097 for the reported baseline to 15,370, about 61% fewer tokens.

Breakthrough Assessment

8/10

Strong methodological contribution: the TPE mechanism prunes the search space efficiently, addressing the computational cost that is a major bottleneck in LLM-based reward design.

⚙️ Technical Details

Problem Definition

Setting: Markov Decision Process (MDP) defined by (S, A, P, R, gamma), where the reward function R is unknown and must be generated as code.

Inputs: Environment description code (Pythonic style) and natural language task description.

Outputs: Executable Python code for a reward function R(s, a) that maximizes task success.
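To make the expected output concrete, here is a minimal sketch of what a generated reward function for a reach-style task could look like. The observation keys (`gripper_pos`, `goal_pos`), the weights, and the 0.05 success threshold are all hypothetical and chosen for illustration; they are not taken from the paper or from either benchmark's API.

```python
import math
from typing import Mapping, Sequence


def compute_reward(obs: Mapping[str, Sequence[float]],
                   action: Sequence[float]) -> float:
    """Hypothetical dense reward R(s, a) for a reach-style task."""
    # Distance between the gripper and the goal (assumed observation keys).
    dist = math.dist(obs["gripper_pos"], obs["goal_pos"])

    # Sub-reward 1: move closer to the goal.
    reach_reward = -dist
    # Sub-reward 2: small penalty on large actions (illustrative weight).
    action_penalty = -0.01 * sum(a * a for a in action)
    # Sub-reward 3: bonus once the gripper is within an assumed threshold.
    success_bonus = 1.0 if dist < 0.05 else 0.0

    return reach_reward + action_penalty + success_bonus
```

Splitting the reward into named sub-rewards mirrors the summary's mention of sub-reward values being tracked during training, which gives the feedback loop something specific to report on.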

Pipeline Flow

Initialization: Coder generates initial reward code from Env/Task description
Static Check: Code is verified for syntax/runtime errors via lightweight test
Reward Introspection Loop:
Evaluator -> TPE Check -> (If Pass) RL Training -> Process/Trajectory Feedback
Evaluator -> TPE Check -> (If Fail) Preference Feedback (Skip Training)
Reward Improvement: Coder refines code based on feedback
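The flow above can be sketched as a single control loop. This is an illustrative outline under stated assumptions, not the authors' implementation: the callables `generate`, `static_check`, `tpe_passes`, and `train` are hypothetical stand-ins for the Coder, the static check, the TPE gate, and RL training plus feedback generation. The sketch also assumes the first iteration always trains, since TPE needs trajectories to compare.

```python
from typing import Callable, List, Tuple


def card_loop(
    generate: Callable[[str], str],                  # Coder: feedback -> reward code
    static_check: Callable[[str], bool],             # syntax/runtime smoke test
    tpe_passes: Callable[[str, list], bool],         # TPE gate on stored trajectories
    train: Callable[[str], Tuple[list, str, bool]],  # -> (trajectories, feedback, solved)
    max_iterations: int = 5,
) -> Tuple[str, List[str]]:
    feedback = ""       # empty feedback means "initial generation"
    trajectories: list = []
    log: List[str] = []
    code = ""
    for _ in range(max_iterations):
        code = generate(feedback)
        if not static_check(code):
            # Assumption: a failed static check is reported back to the Coder.
            feedback = "static check failed"
            log.append("static_fail")
        elif trajectories and not tpe_passes(code, trajectories):
            # Preference feedback: skip RL training entirely.
            feedback = ("preference feedback: successful trajectories "
                        "were not ranked above failed ones")
            log.append("tpe_fail")
        else:
            # Full RL training, followed by process/trajectory feedback.
            trajectories, feedback, solved = train(code)
            log.append("trained")
            if solved:
                break
    return code, log
```

The point of the structure is that the expensive branch (`train`) is reached only when the cheaper checks pass, which is where the claimed savings in training time would come from.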

System Modules

Coder

Generates and refines reward function code based on prompts and feedback

Model or implementation: GPT-4

Evaluator

Assesses code quality via TPE or RL training and generates structured feedback

Model or implementation: Deterministic Python Script (not an LLM)

Novel Architectural Elements

Trajectory Preference Evaluation (TPE) module that acts as a gatekeeper before RL training to verify order-preserving property of rewards.
Dual-feedback loop mechanism where feedback type switches dynamically between 'Preference' (fast, no train) and 'Process/Trajectory' (slow, post-train) based on validation results.

Modeling

Base Model: GPT-4 (gpt-4-0613) for Coder; Soft Actor-Critic (SAC) for RL agent

Training Method: Soft Actor-Critic (SAC) for the RL agent; Iterative prompting for the Reward Function

Objective Functions:

Purpose: Maximize expected return using generated reward.

Formally: Standard RL objective J(π) = E[sum(gamma^t * r_t)]

Purpose: Verify order-preserving property of reward.

Formally: Check if AverageReturn(Successful_Traj) > AverageReturn(Failed_Traj)
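The order-preserving check can be sketched in a few lines. This is a minimal illustration assuming each trajectory is a list of (state, action) pairs and that returns are compared by their mean. The summary does not say whether returns are discounted, so `gamma` defaults to 1.0 here as an assumption, and the function names are hypothetical.

```python
from statistics import mean
from typing import Callable, Sequence, Tuple

Step = Tuple[object, object]  # (state, action); structure assumed for illustration


def trajectory_return(reward_fn: Callable[[object, object], float],
                      steps: Sequence[Step],
                      gamma: float = 1.0) -> float:
    """Sum of gamma^t * r_t over one trajectory under a candidate reward."""
    return sum((gamma ** t) * reward_fn(s, a) for t, (s, a) in enumerate(steps))


def tpe_check(reward_fn: Callable[[object, object], float],
              successful: Sequence[Sequence[Step]],
              failed: Sequence[Sequence[Step]],
              gamma: float = 1.0) -> bool:
    """True if successful trajectories earn a higher average return than failed ones."""
    if not successful or not failed:
        raise ValueError("TPE needs at least one successful and one failed trajectory")
    avg_success = mean(trajectory_return(reward_fn, t, gamma) for t in successful)
    avg_failed = mean(trajectory_return(reward_fn, t, gamma) for t in failed)
    return avg_success > avg_failed
```

Because the check only re-scores stored trajectories with the new reward code, it needs no environment interaction, which is why it can act as a cheap gate before training.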

Key Hyperparameters:

rl_algorithm: SAC
num_envs: 1 (ManiSkill2), 10 (Meta-World)
train_steps_per_iteration: 1,000,000 (ManiSkill2), 500,000 (Meta-World)
max_iterations: 5
samples_per_iteration: 1 (CARD), 16 (Eureka baseline)
temperature: 1.0 (LLM generation)
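For reference, the listed hyperparameters can be collected into a small configuration object. The class and field layout below are hypothetical and only restate the values given above; the helper computes an upper bound on training steps, assuming every iteration runs a full training pass (TPE failures would lower the actual figure).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CardConfig:
    benchmark: str
    num_envs: int
    train_steps_per_iteration: int
    rl_algorithm: str = "SAC"
    max_iterations: int = 5
    samples_per_iteration: int = 1   # the Eureka baseline is listed with 16
    temperature: float = 1.0         # LLM generation temperature


MANISKILL2 = CardConfig("ManiSkill2", num_envs=1, train_steps_per_iteration=1_000_000)
METAWORLD = CardConfig("Meta-World", num_envs=10, train_steps_per_iteration=500_000)


def max_training_steps(cfg: CardConfig) -> int:
    """Upper bound on RL steps if no iteration is skipped by TPE."""
    return cfg.train_steps_per_iteration * cfg.max_iterations * cfg.samples_per_iteration
```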

Compute: Single NVIDIA RTX 3090 or RTX 4090 GPU used for experiments.

Comparison to Prior Work

vs. Eureka: CARD uses serial iterative improvement with dynamic feedback (TPE) rather than parallel evolutionary search, reducing token/training costs.
vs. Text2Reward: CARD is fully automated without human feedback loops.
vs. Code-as-Reward [not cited in paper]: Focuses on dynamic feedback refinement rather than just initial code generation.

Limitations

Depends on the quality of the proprietary LLM (GPT-4); performance drops with weaker models.
Requires an initial set of trajectories (or one training run) to kickstart the TPE process.
Feedback generation is rule-based and might miss semantic nuances that an LLM-based evaluator could catch.

Reproducibility

Code: https://github.com/ShengjieSun29/CARD

The paper provides its prompts in Appendix C. Environment details for Meta-World and ManiSkill2 are standard. GPT-4 API costs are noted as a factor when reproducing the results.

📊 Experiments & Results

Evaluation Setup

Robotic manipulation tasks in simulation.

Benchmarks:

Meta-World: robotic manipulation (10 tasks, e.g., Reach, Push, Drawer-Open)
ManiSkill2: robotic manipulation (2 tasks: PickCube, TurnFaucet)

Metrics:

Success Rate
Token Consumption
Training Time
Statistical methodology: Results averaged over 5 random seeds (Meta-World) or 3 random seeds (ManiSkill2). Standard deviation reported.
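As a small illustration of the aggregation described above, the following sketch computes a mean and standard deviation over per-seed success rates. The summary does not state whether the sample or population standard deviation is reported; this sketch assumes the sample version. The function name and any input values are hypothetical.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def summarize_seeds(success_rates: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) across random seeds."""
    if len(success_rates) < 2:
        raise ValueError("need at least two seeds to estimate a standard deviation")
    return mean(success_rates), stdev(success_rates)
```

With 3 seeds (ManiSkill2) or 5 seeds (Meta-World), the estimate of spread is coarse, so small differences between methods should be read with that in mind.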

Key Results

Performance comparisons on Meta-World and ManiSkill2 tasks showing CARD matches or exceeds baselines.
Benchmark	Metric	Baseline	This Paper	Δ
Meta-World (Push Wall)	Success Rate	0.84	1.00	+0.16
Meta-World (Drawer Close)	Success Rate	0.98	1.00	+0.02
ManiSkill2 (Pick Cube)	Success Rate	0.68	0.94	+0.26
Meta-World (Reach)	Success Rate	1.00	1.00	0.00

Efficiency analysis showing CARD uses fewer resources.
Benchmark	Metric	Baseline	This Paper	Δ
Meta-World (Avg across tasks)	Token Usage (Total Tokens)	39,097	15,370	-23,727
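The Δ column and the relative token saving can be recomputed directly from the reported figures. The snippet below is a worked check using only the numbers in the table; the helper names are hypothetical.

```python
def delta(baseline: float, ours: float) -> float:
    """Absolute change, rounded to two decimals as in the table."""
    return round(ours - baseline, 2)


def relative_change_pct(baseline: float, ours: float) -> float:
    """Relative change in percent, rounded to one decimal."""
    return round((ours - baseline) / baseline * 100, 1)


# Success-rate rows: (benchmark, baseline, this paper)
rows = [
    ("Meta-World (Push Wall)", 0.84, 1.00),
    ("Meta-World (Drawer Close)", 0.98, 1.00),
    ("ManiSkill2 (Pick Cube)", 0.68, 0.94),
    ("Meta-World (Reach)", 1.00, 1.00),
]
deltas = [delta(b, o) for _, b, o in rows]

# Token usage averaged across Meta-World tasks.
token_saving_pct = relative_change_pct(39097, 15370)
```

The token row corresponds to a reduction of roughly 61% relative to the baseline.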

Experiment Figures

Ablation study on different feedback types (w/o Process, w/o Trajectory, w/o Preference).

Main Takeaways

CARD consistently outperforms or matches SOTA baselines (Eureka, L2R) across diverse robotic manipulation tasks.
The TPE mechanism effectively reduces computational costs by filtering out poor reward functions before expensive RL training.
The framework demonstrates that iterative refinement with specific feedback (process/trajectory) is more token-efficient than parallel sampling (evolutionary) approaches.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning basics (MDP, Policy Gradients)
Large Language Models (Prompting, Code Generation)
Reward Engineering concepts

Key Terms

TPE: Trajectory Preference Evaluation—a mechanism to check if a generated reward function correctly assigns higher returns to successful trajectories than failed ones, used to filter code before training.

SAC: Soft Actor-Critic—an off-policy reinforcement learning algorithm used as the underlying RL solver for the tasks.

Meta-World: A benchmark environment for robotic manipulation tasks involving a Sawyer robot arm.

ManiSkill2: A simulation benchmark for generalizable robot manipulation skills.

Process Feedback: Feedback generated from the training curve statistics (return, success rate, sub-reward values) to guide the LLM.

Trajectory Feedback: Feedback generated by analyzing specific step-by-step details of successful and failed rollout trajectories.

Preference Feedback: Feedback provided when a reward function fails the TPE check, explaining that successful trajectories were not ranked higher than failed ones.
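Since the Evaluator is described as a deterministic script rather than an LLM, Process Feedback can be pictured as a rule-based text summary of training statistics. The sketch below is a hypothetical formatter, not the paper's template: it assumes each tracked quantity (return, success rate, sub-reward values) is available as a list of values sampled along the training curve.

```python
from typing import Dict, List


def process_feedback(curve: Dict[str, List[float]]) -> str:
    """Summarize each tracked statistic by its start, end, and peak value."""
    lines = []
    for name, values in curve.items():
        if not values:
            continue  # nothing recorded for this statistic
        lines.append(
            f"{name}: start={values[0]:.2f}, end={values[-1]:.2f}, max={max(values):.2f}"
        )
    return "\n".join(lines)
```

A text summary like this would be appended to the Coder's next prompt, giving the LLM concrete signals such as a sub-reward that never changes or a success rate that stays at zero.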