Reward Design with Language Models

📝 Paper Summary

Reward Design, Alignment, Reinforcement Learning

This paper enables users to train reinforcement learning agents on abstract objectives by using frozen large language models as proxy reward functions via natural language prompts.

Core Problem

Abstract human preferences (like 'versatility' or 'fairness') are difficult to encode mathematically as reward functions, and collecting large datasets to learn rewards instead is expensive.

Why it matters:

Hand-crafting reward functions for complex behaviors is non-intuitive and prone to reward hacking, where agents exploit specification errors.
Learning rewards typically requires large amounts of labeled expert data, which does not generalize to new users with different objectives.
Existing methods like RLHF require fine-tuning models, which is computationally expensive compared to using frozen models.

Concrete Example: A user wants a 'versatile' negotiation agent. Writing a mathematical formula for 'versatility' is hard. Collecting thousands of examples of versatile vs. non-versatile behavior is costly. Consequently, agents trained with standard rewards often fail to capture this nuance.

Key Novelty

LLM-as-a-Proxy-Reward (frozen prompting)

Uses a frozen Large Language Model (LLM) as a reward function by prompting it with a task description, a description or a few examples of the desired behavior, and a text representation of the agent's episode outcome.
Leverages the LLM's pre-trained commonsense priors about human behavior to perform zero-shot or few-shot evaluation of agent trajectories without fine-tuning.

Architecture

The RL training loop using an LLM as a proxy reward function. It illustrates how user prompts and episode outcomes are fed to the LLM to generate a binary reward.

Evaluation Highlights

Outperforms Supervised Learning (SL) baselines by an average of 46 percentage points in RL agent accuracy (0.96 vs. 0.50) when training objective-aligned agents for the complex DealOrNoDeal negotiation task.
Achieves 3.72/5 user alignment rating in a human study, significantly higher than agents trained with opposite styles (1.56/5).
Demonstrates zero-shot capability in Matrix Games, improving reward labeling accuracy by 48 percentage points (0.67 vs. 0.19) over a 'No Objective' baseline.

Breakthrough Assessment

8/10

A significant step in democratizing reward design. It effectively replaces complex reward engineering with natural language prompting, showing strong empirical results across varying task complexities.

⚙️ Technical Details

Problem Definition

Setting: Markov Decision Process (MDP) M = <S, A, p, R, gamma>, where the reward function R is replaced by an LLM-based proxy.

Inputs: A text prompt rho containing: task description, user objective (examples/description), and the text representation of an RL episode.

Outputs: A binary reward signal (0 or 1) derived from the LLM's text output.
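
The setting above can be written compactly as follows. This is a sketch in notation chosen for illustration: rho and the output mapping g follow the summary, while the episode-to-text parser f, the subscripted prompt parts, the trajectory symbol tau, and the hat on R are assumptions introduced here, not the paper's exact notation.

```latex
\rho \;=\; \rho_{\text{task}} \,\Vert\, \rho_{\text{obj}} \,\Vert\, f(\tau),
\qquad
\hat{R}(\tau) \;=\; g\big(\mathrm{LLM}(\rho)\big) \in \{0, 1\}
```

Here \Vert denotes string concatenation, \tau is an RL episode, and the agent is trained in M = <S, A, p, R, gamma> with R replaced by \hat{R}.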

Pipeline Flow

User defines objective via Prompt (Description/Examples)
RL Agent interacts with Environment -> Generates Trajectory
Trajectory Parser converts state/action to Text
LLM receives Prompt + Trajectory Text -> Outputs Assessment ('Yes'/'No')
Reward Parser converts Assessment to Integer Reward
RL Agent updates policy using Integer Reward
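
The six steps above can be sketched end to end in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: `stub_llm` is a hypothetical rule-based stand-in for the frozen LLM (no API call is made), the prompt wording and the "fair means an offer of $4 to $6" rule are invented for the example, and a tabular epsilon-greedy value update stands in for the DQN used in the paper.

```python
import random


def episode_to_text(offer: int, total: int = 10) -> str:
    """Trajectory parser: describe the episode outcome as text."""
    return f"The proposer keeps ${total - offer} and offers ${offer} to the responder."


def stub_llm(prompt: str) -> str:
    """Hypothetical stand-in for the frozen LLM (assumption, no real model).

    Reads the offered amount back out of the prompt and answers 'Yes'
    when the offer is between $4 and $6, otherwise 'No'.
    """
    offer = int(prompt.rsplit("offers $", 1)[1].split(" ")[0])
    return "Yes" if 4 <= offer <= 6 else "No"


def reward_from_text(answer: str) -> int:
    """Reward parser: map the text assessment to a binary reward."""
    return 1 if answer.strip().lower().startswith("yes") else 0


def train(num_episodes: int = 500, epsilon: float = 0.2,
          lr: float = 0.1, seed: int = 0) -> list:
    """Single-step training loop driven only by the proxy reward."""
    rng = random.Random(seed)
    objective = "The user wants the proposer to make fair offers."
    question = "Is this outcome aligned with the user's objective?"
    q = [0.0] * 11  # one value estimate per possible offer, $0..$10
    for _ in range(num_episodes):
        # Epsilon-greedy action selection over the 11 possible offers.
        if rng.random() < epsilon:
            offer = rng.randrange(11)
        else:
            offer = max(range(11), key=lambda a: q[a])
        # Prompt + trajectory text -> assessment -> integer reward.
        prompt = f"{objective}\n{episode_to_text(offer)}\n{question}"
        reward = reward_from_text(stub_llm(prompt))
        # Move the value estimate for this offer toward the observed reward.
        q[offer] += lr * (reward - q[offer])
    return q


if __name__ == "__main__":
    values = train()
    print("Preferred offer:", max(range(11), key=lambda a: values[a]))
```

In a real setup, `stub_llm` would be replaced by a call to the frozen model, and everything else in the loop would stay the same, which is the point of treating the LLM as a drop-in reward function.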

System Modules

Prompt Construction

Combine task description, user objective, and episode outcome into a single text prompt

Model or implementation: Template-based concatenation
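
Template-based concatenation could look like the sketch below. The function names, the few-shot example format, the sample task text, and the closing question are all assumptions made for illustration; the actual templates are the ones in the paper's appendix.

```python
from typing import List, Tuple


def format_examples(examples: List[Tuple[str, str]]) -> str:
    """Render few-shot (episode text, label) pairs, one per line."""
    return "\n".join(f"{episode} Answer: {label}" for episode, label in examples)


def build_prompt(task_description: str, objective: str, episode_text: str,
                 question: str = "Does this outcome satisfy the user's objective?") -> str:
    """Concatenate the prompt parts in a fixed order, skipping empty parts."""
    parts = [task_description, objective, episode_text, question]
    return "\n\n".join(part.strip() for part in parts if part.strip())


# Hypothetical few-shot objective for a fairness-seeking user.
examples = [
    ("The proposer keeps $5 and offers $5.", "Yes"),
    ("The proposer keeps $9 and offers $1.", "No"),
]
prompt = build_prompt(
    task_description="Two players split $10. The proposer makes an offer.",
    objective=format_examples(examples),
    episode_text="The proposer keeps $6 and offers $4.",
)
print(prompt)
```

For a zero-shot prompt, the `objective` argument would carry a plain description of the objective instead of formatted examples.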

Proxy Reward (Evaluation)

Evaluate if the agent's behavior matches the user's objective

Model or implementation: GPT-3 (text-davinci-002)

Reward Parser (Evaluation)

Convert LLM text output into a numerical reward signal

Model or implementation: Handcrafted mapping g
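
A handcrafted mapping g for 'Yes'/'No' outputs might be as simple as the sketch below. The function name and the choice to return a default of 0 for unparseable output are assumptions for illustration; the summary only states that the mapping is handcrafted and task-specific.

```python
def parse_reward(llm_output: str, default: int = 0) -> int:
    """Map the LLM's text assessment to a binary reward.

    Looks only at the first word, ignoring case and trailing punctuation.
    Anything other than yes/no falls back to `default` (an assumption).
    """
    tokens = llm_output.strip().lower().split()
    if not tokens:
        return default
    first = tokens[0].strip(".,!:;\"'")
    if first == "yes":
        return 1
    if first == "no":
        return 0
    return default


print(parse_reward("Yes"))                       # 1
print(parse_reward("No, the split is unfair."))  # 0
```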

RL Agent

Learn policy to maximize the proxy reward

Model or implementation: DQN (Ultimatum/Matrix) or On-policy RL (DealOrNoDeal)

Novel Architectural Elements

Substitution of the scalar reward function R in an MDP with a frozen LLM prompted by natural language and examples
Feedback loop where the LLM's inference output is parsed directly into a reward signal for RL training without fine-tuning the LLM

Modeling

Base Model: GPT-3 (text-davinci-002)

Comparison to Prior Work

vs. RLHF: Uses frozen LLMs and in-context learning (few/zero-shot) instead of fine-tuning on large datasets
vs. Language-Guided RL: Focuses on high-level behavioral properties (e.g., fairness, versatility) rather than specific navigational subtasks
vs. Supervised Learning Baseline: Outperforms SL trained on the same small number of examples, demonstrating better data efficiency

Limitations

Requires prompt engineering; performance depends on the quality of descriptions and examples.
Parsing LLM output into rewards is currently done with handcrafted, task-specific parsers.
Only produces binary rewards (0 or 1) in the current implementation, ignoring probability information.
Relies on the capabilities and priors of the specific LLM used (GPT-3).

Reproducibility

Prompt templates and examples are provided in the appendix (Figs 10-13). The paper states 'Code and prompts can be found here' but provides no URL in the text. Model used is GPT-3 text-davinci-002 (API-based, closed source). Training hyperparameters (steps, seeds) are listed.

📊 Experiments & Results

Evaluation Setup

RL agents trained using LLM-generated rewards on three domains: Ultimatum Game, Matrix Games, and DealOrNoDeal negotiation.

Benchmarks:

Ultimatum Game: resource division game (fairness / inequity aversion)
Matrix Games: 2-player normal-form games (Battle of Sexes, Stag Hunt, etc.)
DealOrNoDeal: complex, long-horizon negotiation dialogue

Metrics:

Labeling Accuracy (LLM vs Ground Truth Reward)
RL Agent Accuracy (Policy vs Ground Truth Objective)
User Rating (Likert scale 1-5)
Statistical methodology: Means are reported across 3 random seeds. The pilot user study reports statistically significant differences (p < 0.001).
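
Labeling accuracy, as described above, compares the LLM's rewards with ground-truth rewards on the same episodes. A minimal sketch, assuming both are available as equal-length lists of 0/1 labels (the function name and the sample labels are illustrative, not data from the paper):

```python
from typing import Sequence


def labeling_accuracy(llm_rewards: Sequence[int],
                      true_rewards: Sequence[int]) -> float:
    """Fraction of episodes where the LLM's reward equals the true reward."""
    if len(llm_rewards) != len(true_rewards):
        raise ValueError("Both label sequences must have the same length.")
    if not llm_rewards:
        raise ValueError("At least one labeled episode is required.")
    matches = sum(int(a == b) for a, b in zip(llm_rewards, true_rewards))
    return matches / len(llm_rewards)


# Made-up labels for four episodes: the LLM disagrees on the third one.
print(labeling_accuracy([1, 0, 1, 1], [1, 0, 0, 1]))  # 0.75
```

RL agent accuracy can be computed the same way, with the trained policy's outcomes scored against the ground-truth objective instead of the LLM's labels.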

Key Results

DealOrNoDeal: benefit of LLM rewards over Supervised Learning baselines in a complex, few-shot setting.
Benchmark	Metric	Baseline	This Paper	Δ
DealOrNoDeal	RL Agent Accuracy	0.50	0.96	+0.46
DealOrNoDeal	RL Agent Accuracy	1.00	0.96	-0.04

Matrix Games: zero-shot capability.
Benchmark	Metric	Baseline	This Paper	Δ
Matrix Games	Labeling Accuracy (Regular Order)	0.19	0.67	+0.48
Matrix Games	RL Agent Accuracy	0.54	0.88	+0.34

Human evaluation results.
Benchmark	Metric	Baseline	This Paper	Δ
DealOrNoDeal (Pilot)	User Alignment Rating (1-5)	1.56	3.72	+2.16

Experiment Figures

Comparison of labeling accuracy and RL agent accuracy in the Ultimatum Game for Few-shot settings (10 examples vs 1 example).

Performance on DealOrNoDeal negotiation task: Labeling accuracy, RL agent accuracy, and Pilot User Study ratings.

Main Takeaways

LLMs can effectively serve as proxy reward functions using only a few examples (few-shot) or just a description (zero-shot), outperforming supervised baselines on limited data.
In the Ultimatum Game, a single example *with explanation* allows the LLM to maintain high accuracy, whereas supervised models fail, highlighting the data efficiency of in-context learning.
For well-known concepts (Matrix Games), LLMs can produce objective-aligned rewards zero-shot. However, scrambling the input format (order of actions/payoffs) reduces performance, which suggests reliance on the pre-training distribution.
The framework scales to longer-horizon tasks (DealOrNoDeal), producing agents that align with human-perceived styles (e.g., 'Stubborn', 'Versatile') better than baselines.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (MDPs, Rewards)
Large Language Models (In-context learning, Prompting)
Game Theory (Ultimatum Game, Nash Equilibrium)

Key Terms

Proxy Reward Function: A function (here, an LLM) used to approximate the true, desired reward signal when the ground truth reward is hard to specify.

In-context Learning: The ability of a language model to perform a task given only a few examples in its input prompt, without updating its weights.

Zero-shot Prompting: Asking the model to perform a task using only a description, without providing any specific examples.

Few-shot Prompting: Providing the model with a small number (e.g., 1-10) of input-output examples to guide its behavior.

Ultimatum Game: A game where a Proposer offers a split of resources and a Responder accepts or rejects it; often used to study fairness.

Pareto-optimality: A state where no individual's situation can be improved without making another individual's situation worse.

RLHF: Reinforcement Learning from Human Feedback—a method to fine-tune models using rewards learned from human preference data.

DQN: Deep Q-Network—a value-based reinforcement learning algorithm that uses deep neural networks to estimate Q-values.

Parsers: Functions defined in this paper to convert environment states to text strings (input to LLM) and LLM text outputs to integers (reward for RL).
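
To make the Pareto-optimality definition in the key terms above concrete, the sketch below lists the Pareto-optimal outcomes of a 2-player matrix game. The payoff values and action labels are illustrative and are not taken from the paper.

```python
from typing import Dict, List, Tuple

Payoff = Tuple[float, float]  # (player 1 payoff, player 2 payoff)


def dominates(a: Payoff, b: Payoff) -> bool:
    """True if `a` is at least as good for both players and strictly better for one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_optimal(outcomes: Dict[str, Payoff]) -> List[str]:
    """Names of outcomes that no other outcome dominates."""
    return [
        name
        for name, payoff in outcomes.items()
        if not any(dominates(other, payoff) for other in outcomes.values())
    ]


# Illustrative payoffs (assumption): C = cooperate, D = defect.
game = {
    "C,C": (3, 3),
    "C,D": (0, 5),
    "D,C": (5, 0),
    "D,D": (1, 1),
}
print(pareto_optimal(game))  # ['C,C', 'C,D', 'D,C']
```

Only "D,D" is excluded, because moving to "C,C" improves both players' payoffs at once.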