Evaluation Setup
Survey of the existing literature, spanning multiple application domains (no new experiments)
Benchmarks:
- Various Control Tasks (Continuous Control / Robotics)
- LLM Fine-tuning (Language Modeling)
- Atari Games (Discrete Control)
Metrics:
- Alignment with human intent
- Sample efficiency (number of human queries)
- Robustness to reward hacking
- Statistical methodology: Not explicitly reported in the paper
Main Takeaways
- RLHF has successfully scaled from simple control tasks to complex LLM fine-tuning, validating the reward modeling approach.
- The field is moving toward combining multiple feedback types (e.g., demonstrations with preferences) to exploit their complementary strengths.
- Active learning and query synthesis are critical for making RLHF practical by reducing the volume of feedback required from humans.
- Key open challenges are a theoretical understanding of when and why RLHF works, and preventing agents from exploiting errors in human judgment (reward hacking).
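The reward-modeling approach mentioned in the takeaways can be made concrete with a small sketch. The following is a minimal, illustrative example (not from the paper) of fitting a reward model from pairwise human preferences with the Bradley-Terry loss, the standard objective in RLHF reward modeling; the linear model, synthetic data, and learning rate are assumptions chosen for brevity.

```python
import numpy as np

# Minimal sketch: a linear reward model r(x) = w.x trained on pairwise
# preferences with the Bradley-Terry loss. All data here is synthetic.

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bt_loss_and_grad(w, preferred, rejected):
    """Bradley-Terry loss: -log P(preferred > rejected) under r(x) = w.x."""
    margin = preferred @ w - rejected @ w            # r(x_w) - r(x_l), per pair
    p = sigmoid(margin)                              # model's P(preferred wins)
    loss = -np.log(p + 1e-12).mean()
    # d/dw of -log sigmoid(margin) = -(1 - p) * (x_w - x_l)
    grad = -((1.0 - p)[:, None] * (preferred - rejected)).mean(axis=0)
    return loss, grad

# Synthetic feedback: a hidden true reward direction; the "human" prefers
# whichever item scores higher under it.
d = 5
w_true = rng.normal(size=d)
a = rng.normal(size=(200, d))
b = rng.normal(size=(200, d))
prefer_a = (a @ w_true) > (b @ w_true)
preferred = np.where(prefer_a[:, None], a, b)
rejected = np.where(prefer_a[:, None], b, a)

w = np.zeros(d)
for _ in range(500):
    loss, grad = bt_loss_and_grad(w, preferred, rejected)
    w -= 0.5 * grad                                  # plain gradient descent

# The learned reward should rank most pairs the same way the true reward does.
acc = np.mean((preferred @ w) > (rejected @ w))
print(f"final loss {loss:.3f}, pairwise accuracy {acc:.2f}")
```

The same loss scales from this toy linear model up to LLM fine-tuning, where `w.x` is replaced by a learned network scoring whole responses.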
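The active-learning takeaway can likewise be sketched. One common recipe (assumed here for illustration, not taken from the paper) is disagreement-based query selection: keep an ensemble of reward models and ask the human only about the candidate pair on which the ensemble's preference predictions differ most. The linear ensemble and synthetic candidates below are stand-in assumptions.

```python
import numpy as np

# Illustrative sketch of disagreement-based active query selection:
# query the pair with the highest variance in predicted preference
# across an ensemble of (here, linear) reward models.

rng = np.random.default_rng(1)
d, n_models, n_candidates = 5, 4, 50

# Stand-in for an ensemble already fit on different bootstraps of feedback.
ensemble = rng.normal(size=(n_models, d))

# Candidate pairs (a_i, b_i) the agent could show to the human.
a = rng.normal(size=(n_candidates, d))
b = rng.normal(size=(n_candidates, d))

def preference_probs(ensemble, a, b):
    """P(a > b) under each ensemble member's Bradley-Terry model."""
    margins = a @ ensemble.T - b @ ensemble.T        # (candidates, models)
    return 1.0 / (1.0 + np.exp(-margins))

probs = preference_probs(ensemble, a, b)
disagreement = probs.var(axis=1)                     # spread across members
query_idx = int(np.argmax(disagreement))             # most informative pair

print(f"query pair {query_idx}, disagreement {disagreement[query_idx]:.3f}")
```

By spending each human query where the models disagree, the agent reduces the total volume of feedback needed, which is exactly the sample-efficiency metric listed above.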