
A Survey of Reinforcement Learning from Human Feedback

Timo Kaufmann, Paul Weng, Viktor Bengs, Eyke Hüllermeier
arXiv.org (2023)

📝 Paper Summary

Topics: Reinforcement Learning from Human Feedback (RLHF) · Preference-based Reinforcement Learning (PbRL) · AI Alignment
This survey unifies Preference-based RL and RLHF into a single framework, categorizing methods by feedback type, reward learning, and policy optimization across robotics and language domains.
Core Problem
Designing effective reward functions is difficult due to sparse signals and the risk of spurious correlations, while learning from demonstrations is limited by human performance and data availability.
Why it matters:
  • Manually engineered rewards often lead to 'reward hacking,' where agents maximize the measured reward without achieving the intended goal (e.g., a vacuum cleaner maximizing dust collection by dumping and re-collecting the same dust)
  • In safety-critical domains like healthcare or autonomous driving, misaligned rewards can cause physical harm or dangerous behavior
  • Inverse RL struggles to outperform human demonstrators and fails entirely when demonstrations are difficult or impossible to provide
Concrete Example: In gaming environments, agents have been observed prematurely exiting games to avoid negative rewards or exploiting simulation bugs for points. In safety contexts, a care robot optimizing a poor reward function could cause injuries while technically maximizing its score.
Key Novelty
Unification of PbRL and RLHF
  • Proposes that RLHF is a generalization of Preference-based RL (PbRL); whereas PbRL focused on relative feedback (rankings), RLHF includes broader feedback types
  • Decomposes the RLHF problem into three distinct components: Feedback Collection, Reward Model Learning, and Policy Optimization
  • Extends the survey scope beyond LLMs to include foundational techniques from control theory and robotics, which are often overlooked in recent literature
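The reward-model-learning component of this decomposition is commonly instantiated with a Bradley-Terry model over pairwise preferences: the probability that a human prefers segment A over segment B is the sigmoid of the difference in predicted returns. The sketch below (all data, names, and the linear reward parameterization are illustrative, not taken from the survey) fits such a model to simulated preferences by gradient descent on the negative log-likelihood:

```python
import numpy as np

def bradley_terry_nll(theta, seg_a, seg_b, prefs):
    """Negative log-likelihood of pairwise preferences under a
    linear reward model r(s) = theta . phi(s).

    seg_a, seg_b: (n_pairs, dim) summed feature vectors of the two
    trajectory segments in each query; prefs[i] = 1.0 if segment A
    was preferred, 0.0 if segment B was.
    """
    # Bradley-Terry: P(A preferred) = sigmoid(return_A - return_B).
    p_a = 1.0 / (1.0 + np.exp(-((seg_a - seg_b) @ theta)))
    eps = 1e-12  # numerical safety for the logs
    return -np.mean(prefs * np.log(p_a + eps)
                    + (1 - prefs) * np.log(1 - p_a + eps))

# Toy data: a hypothetical "human" prefers whichever segment has the
# higher return under the (unknown to the learner) weights [1.0, -0.5].
rng = np.random.default_rng(0)
true_theta = np.array([1.0, -0.5])
seg_a = rng.normal(size=(200, 2))
seg_b = rng.normal(size=(200, 2))
prefs = ((seg_a - seg_b) @ true_theta > 0).astype(float)

# Plain gradient descent on the Bradley-Terry log-likelihood.
theta = np.zeros(2)
lr = 0.5
for _ in range(200):
    p_a = 1.0 / (1.0 + np.exp(-((seg_a - seg_b) @ theta)))
    grad = -((prefs - p_a)[:, None] * (seg_a - seg_b)).mean(axis=0)
    theta -= lr * grad
```

After fitting, the learned weights recover the sign pattern of the true reward, and the learned reward can then be handed to any standard policy-optimization algorithm, which is exactly the modularity the survey's three-component decomposition emphasizes.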
Evaluation Highlights
  • The survey reviews applications across diverse domains including continuous control, robotics, image generation, and LLM fine-tuning
  • Highlights methodological advances in query efficiency (active learning) and feedback fusion (combining multiple feedback types)
  • Identifies that RLHF addresses alignment issues better than inverse RL by allowing iterative refinement of objectives
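One common way the query-efficiency methods surveyed here work is via ensemble disagreement: train several reward models and ask the human only about the segment pairs on which the models' preference predictions diverge most. A minimal sketch, assuming a hypothetical `select_queries` helper and synthetic ensemble predictions (nothing here is from the survey itself):

```python
import numpy as np

def select_queries(ensemble_returns, k=2):
    """Pick the k candidate segment pairs on which an ensemble of
    reward models disagrees most about the human's preference.

    ensemble_returns: (n_models, n_pairs, 2) predicted returns for
    segments A and B of each candidate pair, one row per model.
    """
    # Each model's Bradley-Terry probability that A is preferred.
    diff = ensemble_returns[:, :, 0] - ensemble_returns[:, :, 1]
    p_a = 1.0 / (1.0 + np.exp(-diff))
    # Disagreement = variance of P(A preferred) across the ensemble;
    # high-variance pairs are the most informative to label.
    disagreement = p_a.var(axis=0)
    return np.argsort(disagreement)[::-1][:k]

rng = np.random.default_rng(1)
# Five near-identical models over ten candidate pairs...
preds = rng.normal(scale=0.1, size=(5, 10, 2))
# ...except pair 3, on which the models split sharply.
preds[:, 3, 0] = np.array([-4.0, -2.0, 0.0, 2.0, 4.0])
preds[:, 3, 1] = 0.0
chosen = select_queries(preds, k=2)
```

Here `chosen` ranks pair 3 first, since it is the only pair on which the ensemble genuinely disagrees; labeling it would shrink the reward model's uncertainty the most per human query.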
Breakthrough Assessment
8/10
A comprehensive foundational survey that clarifies the confused terminology between PbRL and RLHF and bridges the gap between modern LLM techniques and older control theory research.