Evaluation Setup
Survey of the existing literature, spanning multiple application domains (no new experiments)
Benchmarks:
- Various Control Tasks (Continuous Control / Robotics)
- LLM Fine-tuning (Language Modeling)
- Atari Games (Discrete Control)
Metrics:
- Alignment with human intent
- Sample efficiency (number of human queries)
- Robustness to reward hacking
- Statistical methodology: Not explicitly reported in the paper
Main Takeaways
- RLHF has successfully scaled from simple control tasks to complex LLM fine-tuning, validating the reward modeling approach.
- The field is moving toward combining multiple feedback types (e.g., demonstrations with preferences) to exploit their complementary strengths.
- Active learning and query synthesis are critical for making RLHF practical by reducing the volume of feedback required from humans.
- Key open challenges are a theoretical understanding of when and why RLHF works, and preventing agents from exploiting errors in human judgment (reward hacking).
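The reward-modeling approach mentioned in the takeaways can be made concrete with a small sketch. The following is a minimal, illustrative example (not from the paper) of fitting a reward model from pairwise human preferences with the Bradley-Terry loss, the standard objective in RLHF reward modeling; the linear model, synthetic data, and learning rate are assumptions chosen for brevity.

```python
import numpy as np

# Minimal sketch: a linear reward model r(x) = w.x trained on pairwise
# preferences with the Bradley-Terry loss. All data here is synthetic.

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bt_loss_and_grad(w, preferred, rejected):
    """Bradley-Terry loss: -log P(preferred > rejected) under r(x) = w.x."""
    margin = preferred @ w - rejected @ w            # r(x_w) - r(x_l), per pair
    p = sigmoid(margin)                              # model's P(preferred wins)
    loss = -np.log(p + 1e-12).mean()
    # d/dw of -log sigmoid(margin) = -(1 - p) * (x_w - x_l)
    grad = -((1.0 - p)[:, None] * (preferred - rejected)).mean(axis=0)
    return loss, grad

# Synthetic feedback: a hidden true reward direction; the "human" prefers
# whichever item scores higher under it.
d = 5
w_true = rng.normal(size=d)
a = rng.normal(size=(200, d))
b = rng.normal(size=(200, d))
prefer_a = (a @ w_true) > (b @ w_true)
preferred = np.where(prefer_a[:, None], a, b)
rejected = np.where(prefer_a[:, None], b, a)

w = np.zeros(d)
for _ in range(500):
    loss, grad = bt_loss_and_grad(w, preferred, rejected)
    w -= 0.5 * grad                                  # plain gradient descent

# The learned reward should rank most pairs the same way the true reward does.
acc = np.mean((preferred @ w) > (rejected @ w))
print(f"final loss {loss:.3f}, pairwise accuracy {acc:.2f}")
```

The same loss scales from this toy linear model up to LLM fine-tuning, where `w.x` is replaced by a learned network scoring whole responses.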
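The active-learning takeaway can likewise be sketched. One common recipe (assumed here for illustration, not taken from the paper) is disagreement-based query selection: keep an ensemble of reward models and ask the human only about the candidate pair on which the ensemble's preference predictions differ most. The linear ensemble and synthetic candidates below are stand-in assumptions.

```python
import numpy as np

# Illustrative sketch of disagreement-based active query selection:
# query the pair with the highest variance in predicted preference
# across an ensemble of (here, linear) reward models.

rng = np.random.default_rng(1)
d, n_models, n_candidates = 5, 4, 50

# Stand-in for an ensemble already fit on different bootstraps of feedback.
ensemble = rng.normal(size=(n_models, d))

# Candidate pairs (a_i, b_i) the agent could show to the human.
a = rng.normal(size=(n_candidates, d))
b = rng.normal(size=(n_candidates, d))

def preference_probs(ensemble, a, b):
    """P(a > b) under each ensemble member's Bradley-Terry model."""
    margins = a @ ensemble.T - b @ ensemble.T        # (candidates, models)
    return 1.0 / (1.0 + np.exp(-margins))

probs = preference_probs(ensemble, a, b)
disagreement = probs.var(axis=1)                     # spread across members
query_idx = int(np.argmax(disagreement))             # most informative pair

print(f"query pair {query_idx}, disagreement {disagreement[query_idx]:.3f}")
```

By spending each human query where the models disagree, the agent reduces the total volume of feedback needed, which is exactly the sample-efficiency metric listed above.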