
RLBFF: Binary Flexible Feedback to bridge between Human Feedback & Verifiable Rewards

Zhilin Wang, Jiaqi Zeng, Olivier Delalleau, Ellie Evans, Daniel Egert, Hoo-Chang Shin, Felipe Soares, Yi Dong, Oleksii Kuchaiev
NVIDIA
arXiv (2025)
Tags: RL, Benchmark, Factuality, Reasoning

📝 Paper Summary

Topics: Reward Modeling, Reinforcement Learning from Human Feedback (RLHF), Reinforcement Learning with Verifiable Rewards (RLVR)
RLBFF aligns language models by converting natural language feedback into thousands of fine-grained binary principles (yes/no), combining the versatility of human preferences with the precision of verifiable rewards.
Core Problem
RLHF suffers from interpretability issues and reward hacking due to vague criteria, while RLVR is limited to narrow domains with strictly verifiable correctness (like math/code), leaving a gap for nuanced but precise feedback.
Why it matters:
  • Standard reward models (Bradley-Terry) produce uncalibrated scores that lack explanation, making it hard to know *why* a model prefers a response
  • Human feedback often relies on implicit principles (e.g., 'hilarious' vs. 'correct'), and optimizing without making those principles explicit makes training less effective
  • Likert scales (1-5) are hard to calibrate across different annotators, leading to noisy training signals
Concrete Example: A verifier might reject a correct answer like '180 minutes' if the reference is '3 hours' (low recall). Conversely, an RLHF model might reward a response just for being long, even if incorrect (reward hacking). RLBFF defines explicit binary checks like 'Is the response concise? (Yes/No)' to avoid these pitfalls.
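The failure modes in the example above can be sketched in code. This is a hypothetical illustration (the function names and unit-normalization logic are assumptions, not from the paper): a strict string-match verifier rejects a correct but differently phrased answer, while a binary principle check that normalizes units accepts it.

```python
import re
from typing import Optional

def strict_match(response: str, reference: str) -> bool:
    """RLVR-style verifier: exact string comparison (low recall)."""
    return response.strip().lower() == reference.strip().lower()

def to_minutes(text: str) -> Optional[float]:
    """Normalize a duration like '3 hours' or '180 minutes' to minutes."""
    m = re.match(r"\s*([\d.]+)\s*(hour|minute)s?\s*$", text.lower())
    if not m:
        return None
    value, unit = float(m.group(1)), m.group(2)
    return value * 60 if unit == "hour" else value

def duration_correct(response: str, reference: str) -> bool:
    """Binary principle check: 'Does the response state the correct duration? (Yes/No)'."""
    a, b = to_minutes(response), to_minutes(reference)
    return a is not None and a == b

print(strict_match("180 minutes", "3 hours"))      # False: correct answer rejected
print(duration_correct("180 minutes", "3 hours"))  # True: binary check passes
```

The point is not the regex itself but the interface: the check answers one explicit yes/no question, so a pass or fail is interpretable, unlike an opaque scalar preference score.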
Key Novelty
Reinforcement Learning with Binary Flexible Feedback (RLBFF)
  • Extracts over 1,000 distinct principles (e.g., 'clarity', 'code readability') from human-written feedback using an LLM, converting qualitative comments into binary traits
  • Trains a Reward Model to predict whether a response satisfies a specific principle (framed as an entailment task) rather than predicting a generic preference score
  • Allows users to specify or swap principles at inference time to customize model behavior, unlike static Bradley-Terry models
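The principle-conditioned reward interface described above can be sketched as follows. This is a minimal illustration under assumed names: `ENTAILMENT_TEMPLATE`, `principle_reward`, and the toy scorer are hypothetical stand-ins, not the paper's actual prompt format or model. The key property is that the principle is a runtime argument, so it can be swapped at inference time, unlike a static Bradley-Terry model.

```python
from typing import Callable

# Assumed prompt format for the yes/no entailment query
ENTAILMENT_TEMPLATE = (
    "Prompt: {prompt}\n"
    "Response: {response}\n"
    "Does the response satisfy the principle '{principle}'? Answer Yes or No."
)

def principle_reward(
    prompt: str,
    response: str,
    principle: str,
    p_yes_fn: Callable[[str], float],
) -> float:
    """Reward = the RM's estimated P('Yes') for this (response, principle) pair."""
    query = ENTAILMENT_TEMPLATE.format(
        prompt=prompt, response=response, principle=principle
    )
    return p_yes_fn(query)

def toy_p_yes(query: str) -> float:
    """Toy stand-in for a trained RM's P('Yes'): rewards short responses only."""
    response = query.split("Response: ")[1].split("\n")[0]
    return 1.0 if len(response.split()) <= 20 else 0.0

# Same response, different principle string passed at inference time
r = principle_reward("Summarize RLBFF.", "A short answer.", "conciseness", toy_p_yes)
print(r)  # 1.0 under the toy scorer
```

In practice `p_yes_fn` would query the trained reward model for the probability of the "Yes" token; the toy function here only illustrates the calling convention.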
Evaluation Highlights
  • Achieves 81.4% on JudgeBench, ranking #1 on the leaderboard as of September 24, 2025
  • Outperforms Bradley-Terry models on RM-Bench with a score of 86.2% when trained on matched data
  • Aligns Qwen3-32B to match or exceed proprietary models like o3-mini and DeepSeek R1 on MT-Bench, WildBench, and Arena Hard v2
Breakthrough Assessment
9/10
Significantly bridges the gap between RLHF and RLVR by successfully scaling binary verification to general domains. The release of a #1 leaderboard reward model and a full alignment recipe strengthens the contribution.