Chenghua Huang, Zhizhen Fan, Lu Wang, Fangkai Yang, Pu Zhao, Zeqi Lin, Qingwei Lin, Dongmei Zhang, S. Rajmohan, Qi Zhang
School of Computer Science, Fudan University,
School of Computer Science, Peking University,
Microsoft
arXiv.org
(2024)
📝 Paper Summary
Tags: Reinforcement Learning from Human Feedback (RLHF), Reinforcement Learning from AI Feedback (RLAIF)
A self-evolved reward learning framework iteratively refines a reward model using its own high-confidence feedback on unlabeled data, achieving strong performance with minimal human labels.
Core Problem
Training reliable reward models for RLHF typically requires massive amounts of high-quality human preference data, which is expensive, limited, and hard to scale.
Why it matters:
The scalability of strong LLMs is bottlenecked by the scarcity and cost of human-annotated preference data required for alignment
Current RLAIF methods often rely on stronger, external LLMs (like GPT-4) for feedback, rather than allowing a model to self-improve efficiently
Quality of reward models directly dictates the success of reinforcement learning strategies like PPO; poor reward models lead to poor policy optimization
Concrete Example: In standard RLHF, if a developer has only 10% of the necessary human labels, the resulting reward model will be too noisy to guide PPO effectively, causing the LLM to generate misaligned or low-quality responses.
Key Novelty
Self-Evolved Reward Learning (SER)
Iterative 'feedback-then-train' loop where the Reward Model (RM) labels its own data and retrains on high-confidence samples
Curriculum-style learning status detection: The RM first learns to distinguish 'good vs. bad' (Status 1) before progressing to finer-grained comparisons between similar answers (Status 2)
Adaptive data filtering that selects training samples based on the model's current capability (distinguishing broad quality vs. nuanced differences)
Architecture
Figure: the Self-Evolved Reward Learning (SER) pipeline, depicting the iterative loop of self-labeling, status identification, data filtering, and retraining.
Evaluation Highlights
Achieves performance comparable to models trained on full human datasets using only 15% of the annotated seed data
Improves model performance by an average of 7.88% compared to seed models trained on limited human data
Convergence analysis shows the method can surpass performance of models trained on the entire human-annotated dataset after multiple iterations
Breakthrough Assessment
8/10
Significantly reduces reliance on human data (using only 15%) while maintaining or exceeding full-data performance. The curriculum-based self-labeling strategy is a robust contribution to data-efficient alignment.
⚙️ Technical Details
Problem Definition
Setting: Training a Reward Model (RM) to predict preference scores for LLM responses, subsequently used to align an LLM policy via PPO.
Inputs: Input prompt Q and a pair of candidate responses (A1, A2)
Outputs: Scalar reward score representing the quality/preference of the response
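The pairwise setup above fits the Bradley-Terry preference model that scalar reward training typically assumes (an assumption here; the summarized text does not name the exact parameterization): the two scalar rewards induce a probability that A1 is preferred over A2.

```python
import math

def preference_probability(reward_a1: float, reward_a2: float) -> float:
    """P(A1 preferred over A2) = sigmoid(r1 - r2) under a Bradley-Terry model."""
    return 1.0 / (1.0 + math.exp(-(reward_a1 - reward_a2)))

# A response scored 2.0 vs. one scored 0.0 is preferred with ~88% probability.
print(round(preference_probability(2.0, 0.0), 2))  # 0.88
```

Equal rewards give exactly 0.5, i.e., the model is indifferent between the two responses.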
Pipeline Flow
Seed RM Training (Human Data)
Self-Labeling (Unlabeled Data)
Status Identification & Data Filtering
RM Retraining (Pairwise Loss)
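The four stages above can be sketched as a toy numeric loop. Everything below (`ToyRM`, the 1-D "features", the learning rate) is invented for illustration; only the feedback-then-train structure and the 0.55/0.45 confidence thresholds come from the paper.

```python
import math
import random

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

class ToyRM:
    """Toy stand-in for an LLM reward model: reward = w * feature."""
    def __init__(self, w: float = 0.5):
        self.w = w

    def reward(self, x: float) -> float:
        return self.w * x

    def prefer_prob(self, x_a: float, x_b: float) -> float:
        """P(A preferred over B) under a Bradley-Terry pairwise model."""
        return sigmoid(self.reward(x_a) - self.reward(x_b))

def ser_iteration(rm, unlabeled_pairs, hi=0.55, lo=0.45, lr=0.5):
    """One feedback-then-train step: self-label unlabeled pairs, keep only
    high-confidence ones, and retrain on them with the pairwise loss."""
    confident = []
    for x_a, x_b in unlabeled_pairs:
        p = rm.prefer_prob(x_a, x_b)
        if p > hi:                        # confident A beats B
            confident.append((x_a, x_b))
        elif p < lo:                      # confident B beats A
            confident.append((x_b, x_a))
        # pairs with p in [lo, hi] are too ambiguous for this round
    for winner, loser in confident:
        # gradient step on -log sigmoid(r_w - r_l) with respect to w
        p = rm.prefer_prob(winner, loser)
        rm.w += lr * (1.0 - p) * (winner - loser)
    return rm, len(confident)

random.seed(0)
# toy features: "good" answers in (0, 1), "bad" answers in (-1, 0)
pairs = [(random.uniform(0.0, 1.0), random.uniform(-1.0, 0.0)) for _ in range(20)]
rm = ToyRM(w=0.5)                         # seed RM (trained on the ~15% human labels)
for _ in range(3):                        # self-evolution iterations
    rm, n_used = ser_iteration(rm, pairs)
```

Because the filtered pairs reinforce the seed model's (mostly correct) ordering, `rm.w` grows across iterations and more pairs clear the confidence band, which is the self-evolution effect the paper exploits.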
System Modules
Seed Reward Model
Provide initial noisy supervision
Model or implementation: Mistral / Llama 3 (exact size depends on experiment)
Status Identifier
Determine if RM should focus on easy (good vs bad) or hard (similar quality) samples
Model or implementation: Heuristic logic based on prediction statistics
Data Filter
Select high-confidence samples based on current status
Model or implementation: Rule-based filter
Evolved Reward Model
Learn improved preference representations
Model or implementation: Same architecture as Seed RM
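The Status Identifier and Data Filter modules above could plausibly look like the sketch below, using the thresholds (0.55, 0.45, 0.3) the paper reports. The exact rules are not given in the summarized text, so the logic here, switching on the fraction of ambiguous pairs and adjusting the confidence bands, is a guess at the mechanism, not the paper's implementation.

```python
def identify_status(pref_probs, ambiguous_frac=0.3):
    """Hypothetical rule: if the RM is still undecided (preference probability
    near 0.5) on a large fraction of pairs, stay in Status 1 and learn coarse
    good-vs-bad distinctions; otherwise advance to Status 2 and work on
    fine-grained comparisons. The 0.55/0.45/0.3 cutoffs echo thresholds the
    paper reports, though their exact roles are not spelled out here."""
    ambiguous = sum(1 for p in pref_probs if 0.45 <= p <= 0.55)
    return 1 if ambiguous / len(pref_probs) > ambiguous_frac else 2

def filter_by_status(scored_pairs, status):
    """scored_pairs: (pair, pref_prob) tuples self-labeled by the RM.
    Status 1 keeps only decisively separated pairs; Status 2 keeps moderately
    confident pairs, where the two answers are of similar quality."""
    if status == 1:
        return [(pair, p) for pair, p in scored_pairs if p > 0.55 or p < 0.45]
    return [(pair, p) for pair, p in scored_pairs if 0.55 < p < 0.8 or 0.2 < p < 0.45]
```

The curriculum effect comes from the switch: early on, only easy "good vs. bad" pairs are trusted; once few pairs remain ambiguous, training shifts to near-margin pairs.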
Novel Architectural Elements
Two-phase curriculum learning logic (Status 1 vs. Status 2) integrated directly into the self-training loop to switch data selection strategies based on model maturity
Modeling
Base Model: Mistral and Llama 3 (standard variants, e.g., 7B/8B; exact parameter counts are not detailed in the snippets provided)
Training Method: Self-Evolved Reward Learning (Iterative Retraining + PPO)
Objective Functions:
Pairwise ranking loss: -log(sigmoid(r_chosen - r_rejected))
Purpose: Train the reward model to rank the chosen response higher than the rejected one.
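A minimal sketch of this pairwise ranking loss, -log(sigmoid(r_chosen - r_rejected)):

```python
import math

def pairwise_loss(reward_chosen: float, reward_rejected: float) -> float:
    """-log(sigmoid(r_chosen - r_rejected)): near 0 when the chosen response
    scores much higher than the rejected one, large when the ranking flips."""
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# A correctly ranked pair incurs a small loss; a flipped pair a large one.
print(pairwise_loss(2.0, 0.0) < pairwise_loss(0.0, 2.0))  # True
```

At zero margin the loss is exactly log 2, so anything below that indicates the model ranks the chosen response higher.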
Comparisons
vs. RLHF (Full Data): SER achieves comparable results using only 15% of the data
vs. RLAIF: SER does not require a stronger teacher model (like GPT-4); the model evolves its own RM
vs. Self-Rewarding LM: SER explicitly separates the learning statuses (discrimination vs. comparison) to guide data selection, rather than just training on all self-generated data
vs. ReSTEM [not cited in paper]: ReSTEM uses self-training for reasoning; SER applies self-evolution specifically to the Reward Model for alignment
Limitations
Relies on the assumption that the seed model (15% data) is good enough to generate some valid signals; extremely poor seed models might fail to evolve.
The thresholds for status identification (0.55, 0.45) are heuristics that might require tuning for different datasets.
Iterative retraining adds computational overhead compared to single-stage training.
Performance gains diminish as iterations proceed and the model converges.
Resources available at https://aka.ms/ser. The paper specifies thresholds (0.55, 0.45, 0.3) and seed data percentage (15%). Full hyperparameters for PPO (learning rate, batch size) are not detailed in the provided text.
📊 Experiments & Results
Evaluation Setup
Reward Model performance evaluation and downstream LLM alignment evaluation via PPO.
Benchmarks:
HH-RLHF (Dialogue preference (Helpfulness and Harmlessness))
UltraFeedback (General instruction following preference)
Metrics:
Win Rate (implied via 'improvement')
Reward Model Accuracy (implied)
Downstream LLM performance metrics (implied)
Statistical methodology: Not explicitly reported in the paper
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
|-----------|--------|----------|------------|---|
| Various | Relative Data Usage (%) | 100 | 15 | -85 |
| Aggregate | Performance Improvement (%) | 0 | 7.88 | +7.88 |
Main Takeaways
Learning from self-feedback can robustly enhance Reward Model performance even with limited human-annotated data (15%).
The distinction between learning statuses (Status 1: Good vs Bad, Status 2: Fine-grained comparison) is crucial for selecting high-confidence data.
Performance improvements are consistent across different model sizes (Mistral, Llama 3) and datasets (HH-RLHF, UltraFeedback).
Self-evolution allows the model to eventually surpass the performance of the initial seed model and match models trained on full datasets.
📚 Prerequisite Knowledge
Prerequisites
Reinforcement Learning from Human Feedback (RLHF)
Proximal Policy Optimization (PPO)
Reward Modeling / Preference Learning
Knowledge Distillation / Self-Training
Key Terms
RLHF: Reinforcement Learning from Human Feedback—aligning models using a reward model trained on human preferences
RLAIF: Reinforcement Learning from AI Feedback—using AI-generated labels instead of human labels to train reward models
RM: Reward Model—a model that predicts a scalar score indicating how good a response is
PPO: Proximal Policy Optimization—an RL algorithm used to train the LLM policy to maximize rewards
Pairwise Loss: A loss function that trains the model to assign a higher score to the preferred response in a pair: -log(sigmoid(reward_winner - reward_loser))
Status 1: A learning state defined in this paper where the RM focuses on distinguishing clearly good answers from clearly bad ones
Status 2: A learning state defined in this paper where the RM focuses on distinguishing subtle differences between answers of similar quality
DPO: Direct Preference Optimization—an alternative to RLHF that optimizes the policy directly on preferences without a separate reward model