RLHF: Reinforcement Learning from Human Feedback—a method to train language models using human preferences as a reward signal
PPO: Proximal Policy Optimization—an RL algorithm used to update the language model policy based on rewards
Fine-grained reward: A reward signal provided for specific segments (e.g., sentences) or specific error types, rather than a single score for the whole output
Holistic reward: A single scalar score representing the overall quality of an entire generated sequence
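The contrast between the two reward types above can be sketched in a few lines of Python. The function names and scoring rules here are hypothetical placeholders (not from any specific library): a holistic reward returns one scalar for the whole output, while a fine-grained reward returns one score per sentence segment.

```python
def holistic_reward(output: str) -> float:
    # One scalar for the entire sequence (e.g., from a preference reward model).
    return 0.8  # placeholder value for illustration

def fine_grained_reward(output: str) -> list[float]:
    # One score per sentence-level segment: credit assignment becomes denser,
    # since each segment can be rewarded or penalized individually.
    segments = [s for s in output.split(". ") if s]
    return [0.9 if "fact" in s else 0.2 for s in segments]  # placeholder rule

out = "The sky is a fact. Unsupported claim here."
print(holistic_reward(out))      # single scalar for the whole output
print(fine_grained_reward(out))  # one score per segment
```

The practical difference is credit assignment: with a holistic reward, the RL algorithm must infer which part of a long output caused a low score, whereas fine-grained rewards localize the signal.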
RougeLSum: An automatic metric for evaluating text generation based on the longest common subsequence (LCS) between generated and reference text; the "Sum" variant computes ROUGE-L per sentence (split on newlines) and aggregates the scores
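A minimal sketch of the LCS-based ROUGE-L F-score that underlies RougeLSum (the full metric additionally splits on newlines, scores each sentence, and aggregates; tokenization here is naive whitespace splitting for illustration):

```python
def lcs_len(a: list[str], b: list[str]) -> int:
    # Standard dynamic-programming longest-common-subsequence length.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def rouge_l_f(candidate: str, reference: str) -> float:
    # F-measure combining LCS-based precision and recall.
    c, r = candidate.split(), reference.split()
    l = lcs_len(c, r)
    if l == 0:
        return 0.0
    precision, recall = l / len(c), l / len(r)
    return 2 * precision * recall / (precision + recall)

print(rouge_l_f("the cat sat on the mat", "the cat is on the mat"))
```

Because LCS preserves word order without requiring contiguous matches, ROUGE-L rewards sequences that follow the reference's structure even when individual words differ.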
Perplexity (PPL): A measurement of how well a probability model predicts a sample; lower values indicate the text is more 'natural' or predictable to the model
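Perplexity can be computed directly from per-token log-probabilities as the exponentiated negative mean log-likelihood; a small self-contained sketch (the log-probability values are made up for illustration):

```python
import math

def perplexity(token_logprobs: list[float]) -> float:
    # PPL = exp(-(1/N) * sum(log p(token_i))); lower means the model
    # finds the text more predictable.
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# A sequence the model finds likely (each token has probability 0.9)
likely = [math.log(0.9)] * 5
# A sequence the model finds surprising (each token has probability 0.1)
surprising = [math.log(0.1)] * 5

print(perplexity(likely))      # ~1.11
print(perplexity(surprising))  # 10.0
```

Intuitively, a perplexity of 10 means the model is, on average, as uncertain at each step as if it were choosing uniformly among 10 tokens.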
SFT: Supervised Fine-Tuning—training the model on high-quality demonstration data before applying RL
KL divergence penalty: A term added to the reward function to prevent the RL-trained model from deviating too far from the reference model (typically the SFT model it was initialized from)
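A common way to apply the penalty above is per token, subtracting a scaled log-probability ratio between the policy and the reference model from the reward. A minimal sketch, assuming a single-sample KL estimate and a hypothetical coefficient `beta` (values chosen for illustration):

```python
import math

def shaped_reward(reward: float, logp_policy: float, logp_ref: float,
                  beta: float = 0.1) -> float:
    # Per-token KL penalty estimate: log pi(a|s) - log pi_ref(a|s).
    # Subtracting beta * KL discourages the policy from drifting away
    # from the reference model while it chases reward.
    return reward - beta * (logp_policy - logp_ref)

# The policy assigns a token twice the reference probability, so the
# penalty reduces the effective reward passed to PPO.
r = shaped_reward(1.0, math.log(0.5), math.log(0.25), beta=0.1)
print(r)
```

Larger `beta` keeps the policy closer to the reference at the cost of slower reward improvement; smaller `beta` allows more drift and risks degenerate, reward-hacked outputs.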