RLHF: Reinforcement Learning from Human Feedback—aligning language models to follow human intent using rewards derived from preference data
Bradley-Terry model: A statistical model for estimating the probability that one item is preferred over another based on their latent reward scores
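The Bradley-Terry preference probability can be written as a sigmoid of the reward difference, P(A ≻ B) = σ(r_A − r_B). A minimal sketch (the function name and scalar-reward setup are illustrative, not from the source):

```python
import math

def bt_preference_prob(reward_a: float, reward_b: float) -> float:
    """Bradley-Terry probability that A is preferred over B:
    P(A > B) = exp(r_A) / (exp(r_A) + exp(r_B)) = sigmoid(r_A - r_B)."""
    return 1.0 / (1.0 + math.exp(reward_b - reward_a))
```

With equal rewards the model assigns probability 0.5 to either ordering; the two directed probabilities always sum to 1.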
Best-of-N: An inference strategy where N responses are generated, scored by a reward model, and the highest-scoring response is selected
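Best-of-N reduces to sample-then-argmax under a reward model. A minimal sketch, assuming a sampler and a scalar scoring function (both hypothetical callables standing in for the generator and reward model):

```python
from typing import Callable

def best_of_n(generate: Callable[[], str],
              score: Callable[[str], float],
              n: int) -> str:
    """Sample n candidate responses and return the one the
    reward model scores highest."""
    candidates = [generate() for _ in range(n)]
    return max(candidates, key=score)
```

For example, with `score=len` as a toy reward model, the longest of the n sampled responses is returned.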
Chain-of-Thought: A prompting technique where models generate intermediate reasoning steps before producing a final answer
Self-consistency: An inference technique that samples multiple reasoning paths and aggregates the results (e.g., via voting or averaging) to improve reliability
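For discrete answers, the aggregation step in self-consistency is typically a majority vote over the final answers extracted from each sampled reasoning path. A minimal voting sketch (assumes answers have already been extracted as strings):

```python
from collections import Counter
from typing import List

def self_consistency_vote(final_answers: List[str]) -> str:
    """Majority vote over final answers from independently
    sampled reasoning paths."""
    return Counter(final_answers).most_common(1)[0][0]
```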
SFT: Supervised Fine-Tuning—training a model on labeled input-output pairs
Off-policy training: Training on data generated by a different policy (e.g., oracle critiques) rather than the model's own current predictions
On-policy training: Training the model on its own generated outputs (self-generated critiques) to reduce distribution shift during inference
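The off-policy vs. on-policy distinction comes down to where the training targets originate. A schematic contrast (function and variable names are illustrative; `model` stands in for the policy being trained):

```python
from typing import Callable, List, Tuple

def off_policy_batch(dataset: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Off-policy: targets come from a fixed corpus of externally
    generated (e.g., oracle) critiques, not from the model itself."""
    return [(prompt, oracle_critique) for prompt, oracle_critique in dataset]

def on_policy_batch(model: Callable[[str], str],
                    prompts: List[str]) -> List[Tuple[str, str]]:
    """On-policy: targets are the model's own sampled outputs, so the
    training distribution matches what the model produces at inference."""
    return [(p, model(p)) for p in prompts]
```

In the on-policy case the batch changes as the model changes, which is what reduces the train/inference distribution shift mentioned above.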