LLM-as-a-Judge: Using a large language model to evaluate the quality of text generated by other models, serving as a scalable alternative to human annotators
SFT Warm-Up: Supervised Fine-Tuning phase where the model learns the format and reasoning style of a judge from high-quality demonstrations
DPO: Direct Preference Optimization—a stable method for aligning language models to preferences without training a separate reward model
CoT: Chain-of-Thought—a prompting technique where the model generates intermediate reasoning steps before the final answer
NLL loss: Negative Log-Likelihood loss, added as a regularization term during DPO to maintain generation quality
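The interaction between the DPO objective and the NLL regularizer can be sketched numerically. This is a minimal illustration, not the source's implementation: the coefficient values (`beta`, `nll_coef`) and the choice to apply the NLL term only to the chosen response are assumptions for the sake of the example.

```python
import math


def dpo_nll_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
                 beta=0.1, nll_coef=0.2):
    """DPO preference loss plus an NLL regularizer on the chosen response.

    logp_w / logp_l      : summed log-probabilities of the chosen (w) and
                           rejected (l) responses under the policy model
    ref_logp_w / ref_logp_l : the same quantities under the frozen
                           reference model
    beta, nll_coef       : illustrative hyperparameter values (assumed)
    """
    # Implicit reward margin: how much more the policy prefers the chosen
    # response over the rejected one, relative to the reference model.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # DPO term: -log(sigmoid(margin)), small when the margin is large.
    dpo = -math.log(1.0 / (1.0 + math.exp(-margin)))
    # NLL term: keeps the chosen response likely, preserving fluency.
    nll = -logp_w
    return dpo + nll_coef * nll
```

With a zero margin the DPO term reduces to `log 2`, and improving the chosen response's log-probability lowers the total loss through both terms.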
RewardBench: A benchmark dataset designed to evaluate reward models and judge models on their ability to correctly identify preferred responses
Position Bias: The tendency of a judge model to systematically favor a response based on its position (first or second) rather than its content; mitigated here by querying the judge with both response orders
Length Bias: The tendency of a judge model to prefer longer responses regardless of quality
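The order-swap mitigation mentioned under Position Bias can be sketched as follows. This is a hypothetical harness, not the source's code: `judge` stands in for any callable that returns `"first"` or `"second"`, and the tie-on-disagreement rule is one common convention.

```python
def judge_with_swap(judge, prompt, resp_a, resp_b):
    """Query the judge twice with the response order swapped.

    The verdict is kept only if it is consistent across both orderings;
    otherwise the comparison is treated as a tie, which filters out
    position-biased judgments.
    """
    v1 = judge(prompt, resp_a, resp_b)  # A shown first
    v2 = judge(prompt, resp_b, resp_a)  # B shown first
    win1 = "A" if v1 == "first" else "B"
    win2 = "B" if v2 == "first" else "A"
    return win1 if win1 == win2 else "tie"
```

A judge that always picks whichever response appears first is neutralized to a tie, while a judge that consistently prefers the same response regardless of position keeps its verdict.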