J1: Incentivizing Thinking in LLM-as-a-Judge via Reinforcement Learning

📝 Paper Summary

LLM-as-a-Judge | Reward Modeling | Reasoning/Chain-of-Thought

J1 transforms subjective preference tasks into verifiable RL challenges to train a thinking judge that produces chain-of-thought reasoning before outputting verdicts, achieving state-of-the-art evaluation performance using only synthetic data.

Core Problem

Standard reward models output scores without explicit reasoning, limiting their accuracy and interpretability, while existing LLM-as-a-Judge methods often rely on costly human data or lack direct optimization for reasoning quality.

Why it matters:

Evaluation quality bottlenecks AI progress; poor judges cannot reliably distinguish better models during training (RLHF) or benchmarking
Subjective tasks (e.g., chat) lack ground truth, making it difficult to apply verifiable reinforcement learning rewards to improve judgment capabilities
Standard pairwise judges suffer from severe position bias (favoring a response because of where it appears in the prompt rather than its quality), reducing reliability

Concrete Example: When evaluating a math problem response, a standard judge might fail to notice a subtle calculation error in step 3. J1 explicitly reasons: 'Checking step 3... calculation is wrong', identifies the error, and penalizes the response, whereas a non-thinking judge might hallucinate correctness based on surface form.

Key Novelty

J1 (Thinking-LLM-as-a-Judge via RL)

Unifies verifiable (math) and subjective (chat) tasks into a single format where 'correctness' is defined by verifiable rewards on synthetic preference pairs, allowing RL to optimize judgment across domains
Trains a 'thinking' judge using GRPO (Group Relative Policy Optimization) to generate explicit reasoning traces before the verdict, incentivized by verdict correctness and consistency rewards
Develops a multitask architecture that learns both pairwise (comparison) and pointwise (scoring) evaluation simultaneously to mitigate position bias and improve robustness

Architecture

The J1 training framework pipeline, illustrating the flow from synthetic data generation to RL training with verifiable rewards.

Evaluation Highlights

J1-Qwen-32B-MultiTask achieves 93.6 on RewardBench, outperforming all previous generative reward models and significantly larger scalar reward models
On PPE Correctness, J1-Qwen-32B-MultiTask scores 76.8%, outperforming DeepSeek-GRM-27B (59.8, +17.0 points) and EvalPlanner (70.2, +6.6 points) while using significantly less training data
J1-Llama-8B outperforms larger 27B scalar reward models (Skywork-Reward-Gemma-2-27B) on PPE Correctness (59.2 vs 54.7)

Breakthrough Assessment

9/10

Demonstrates that strong reasoning (thinking) can be induced in judges via RL on purely synthetic data, beating closed-source models (o1-mini) and much larger open models on judgment tasks.

⚙️ Technical Details

Problem Definition

Setting: Pairwise and Pointwise evaluation of LLM responses

Inputs: Instruction x and either a single response a (pointwise) or a pair of responses (a, b) (pairwise)

Outputs: Thought tokens t followed by a verdict y (preferred response) or score s

Pipeline Flow

Input Instruction & Response(s)
Thinking Generation (Chain-of-Thought)
Verdict/Score Generation
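
The three stages above amount to a single generation call followed by parsing. The Python sketch below is one possible way to split a judge generation into its reasoning trace and verdict. It assumes the judge wraps its reasoning in `<think>` tags and states the verdict as `[[A]]` or `[[B]]`; that output format and the `parse_judgment` helper are illustrative assumptions, not a format confirmed by this summary.

```python
import re
from typing import NamedTuple, Optional


class Judgment(NamedTuple):
    thought: str            # chain-of-thought text
    verdict: Optional[str]  # "A", "B", or None if no verdict was found


# Assumed (hypothetical) output format: <think>...</think> then [[A]] or [[B]].
_THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
_VERDICT_RE = re.compile(r"\[\[([AB])\]\]")


def parse_judgment(generation: str) -> Judgment:
    """Split a judge generation into its reasoning trace and final verdict."""
    think = _THINK_RE.search(generation)
    thought = think.group(1).strip() if think else ""
    # Search for the verdict only after the reasoning block, so labels
    # mentioned inside the reasoning are not mistaken for the decision.
    tail = generation[think.end():] if think else generation
    matches = _VERDICT_RE.findall(tail)
    verdict = matches[-1] if matches else None
    return Judgment(thought, verdict)
```

A generation with no parseable verdict returns `None`, which a training loop could treat as an incorrect verdict.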

System Modules

Thinking Generator

Generate intermediate reasoning trace (evaluation criteria, reference answer generation, critique)

Model or implementation: Llama-3.1-Instruct (8B, 70B) or Qwen3 (32B)

Verdict Head (Generative)

Output final decision based on reasoning

Model or implementation: Same shared LLM backbone as Thinking Generator

Novel Architectural Elements

Unified Multi-Task Head: Single model architecture capable of performing both Pointwise (scoring) and Pairwise (ranking) evaluations within the same context window
Consistency-Enforcing Training: Joint optimization where pointwise scores are trained via distant supervision from pairwise labels to ensure ranking consistency
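
The distant-supervision idea can be sketched in Python: the pointwise judge scores each response in a separate call, and the reward depends only on whether the two scores reproduce the known pairwise preference. The function name and the treatment of ties are assumptions made for illustration.

```python
def pointwise_pair_reward(score_chosen: float, score_rejected: float) -> float:
    """Reward pointwise scores using only a pairwise label.

    The two scores come from independent judge calls. No gold score
    exists; the only supervision is that the preferred response should
    receive the higher score. Assumption: ties earn no reward.
    """
    return 1.0 if score_chosen > score_rejected else 0.0
```

Under this scheme the judge is never told what score a response deserves, only that its scores must rank the pair correctly.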

Modeling

Base Model: Llama-3.1-Instruct (8B, 70B) and Qwen3 (32B)

Training Method: Online Reinforcement Learning using GRPO (Group Relative Policy Optimization)
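
GRPO's central step is to score each sampled judgment relative to the other judgments sampled for the same prompt. The Python sketch below shows that group normalization only, not the full policy update. The use of the population standard deviation and the `eps` stabilizer are assumptions; implementations differ on these details.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group of rollouts for the same prompt.

    Each rollout's advantage is its reward minus the group mean, divided
    by the group standard deviation. The group itself acts as the
    baseline, so no separate critic model is needed.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; eps avoids division by zero
    return [(r - mu) / (sigma + eps) for r in rewards]
```

When every rollout in a group earns the same reward, all advantages are zero and that prompt contributes no gradient signal.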

Objective Functions:

Purpose: Reward the model for correctly identifying the better response in a synthetic pair.

Formally: Binary reward (+1 if correct verdict, 0 otherwise).

Purpose: Penalize position bias.

Formally: Consistency reward (+1 only if verdicts for (a,b) and (b,a) are both correct, 0 otherwise).
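
A minimal Python sketch of the two rewards as described above, assuming verdicts are encoded as the labels "A" (first response is better) and "B" (second response is better). The function names and label encoding are illustrative, and how the two rewards are weighted or combined during training is not specified here, so they are kept separate.

```python
def verdict_reward(predicted: str, gold: str) -> float:
    """Binary reward: 1 if the judge picked the better response, else 0."""
    return 1.0 if predicted == gold else 0.0


def consistency_reward(verdict_ab: str, verdict_ba: str, gold_ab: str) -> float:
    """Reward 1 only if the judge is correct in both presentation orders.

    verdict_ab: verdict when the pair is shown as (a, b)
    verdict_ba: verdict when the same pair is shown as (b, a)
    gold_ab:    correct label for the (a, b) order
    """
    # Swapping the responses flips which position holds the better one.
    gold_ba = "B" if gold_ab == "A" else "A"
    both_correct = verdict_ab == gold_ab and verdict_ba == gold_ba
    return 1.0 if both_correct else 0.0
```

A judge that always answers "A" earns the verdict reward on half of the ordered examples but never earns the consistency reward, which is what makes the second term a penalty on position bias.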

Training Data:

22K total synthetic samples
17K WildChat (subjective) prompts with noisy-instruction rejected pairs
5K MATH (verifiable) prompts with incorrect-answer rejected pairs
Data augmented with swapped orders (x,a,b) and (x,b,a)
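
The order augmentation in the last item can be sketched in Python as follows. Each preference pair yields two training examples, one per presentation order, with the gold label flipped accordingly. The field names and the `make_order_augmented_pair` helper are hypothetical.

```python
from typing import Dict, List


def make_order_augmented_pair(prompt: str, chosen: str, rejected: str) -> List[Dict[str, str]]:
    """Build both presentation orders for one (chosen, rejected) pair."""
    return [
        # Order (x, a, b): the better response is shown first.
        {"instruction": prompt, "response_a": chosen, "response_b": rejected, "label": "A"},
        # Order (x, b, a): the better response is shown second.
        {"instruction": prompt, "response_a": rejected, "response_b": chosen, "label": "B"},
    ]
```

Because both orders appear with opposite labels, a judge cannot score well by learning a fixed positional preference.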

Key Hyperparameters:

max_thought_length: approx 500 tokens (converged length)
framework: verl

Compute: Not reported in the paper

Comparison to Prior Work

vs. EvalPlanner: J1 uses online RL (GRPO) with verifiable rewards instead of offline DPO, and unifies verifiable/subjective tasks
vs. DeepSeek-R1: J1 is specifically specialized for judgment tasks using a targeted reward scheme (consistency + verdict correctness), whereas R1 is a general reasoner
vs. Scalar RMs (Skywork, Armo): J1 generates explicit reasoning traces (interpretable) and performs better with smaller model sizes

Limitations

Relies on synthetic data quality; bias in synthetic data generation could propagate to the judge
Pointwise scoring in isolation is still harder than pairwise comparison despite multitask training improvements
Inference cost is higher than scalar reward models due to generation of thinking tokens

Reproducibility

Code is not provided. The synthetic training data generation process is described (using WildChat and MATH). Seed prompts for thinking are provided in Appendix B. Model weights are not explicitly linked.

📊 Experiments & Results

Evaluation Setup

Pairwise judgment accuracy across diverse benchmarks covering math, code, and chat

Benchmarks:

PPE (Preference Proxy Evaluations): subjective preference and verifiable correctness (MATH, MMLU-Pro, etc.)
RewardBench: general reward model benchmark (Chat, Safety, Reasoning)
JudgeBench: challenging response pairs (Knowledge, Math, Coding)
RM-Bench: robustness to subtle differences and style bias
FollowBenchEval: constraint satisfaction evaluation

Metrics:

Accuracy (Random Order)
Position-Consistent Accuracy (correct on both orders)

Statistical methodology: p-values reported for PPE Correctness comparisons (p<0.0001)
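
Position-consistent accuracy, as listed under Metrics, counts a pair as correct only when the judge picks the better response in both presentation orders. A minimal Python sketch, assuming verdicts are encoded as "A"/"B" labels; the function name and input layout are illustrative.

```python
from typing import Iterable, Tuple


def position_consistent_accuracy(results: Iterable[Tuple[str, str, str]]) -> float:
    """Fraction of pairs judged correctly in both orders.

    Each item is (verdict_ab, verdict_ba, gold_ab): the verdict for order
    (a, b), the verdict for order (b, a), and the gold label for (a, b).
    """
    total = 0
    consistent = 0
    for verdict_ab, verdict_ba, gold_ab in results:
        gold_ba = "B" if gold_ab == "A" else "A"  # gold label flips with the order
        total += 1
        if verdict_ab == gold_ab and verdict_ba == gold_ba:
            consistent += 1
    if total == 0:
        raise ValueError("no results to score")
    return consistent / total
```

This metric is stricter than random-order accuracy: a judge that is right in only one of the two orders gets no credit for that pair.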

Key Results

Performance on PPE Correctness, a benchmark linking reward models to real-world preference performance.

Benchmark	Metric	Baseline	This Paper	Δ
PPE Correctness	Accuracy	70.2	76.8	+6.6
PPE Correctness	Accuracy	59.8	76.8	+17.0
PPE Correctness	Accuracy	54.7	59.2	+4.5

Performance on RewardBench, a standard leaderboard for reward models.

Benchmark	Metric	Baseline	This Paper	Δ
RewardBench	Overall Score	91.2	93.6	+2.4

Comparison against general-purpose Thinking-LLMs.

Benchmark	Metric	Baseline	This Paper	Δ
PPE Correctness	Accuracy	73.9	76.8	+2.9
PPE Correctness	Accuracy	74.7	76.8	+2.1

Experiment Figures

Distribution of scores and score differences for Pointwise vs Pairwise J1 models.

Effect of test-time scaling (Best-of-N / Self-Consistency) on position-consistent accuracy and tie rates.

Training dynamics: Reward curves and Thought Length over training steps.

Main Takeaways

Online RL (GRPO) with verifiable rewards significantly outperforms offline DPO (EvalPlanner) for training judges.
Unified training on verifiable (Math) and subjective (Chat) tasks creates a robust generalist judge.
Multitask training (Pointwise + Pairwise) yields the best performance, leveraging the consistency of pointwise scoring and the discriminatory power of pairwise comparison.
The 'thinking' process allows the model to generate reference answers and criteria, leading to better judgments than scalar models or non-thinking LLMs.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning from Human Feedback (RLHF)
LLM-as-a-Judge paradigm
Chain-of-Thought (CoT) prompting

Key Terms

LLM-as-a-Judge: Using a Language Model to evaluate the quality of text generated by other models, often replacing human annotation

GRPO: Group Relative Policy Optimization, an RL algorithm that samples a group of outputs for the same input and scores each one relative to the group's average reward, which removes the need for a separate critic model

CoT: Chain-of-Thought, a prompting technique that encourages models to generate intermediate reasoning steps before the final answer

DPO: Direct Preference Optimization, an offline method for aligning language models to preferences without explicit reward modeling

Verifiable Rewards: Rewards based on objectively checkable outcomes (e.g., correct math answer, correct preference prediction) rather than learned approximations

Position Bias: The tendency of LLM judges to prefer the first (or second) option presented, regardless of actual quality

Pointwise Evaluation: Evaluating a single response in isolation to assign it a score

Pairwise Evaluation: Comparing two responses side-by-side to determine which is better

Synthetic Data: Training data generated by AI models rather than collected from humans

Distant Supervision: Training a model (here, pointwise judge) using labels from a related task (pairwise preference) rather than direct annotations