The Perfect Blend: Redefining RLHF with Mixture of Judges

📝 Paper Summary

Reinforcement Learning from Human Feedback (RLHF)
Multi-Task Learning (MTL)
Safety and Alignment

CGPO treats multi-task LLM alignment as a constrained optimization problem, using specific judges to detect reward hacking and enforcing constraints per task rather than blending conflicting rewards linearly.

Core Problem

Standard RLHF struggles in multi-task settings for two reasons: linearly combining rewards creates conflicting objectives (e.g., safety vs. helpfulness), and over-optimization leads to reward hacking, where the model exploits flaws in the proxy reward.

Why it matters:

Linear reward combinations force tradeoffs, degrading performance on one or both tasks (e.g., sacrificing helpfulness for safety)
Models optimized for a single proxy reward eventually produce high-scoring but gibberish or harmful outputs (reward hacking)
Applying the same RL hyperparameters to diverse tasks (e.g., coding vs. creative writing) keeps each task from reaching its best performance

Concrete Example: In coding tasks, PPO often exhibits 'severe regression' where the model learns to generate code that tricks the reward model but fails execution tests (0-shot accuracy drops). CGPO uses execution feedback as a hard constraint to prevent this.

Key Novelty

Constrained Generative Policy Optimization (CGPO)

Treats alignment as a constrained problem: maximize rewards subject to satisfying specific rules (constraints) rather than just maximizing a sum of rewards.
Uses a 'Mixture of Judges' (rule-based and LLM-based) to continuously monitor outputs during training; if a constraint is violated (e.g., code doesn't compile), the update is penalized.
Separates tasks completely during optimization, applying distinct reward models and hyperparameters to each instead of a single global mix.
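The difference between blending rewards and enforcing a constraint first can be illustrated with a toy sketch. The candidate responses, scores, weights, and the `is_safe` verdict below are made-up values chosen for illustration; they are not taken from the paper, and the selection functions are a simplification rather than CGPO itself.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    text: str
    helpfulness: float  # hypothetical reward-model score in [0, 1]
    safety: float       # hypothetical safety score in [0, 1]
    is_safe: bool       # hypothetical verdict from a safety judge


def pick_by_linear_blend(cands: List[Candidate],
                         w_help: float = 0.7,
                         w_safe: float = 0.3) -> Candidate:
    """Pick the candidate with the highest weighted sum of the two rewards."""
    return max(cands, key=lambda c: w_help * c.helpfulness + w_safe * c.safety)


def pick_with_constraint(cands: List[Candidate]) -> Optional[Candidate]:
    """Drop judge-rejected candidates first, then maximize helpfulness."""
    feasible = [c for c in cands if c.is_safe]
    if not feasible:
        return None
    return max(feasible, key=lambda c: c.helpfulness)


# Illustrative candidates only.
CANDIDATES = [
    Candidate("detailed but unsafe", 0.95, 0.20, False),
    Candidate("helpful and safe", 0.80, 0.50, True),
    Candidate("blanket refusal", 0.10, 1.00, True),
]
```

With these numbers the weighted sum prefers the unsafe response (0.725 vs. 0.71), while the constraint-first rule returns the helpful and safe one. Changing the weights can flip the blended choice, which is the tuning burden the constrained formulation is meant to avoid.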

Architecture

Overview of the CGPO pipeline. It shows the flow from Task-Specific Prompts -> LLM Policy -> Response -> Mixture of Judges (Rule & LLM based) -> Constrained RL Optimizer.

Evaluation Highlights

+7.4% improvement over PPO on AlpacaEval-2 (General Chat) using the CRRAFT optimizer
+12.5% improvement over PPO on Arena-Hard (STEM & Reasoning) using the CRRAFT optimizer
Eliminates reward hacking regression in coding: PPO's HumanEval score drops significantly during training, while CGPO maintains steady improvement (+5% over PPO)

Breakthrough Assessment

8/10

Strong empirical gains across diverse benchmarks and a principled solution to the well-known reward hacking problem. The move from linear reward mixing to constrained optimization is a significant methodological shift.

⚙️ Technical Details

Problem Definition

Setting: Constrained Markov Decision Process (CMDP) for Language Modeling

Inputs: Prompt s

Outputs: Response a
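Written out, the constrained problem for this setting might look like the following. This is a sketch that reuses the summary's own symbols (r, Cost, alpha, delta, pi_ref); the prompt distribution D is notation introduced here, and treating Cost(s,a) as a judge-derived violation signal is an assumption.

```latex
\begin{aligned}
\max_{\pi}\quad
  & \mathbb{E}_{s \sim \mathcal{D},\; a \sim \pi(\cdot \mid s)}\big[\, r(s,a) \,\big] \\
\text{s.t.}\quad
  & \mathbb{E}_{s \sim \mathcal{D},\; a \sim \pi(\cdot \mid s)}\big[\, \mathrm{Cost}(s,a) \,\big] \le \alpha, \\
  & \mathrm{KL}\big(\pi \,\|\, \pi_{\mathrm{ref}}\big) \le \delta
\end{aligned}
```

If Cost(s,a) is 1 when a judge flags the response and 0 otherwise, the first constraint bounds the probability of a violation by alpha.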

Pipeline Flow

Task-Specific Setup: Prompts segregated by task (Chat, Math, Coding, etc.)
Generation: Policy generates responses
Evaluation: Mixture of Judges (MoJ) evaluates responses for constraint violations
Optimization: Constrained RL update (CRPG, CODPO, or CRRAFT) applied independently per task
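The four steps above can be sketched as a small loop. Everything here is a stand-in: the policy is any function from prompt to response, the judges are plain callables, and the per-task update is a placeholder where CRPG, CODPO, or CRRAFT would go. The function names are hypothetical and not from the paper.

```python
from typing import Callable, Dict, List, Tuple

Judge = Callable[[str, str], bool]   # (prompt, response) -> True if no violation
Labeled = Tuple[str, str, bool]      # (prompt, response, constraints_satisfied)


def generate(policy: Callable[[str], str],
             prompts: List[str]) -> List[Tuple[str, str]]:
    """Step 2: the policy produces one response per prompt."""
    return [(p, policy(p)) for p in prompts]


def evaluate(pairs: List[Tuple[str, str]], judges: List[Judge]) -> List[Labeled]:
    """Step 3: a response satisfies the constraints only if every judge accepts it."""
    return [(p, r, all(j(p, r) for j in judges)) for p, r in pairs]


def run_step(policy: Callable[[str], str],
             prompts_by_task: Dict[str, List[str]],
             judges_by_task: Dict[str, List[Judge]],
             update_by_task: Dict[str, Callable[[List[Labeled]], None]],
             ) -> Dict[str, float]:
    """Run steps 1-4 for every task; return the constraint-satisfaction rate per task."""
    rates: Dict[str, float] = {}
    for task, prompts in prompts_by_task.items():   # step 1: task-specific prompts
        if not prompts:
            continue
        labeled = evaluate(generate(policy, prompts), judges_by_task[task])
        update_by_task[task](labeled)               # step 4: placeholder update
        rates[task] = sum(ok for _, _, ok in labeled) / len(labeled)
    return rates
```

Keeping judges and updates in per-task dictionaries mirrors the idea that each task is handled independently, so one task's judges never score another task's responses.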

System Modules

Policy Model

Generate responses to prompts

Model or implementation: Llama-3-70B-Instruct (initialized from SFT)

Mixture of Judges (MoJ)

Check for constraint violations (Reward Hacking detection)

Model or implementation: Hybrid: Rule-based (e.g., code execution, regex) and LLM-based judges
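A minimal sketch of what rule-based judges could look like, assuming Python responses for the coding task. The function names and the template-tag regex are hypothetical; an LLM-based judge would be another callable with the same signature. A real system would run untrusted code in a sandbox, so the bare `exec()` below is for illustration only.

```python
import re
from typing import Callable, List

Judge = Callable[[str], bool]  # returns True when the response passes


def compiles(code: str) -> bool:
    """Rule-based judge: does the response parse as Python source?"""
    try:
        compile(code, "<response>", "exec")
        return True
    except (SyntaxError, ValueError):
        return False


def passes_tests(code: str, tests: str) -> bool:
    """Rule-based judge: run the code, then its assert-based tests, in a fresh namespace."""
    namespace: dict = {}
    try:
        exec(code, namespace)   # illustration only; sandbox this in practice
        exec(tests, namespace)
        return True
    except Exception:
        return False


def no_leaked_template(response: str) -> bool:
    """Regex judge: reject responses that leak a (hypothetical) template tag."""
    return re.search(r"<\|[a-z_]+\|>", response) is None


def mixture_of_judges(response: str, judges: List[Judge]) -> bool:
    """The response satisfies the constraints only if every judge accepts it."""
    return all(judge(response) for judge in judges)
```

Requiring every judge to pass is one possible aggregation rule; the summary does not state how individual verdicts are combined.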

Constrained Optimizer

Update policy weights to maximize reward subject to MoJ constraints

Model or implementation: CRPG / CODPO / CRRAFT algorithms

Novel Architectural Elements

Task-segregated optimization pipeline: Each task (e.g., Math, Coding) has its own optimizer instance, reward model, and judge set, rather than one global loop.
Primal-type constrained update mechanism: Incorporates constraint satisfaction directly into the policy update targets without complex dual-variable (Lagrangian) loops.
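One way to picture task-segregated optimization is a per-task configuration registry. The sketch below is illustrative only: the reward-model and judge identifiers are hypothetical names, and the learning rates are placeholders, since the paper's hyperparameters are not reported in this summary.

```python
from dataclasses import dataclass, field
from typing import Dict, List

ALLOWED_OPTIMIZERS = {"CRPG", "CODPO", "CRRAFT"}


@dataclass
class TaskConfig:
    reward_model: str                  # hypothetical identifier
    judges: List[str]                  # hypothetical judge names
    optimizer: str                     # one of ALLOWED_OPTIMIZERS
    hyperparams: Dict[str, float] = field(default_factory=dict)

    def __post_init__(self) -> None:
        if self.optimizer not in ALLOWED_OPTIMIZERS:
            raise ValueError(f"unknown optimizer: {self.optimizer}")


# Placeholder values; none of these settings come from the paper.
TASKS: Dict[str, TaskConfig] = {
    "general_chat": TaskConfig(
        reward_model="rm_helpfulness",
        judges=["false_refusal_judge"],
        optimizer="CRRAFT",
        hyperparams={"learning_rate": 1e-6},
    ),
    "coding": TaskConfig(
        reward_model="rm_code",
        judges=["unit_test_judge"],
        optimizer="CRPG",
        hyperparams={"learning_rate": 5e-7},
    ),
}


def config_for(task: str) -> TaskConfig:
    """Look up the settings for one task; raises KeyError for unknown tasks."""
    return TASKS[task]
```

Because each entry carries its own reward model, judges, and hyperparameters, changing the coding setup never touches the chat setup.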

Modeling

Base Model: Llama-3-70B-Instruct

Training Method: Constrained Generative Policy Optimization (CGPO) with three variants: CRPG, CODPO, CRRAFT

Objective Functions:

Purpose: Maximize expected reward while ensuring the probability of violating constraints is below a threshold.
Formally: Maximize E[r(s,a)] subject to E[Cost(s,a)] <= alpha and KL(pi || pi_ref) <= delta.

Purpose: CRPG (Calibrated-Regularized Policy Gradient) objective.
Formally: Modifies the advantage function to penalize trajectories that violate constraints, effectively filtering them out or down-weighting them.

Purpose: CRRAFT (Calibrated-Regularized Reward Ranking Finetuning) objective.
Formally: Filters generated data to keep only samples satisfying constraints, then effectively performs rejection sampling/fine-tuning on the top-ranked survivors.
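The two ideas described above, penalizing violating samples through the advantage and filtering then ranking survivors, can be sketched in a few lines. This follows only the prose description in this summary, not the paper's exact formulas: the mean baseline, the `penalty` parameter, and the function names are assumptions, and the calibration and regularization terms are omitted.

```python
from typing import List, Sequence


def masked_advantages(rewards: Sequence[float],
                      satisfied: Sequence[bool],
                      penalty: float = 0.0) -> List[float]:
    """CRPG-flavored sketch: mean-baseline advantages for constraint-satisfying
    samples; violating samples get -penalty (0.0 removes them from the update)."""
    baseline = sum(rewards) / len(rewards)
    return [(r - baseline) if ok else -penalty
            for r, ok in zip(rewards, satisfied)]


def filter_and_rank(samples: Sequence[str],
                    rewards: Sequence[float],
                    satisfied: Sequence[bool],
                    top_k: int = 1) -> List[str]:
    """CRRAFT-flavored sketch: keep constraint-satisfying samples, then
    return the top_k by reward as fine-tuning targets."""
    survivors = [(r, s) for s, r, ok in zip(samples, rewards, satisfied) if ok]
    survivors.sort(key=lambda pair: pair[0], reverse=True)
    return [s for _, s in survivors[:top_k]]
```

In both cases the highest-reward sample contributes nothing positive if a judge rejects it, which is how a constraint can block a reward-hacked response from being reinforced.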

Adaptation: Full fine-tuning (implied by context of Llama-3-70B RLHF)

Trainable Parameters: All parameters (standard RLHF setting)

Key Hyperparameters:

kl_coefficient: Not explicitly reported in the paper
learning_rate: Not explicitly reported in the paper
batch_size: Not explicitly reported in the paper

Compute: Not reported in the paper

Comparison to Prior Work

vs. PPO: Standard multi-task PPO pipelines optimize a linear combination of rewards and are prone to reward hacking; CGPO uses judge-enforced constraints to suppress reward-hacking behaviors and optimizes each task separately.
vs. DPO: DPO is typically offline and lacks explicit constraint handling mechanisms for runtime generation checks.
vs. IPO (Identity Preference Optimization) [not cited in paper]: IPO also adds regularization to DPO but does not explicitly handle multi-objective constraints via judges like CGPO.

Limitations

Requires designing specific judges (MoJ) for each task, which may be non-trivial for subjective tasks.
Computational cost of running multiple judges (especially LLM-based ones) during training is high.
Specific hyperparameters for the new optimizers (CRPG, CRRAFT) are not fully detailed.

📊 Experiments & Results

Evaluation Setup

Multi-task post-training across 5 domains: General Chat, Instruction Following, Math and Coding Reasoning, Engagement, and Safety.

Benchmarks:

AlpacaEval-2 (General Chat)
Arena-Hard (STEM & Reasoning)
IFEval (Instruction Following)
MATH (Math Reasoning)
GSM8K (Math Reasoning)
HumanEval (Coding)
MBPP (Coding)
ARC Challenge (Knowledge)

Metrics:

Win rates (AlpacaEval, Arena-Hard)
Accuracy (MATH, GSM8K, ARC)
Pass@1 (HumanEval, MBPP)
False Refusal Ratio (Safety)
Statistical methodology: Not explicitly reported in the paper
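For readers unfamiliar with these metrics, simplified versions can be computed as below. These are generic textbook definitions, not the benchmarks' official scoring: AlpacaEval-2 and Arena-Hard apply their own adjustments to win rates, and counting a tie as half a win is an assumption made here.

```python
from typing import Sequence


def pass_at_1(passed: Sequence[bool]) -> float:
    """Fraction of problems whose single generated solution passes all tests."""
    return sum(passed) / len(passed)


def win_rate(outcomes: Sequence[str]) -> float:
    """Share of pairwise comparisons won; a tie counts as half a win (assumption)."""
    score = sum(1.0 if o == "win" else 0.5 if o == "tie" else 0.0
                for o in outcomes)
    return score / len(outcomes)


def false_refusal_ratio(refused_benign: Sequence[bool]) -> float:
    """Fraction of benign prompts that the model refused to answer."""
    return sum(refused_benign) / len(refused_benign)
```

Lower is better for the false refusal ratio; higher is better for the other two.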

Key Results

CGPO variants (CRRAFT and CRPG) consistently outperform PPO baselines across general chat and reasoning benchmarks.

Benchmark	Metric	Baseline	This Paper	Δ
AlpacaEval-2	Win Rate %	32.0	39.4	+7.4
Arena-Hard	Win Rate %	40.0	52.5	+12.5
IFEval	Accuracy %	Not reported in text	Not reported in text	+2.0
HumanEval	Pass@1 %	Not reported in text	Not reported in text	+5.0
MATH	Accuracy %	Not reported in text	Not reported in text	+2.0

Main Takeaways

CRRAFT is superior for open-ended generation tasks (AlpacaEval, Arena-Hard, TruthfulQA).
CRPG is superior for reasoning and strict-criteria tasks (MATH, GSM8K, Coding, ARC).
PPO suffers severe reward hacking in coding tasks (performance drops after N steps), while CGPO with code-execution judges prevents this and continues to improve.
Task-specific optimization prevents the 'alignment tax' where improving safety degrades helpfulness, allowing better Pareto frontiers.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (RL) fundamentals
Proximal Policy Optimization (PPO)
Direct Preference Optimization (DPO)
Constrained Optimization (Primal-Dual methods)

Key Terms

RLHF: Reinforcement Learning from Human Feedback—fine-tuning models using rewards derived from human preferences

Reward Hacking: When an RL agent exploits flaws in the reward model to get high scores without actually satisfying the user's intent

Mixture of Judges (MoJ): A set of evaluators (rule-based or model-based) that check specific constraints (e.g., 'is the code valid?', 'is the response safe?')

PPO: Proximal Policy Optimization—a standard RL algorithm that limits how much the policy changes in each step

DPO: Direct Preference Optimization—an algorithm that optimizes the policy directly from preference data without an explicit reward model loop

CRPG: Calibrated-Regularized Policy Gradient, one of the proposed optimizers; a policy-gradient method that penalizes constraint-violating trajectories through the advantage function

CRRAFT: Calibrated-Regularized Reward Ranking Finetuning—a proposed optimizer based on filtering and ranking samples

CODPO: Constrained Online Direct Preference Optimization—a proposed constrained version of online DPO

Pareto optimal: A state where no objective can be improved without worsening another (e.g., safety cannot be improved further without hurting helpfulness)

SFT: Supervised Fine-Tuning—the initial phase of training on high-quality demonstrations

KL penalty: Kullback-Leibler divergence penalty—keeps the RL model from drifting too far from the reference model
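As a concrete illustration of the KL penalty, the sketch below subtracts a scaled log-probability ratio from the reward. It is a generic RLHF-style formulation, not CGPO's specific regularizer; the coefficient `beta` is a placeholder because the paper's KL coefficient is not reported here, and the function names are hypothetical.

```python
from typing import Sequence


def sequence_log_ratio(policy_logprobs: Sequence[float],
                       ref_logprobs: Sequence[float]) -> float:
    """Sum over tokens of log pi(a_t) - log pi_ref(a_t): a single-sample
    estimate of KL(pi || pi_ref) for one sampled response."""
    if len(policy_logprobs) != len(ref_logprobs):
        raise ValueError("log-prob sequences must have the same length")
    return sum(p - q for p, q in zip(policy_logprobs, ref_logprobs))


def kl_penalized_reward(reward: float,
                        policy_logprobs: Sequence[float],
                        ref_logprobs: Sequence[float],
                        beta: float = 0.1) -> float:
    """Reward minus beta times the estimated divergence from the reference model."""
    return reward - beta * sequence_log_ratio(policy_logprobs, ref_logprobs)
```

The larger `beta` is, the more a response that the reference model finds unlikely is discounted, which limits how far the policy can drift in pursuit of reward.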