Joint Reward Modeling: Internalizing Chain-of-Thought for Efficient Visual Reward Models

📝 Paper Summary

Visual Reward Modeling
Reinforcement Learning from Human Feedback (RLHF)

JRM aligns discriminative reward models with generative reasoning capabilities by jointly optimizing preference ranking and language modeling on a shared backbone, enabling efficient inference without explicit text generation.

Core Problem

Existing reward models face a trade-off: discriminative models are efficient but struggle with complex semantics, while generative models have strong reasoning but are computationally expensive and difficult to align with human preferences.

Why it matters:

Image editing tasks require verifying global semantic consistency and logical constraints, which shallow discriminative models often miss
Generative reward models (using explicit Chain-of-Thought) introduce high latency, making them impractical for large-scale online reinforcement learning loops
The mismatch between language modeling objectives and preference ranking makes it hard to align generative models directly with human comparison data

Concrete Example: In an image editing task, a discriminative model might rely on surface-level similarity and fail to penalize a result that ignores a 'change material to wood' instruction. A generative model could explain the error but is too slow to use repeatedly during RL training. JRM internalizes this reasoning into the score.

Key Novelty

Latent Chain-of-Thought (Latent CoT)

Jointly trains a shared vision-language backbone on two objectives: preference ranking (discriminative) and explanation generation (generative)
Language supervision forces the shared representation to encode deep semantic structure and reasoning logic
At inference time, the language head is discarded, retaining the 'internalized' reasoning capabilities in the efficient discriminative scoring head

Architecture

The JRM framework illustrating the shared backbone and dual-head training strategy.

Evaluation Highlights

Achieves 85.1% accuracy on EditReward-Bench, outperforming GPT-5 (75.5%) by 9.6 percentage points
Reaches a 69.3% composite score on the MMRB2 benchmark, surpassing GPT-5 (61.9%) by 7.4 percentage points
Increases effective feature space rank to 91.77 (vs. 46.86 for baseline), indicating prevention of representation collapse

Breakthrough Assessment

9/10

Proposes an approach that bridges the efficiency of discriminative models with the reasoning depth of generative models, backed by state-of-the-art results against strong baselines such as GPT-5.

⚙️ Technical Details

Problem Definition

Setting: Visual Reward Modeling for Image Editing

Inputs: Input image pair (pre-edit, post-edit) and editing instruction c

Outputs: Scalar reward score r representing alignment quality

Pipeline Flow

Vision-Language Backbone (Shared Encoder)
Reward Head (Discriminative Scoring)
Language Head (Generative Explanation - Training Only)

System Modules

Vision-Language Backbone

Encodes image and instruction into a shared latent representation h

Model or implementation: Shared vision-language backbone (specific architecture not named in text)

Reward Head

Maps the shared representation to a scalar reward score

Model or implementation: Lightweight discriminative head (Linear/MLP)

Language Head

Generates semantic evaluations/explanations to enforce reasoning structure in the shared representation

Model or implementation: Conditional language generation head

Novel Architectural Elements

Training-Inference Decoupling: Uses a dual-head architecture during training to inject reasoning, but physically removes the generative path during inference to maintain discriminative speed
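
The decoupling described above can be sketched with a toy container. This is a minimal illustration, not the paper's implementation: the class name JointRewardModel, the method names, and the toy_* stand-in functions are all hypothetical, and the real backbone and heads are neural networks rather than plain functions.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Vector = List[float]


@dataclass
class JointRewardModel:
    """Hypothetical container: shared backbone plus two heads."""
    backbone: Callable[[str, str, str], Vector]      # (pre, post, instruction) -> h
    reward_head: Callable[[Vector], float]           # h -> scalar reward
    language_head: Optional[Callable[[Vector], str]] = None  # h -> explanation

    def training_outputs(self, pre: str, post: str, instruction: str) -> Tuple[float, str]:
        # Both heads read the same representation h, so language
        # supervision shapes the features the reward head relies on.
        if self.language_head is None:
            raise RuntimeError("language head has been removed")
        h = self.backbone(pre, post, instruction)
        return self.reward_head(h), self.language_head(h)

    def strip_for_inference(self) -> "JointRewardModel":
        # Drop the generative path; backbone and reward head are kept as-is.
        return JointRewardModel(self.backbone, self.reward_head, None)

    def score(self, pre: str, post: str, instruction: str) -> float:
        # Inference path: a single forward pass, no text generation.
        return self.reward_head(self.backbone(pre, post, instruction))


# Toy stand-ins (assumptions for illustration only).
def toy_backbone(pre: str, post: str, instruction: str) -> Vector:
    return [float(len(pre)), float(len(post)), float(len(instruction))]


def toy_reward_head(h: Vector) -> float:
    return sum(h) / len(h)


def toy_language_head(h: Vector) -> str:
    return "explanation conditioned on " + str(h)
```

Calling strip_for_inference() on a trained model leaves the score unchanged for the same input, which is the property the summary attributes to JRM: the language head only influences the shared representation during training.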

Modeling

Base Model: Shared vision-language backbone

Training Method: Joint Multi-Objective Optimization

Objective Functions:

Purpose: Enforce relative preference ordering between samples.

Formally: Uncertainty-aware ranking loss L_rank = -log(sigmoid((mu_w - mu_l) / sqrt(sigma_w^2 + sigma_l^2)))

Purpose: Enforce global semantic structure and reasoning capability.

Formally: Cross-entropy loss L_lang over target explanation y

Purpose: Combine objectives.

Formally: L_total = L_rank + alpha * L_lang
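
The three objectives can be written out for a single preference pair as a small numeric sketch. Several details are assumptions rather than statements from the paper: mu and sigma are read as a predicted reward mean and standard deviation for the preferred (w) and rejected (l) samples, the language loss is averaged over tokens, and the function names are hypothetical.

```python
import math


def rank_loss(mu_w: float, mu_l: float, sigma_w: float, sigma_l: float) -> float:
    """L_rank = -log(sigmoid((mu_w - mu_l) / sqrt(sigma_w^2 + sigma_l^2)))."""
    z = (mu_w - mu_l) / math.sqrt(sigma_w ** 2 + sigma_l ** 2)
    # -log(sigmoid(z)) equals log(1 + exp(-z)); branch to avoid overflow.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


def lang_loss(target_token_probs) -> float:
    """Cross-entropy over the target explanation y.

    target_token_probs holds the probability the language head assigned
    to each ground-truth token (token-level averaging is an assumption).
    """
    probs = list(target_token_probs)
    return -sum(math.log(p) for p in probs) / len(probs)


def total_loss(l_rank: float, l_lang: float, alpha: float = 0.7) -> float:
    """L_total = L_rank + alpha * L_lang, with alpha = 0.7 as listed below."""
    return l_rank + alpha * l_lang
```

Dividing the score gap by the combined standard deviation means that a pair predicted with high uncertainty contributes a smaller effective margin than a confident pair with the same gap in means.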

Training Data:

Image editing reward datasets (from Wu et al., 2025d)
Augmented with automated multimodal evaluation explanations (Bai et al., 2023, 2025)

Key Hyperparameters:

alpha: 0.7 (Language supervision weight)

Compute: Zero test-time reasoning overhead (inference uses only discriminative head)

Comparison to Prior Work

vs. EditReward: JRM adds a language modeling objective to internalize reasoning, whereas EditReward relies solely on ranking loss
vs. Generative RMs (e.g., GPT-5): JRM is discriminative at inference time (fast) while GPT-5 requires slow text generation
vs. EditScore: JRM jointly optimizes representations rather than just distilling final scores [not cited in paper]

Reproducibility

Code: https://github.com/Kwai-Keye/JRM-Joint-Reward-Modeling

Uses existing benchmarks (EditReward-Bench, MMRB2) and a standard RL algorithm (Flow-GRPO). The specific backbone architecture is not named in the provided text.

📊 Experiments & Results

Evaluation Setup

Reward modeling benchmarks and Online RL (Flow-GRPO) for image editing

Benchmarks:

EditReward-Bench (Image Editing Evaluation)
MMRB2 (Multimodal Reward Benchmarking)
GEdit-Bench (Downstream Image Editing Generation)

Metrics:

Accuracy (Prompt Following, Overall)
Composite Score (Pointwise, Pairwise)
Effective Rank (Representation Space Analysis)
Improvement Delta (in downstream RL)
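
As an illustration of the accuracy metric, preference accuracy for a scalar reward model is commonly computed as the fraction of human-labeled pairs in which the preferred sample receives the higher score. The sketch below assumes that convention and counts ties as errors; the benchmarks' exact protocols may differ, and the function name is hypothetical.

```python
def pairwise_accuracy(scored_pairs) -> float:
    """Fraction of pairs where the preferred sample outscores the rejected one.

    scored_pairs: iterable of (score_preferred, score_rejected) tuples.
    Ties count as incorrect (an assumption, not the benchmarks' stated rule).
    """
    pairs = list(scored_pairs)
    if not pairs:
        raise ValueError("need at least one scored pair")
    correct = sum(1 for preferred, rejected in pairs if preferred > rejected)
    return correct / len(pairs)
```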

Key Results

JRM significantly outperforms state-of-the-art baselines, including GPT-5, on standard reward modeling benchmarks.
Benchmark	Metric	Baseline (GPT-5)	This Paper	Δ
EditReward-Bench	Overall Accuracy (%)	75.5	85.1	+9.6
EditReward-Bench	Prompt Following Accuracy (%)	Not reported in the paper	85.4	Not reported in the paper
MMRB2	Composite Score (%)	61.9	69.3	+7.4

Representation analysis shows joint training prevents feature collapse.
Benchmark	Metric	Baseline	This Paper	Δ
Representation Space Analysis	Effective Feature Space Rank	46.86	91.77	+44.91

Downstream RL experiments demonstrate JRM guides generation models better than GPT-4.1.
Benchmark	Metric	Baseline (GPT-4.1 reward)	This Paper	Δ
GEdit-Bench	Performance Gain	0.45	1.00	+0.55
ImageEdit-Bench	Performance Gain	0.26	0.50	+0.24
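
The summary does not state how the effective feature space rank reported above is computed. One widely used definition is the exponential of the Shannon entropy of the normalized singular values; the sketch below uses that definition as an assumption, and takes the singular values as given (for example, from an SVD of a feature matrix).

```python
import math


def effective_rank(singular_values) -> float:
    """Entropy-based effective rank: exp(H(p)), with p_i = s_i / sum(s).

    Assumed definition; the paper may use a different one.
    """
    positive = [s for s in singular_values if s > 0]
    total = sum(positive)
    if total == 0:
        raise ValueError("need at least one positive singular value")
    probs = [s / total for s in positive]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)
```

Under this definition, a spectrum with k equal singular values has effective rank k, while a spectrum dominated by one value has effective rank close to 1. A higher value therefore indicates that variance is spread over more directions, which is the sense in which a jump from 46.86 to 91.77 points away from representation collapse.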

Experiment Figures

Singular Value Decomposition (SVD) analysis of the feature representations.

Performance trends as the language supervision weight (alpha) increases.

Main Takeaways

JRM effectively bridges the gap between efficient discriminative models and reasoning-heavy generative models.
Joint training with language supervision (alpha > 0) significantly improves ranking accuracy and representation isotropy.
The 'Latent Chain-of-Thought' mechanism works: removing the language head at inference does not lose the reasoning benefits acquired during training.
Downstream RL fine-tuning using JRM yields greater performance gains than using GPT-4.1, validating its robustness as a reward signal.
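
To show where a scalar reward model enters the downstream RL loop, the sketch below computes group-relative advantages in the style of GRPO: rewards for several candidate edits of the same prompt are normalized by the group mean and standard deviation. It is a generic illustration, not Flow-GRPO's actual code; the function name, the eps constant, and the use of the population standard deviation are assumptions.

```python
import statistics


def group_relative_advantages(rewards, eps: float = 1e-6):
    """Normalize reward-model scores within one sampled group.

    rewards: scores assigned to candidates generated for the same prompt.
    eps guards against division by zero when all scores are equal.
    """
    rewards = list(rewards)
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std (an assumption)
    return [(r - mean) / (std + eps) for r in rewards]
```

Because the reward model is queried once per candidate at every training step, a scorer that needs only one forward pass keeps this loop far cheaper than one that must generate an explanation for each candidate.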

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning from Human Feedback (RLHF)
Vision-Language Models (VLMs)
Chain-of-Thought (CoT) Reasoning

Key Terms

RLHF: Reinforcement Learning from Human Feedback—a method to align AI models using human preference data

CoT: Chain-of-Thought—a prompting technique where models generate intermediate reasoning steps to improve performance

Latent CoT: Internalized reasoning capabilities encoded within a model's high-dimensional representation without generating explicit text output

SVD: Singular Value Decomposition—a mathematical method used here to analyze the dimensionality and collapse of the model's feature space

GRPO: Group Relative Policy Optimization—a reinforcement learning algorithm used for aligning the generation model

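VIEScore: Visual Instruction-guided Explainable Score, a metric that uses a multimodal LLM to rate conditional image generation and editing outputs on aspects such as semantic consistency and perceptual quality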

Isotropy: The uniformity of the distribution of representations in the embedding space; higher isotropy often correlates with better representation quality