EditScore: Unlocking Online RL for Image Editing via High-Fidelity Reward Modeling

📝 Paper Summary

Instruction-guided image editing
Reinforcement Learning (RL) for vision
Reward modeling for image generation

EditScore is a specialized high-fidelity reward model for image editing that enables effective online reinforcement learning by surpassing general-purpose VLMs in evaluating instruction adherence and quality.

Core Problem

Applying Reinforcement Learning to image editing fails because current reward signals are either too expensive (proprietary VLMs) or inaccurate (open-source VLMs), leading to unstable training or policy collapse.

Why it matters:

Current image editing models struggle with complex instructions and often require multiple trial-and-error attempts to get good results.
RL has successfully improved text-to-image generation (e.g., for flow-matching models), but editing lacks the reliable reward oracle needed for similar progress.
Even large open-source models like Qwen2.5-VL-72B fail to provide consistent reward signals, stalling open research in RL-based editing.

Concrete Example: When a general-purpose VLM serves as the RL reward function, the policy often collapses or learns to game the reward. The VLM cannot reliably tell a subtle but correct edit (e.g., 'change background to snowy') from a high-quality image that does not follow the instruction. EditScore, by contrast, is reported to identify such fine-grained errors correctly.

Key Novelty

Specialized Generative Reward Model with Self-Ensembling (EditScore)

Fine-tunes a VLM (Qwen2.5-VL) specifically to evaluate image edits by generating both reasoning and scores for Semantic Consistency and Perceptual Quality.
Uses an inference-time ensembling strategy where the model generates multiple reasoning paths and scores, averaging them to produce a lower-variance, higher-fidelity reward signal.
Establishes a rigorous benchmark (EditReward-Bench) to validate reward model performance against human expert judgments before using it for RL.

Evaluation Highlights

EditScore-72B achieves 86.36% accuracy on EditReward-Bench, surpassing GPT-4o (84.41%) and GPT-5 (85.29%).
Using EditScore as the reward for online RL training improves the OmniGen2 base model's editing success rate by +14.6% on GEdit-Bench.
In Best-of-N selection, EditScore improves the performance of diverse editors (e.g., Qwen-Image-Edit) by picking better outputs than random selection.

Breakthrough Assessment

9/10

Significantly advances RL for image editing by addressing its primary bottleneck: the lack of a reliable open-source reward model. Outperforming GPT-5 on the benchmark is a major claim.

⚙️ Technical Details

Problem Definition

Setting: Instruction-guided image editing evaluation and policy optimization via Reinforcement Learning.

Inputs: Triplet of (Instruction, Input Image, Output Image)

Outputs: Scalar reward score reflecting edit quality and instruction adherence

Pipeline Flow

Input Processing (Instruction + Images)
Generative Scoring (EditScore)
Ensemble Aggregation

System Modules

Input Processor

Formats the input triplet for the VLM

Model or implementation: Qwen2.5-VL tokenizer

EditScore Generator

Generates reasoning and scalar scores for SC and PQ

Model or implementation: Fine-tuned Qwen2.5-VL (7B or 72B)

Score Aggregator

Combines SC and PQ scores from multiple runs into a final reward

Model or implementation: Arithmetic Mean Aggregation

Novel Architectural Elements

Inference-time self-ensembling for reward stability: The reward model is explicitly designed to be run K times with stochastic decoding and then averaged, treating the variance across reasoning paths as a source of robustness rather than a defect.
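The self-ensembling step can be illustrated with a minimal Python sketch. Everything named here is an assumption for illustration: `score_once` is a hypothetical callable standing in for one stochastic pass of the reward model that returns an (SC, PQ) pair, the default `k=4` is arbitrary (the summary does not report K), and the equal-weight combination of SC and PQ is a guess rather than the paper's confirmed rule.

```python
import random
import statistics
from typing import Callable, Tuple

# Hypothetical scorer type: one stochastic pass returns (SC, PQ) scores.
Scorer = Callable[[str, str, str], Tuple[float, float]]


def self_ensemble_reward(score_once: Scorer, instruction: str,
                         input_image: str, output_image: str,
                         k: int = 4) -> float:
    """Average SC and PQ over k stochastic passes, then combine them."""
    if k < 1:
        raise ValueError("k must be at least 1")
    runs = [score_once(instruction, input_image, output_image)
            for _ in range(k)]
    sc_mean = statistics.fmean([sc for sc, _ in runs])
    pq_mean = statistics.fmean([pq for _, pq in runs])
    # Assumption: equal-weight arithmetic mean of the two dimensions.
    return (sc_mean + pq_mean) / 2.0


def make_noisy_scorer(seed: int) -> Scorer:
    """Stand-in for a sampled VLM pass: fixed scores plus Gaussian noise."""
    rng = random.Random(seed)

    def score_once(instruction: str, input_image: str,
                   output_image: str) -> Tuple[float, float]:
        return 8.0 + rng.gauss(0.0, 1.0), 6.0 + rng.gauss(0.0, 1.0)

    return score_once


# Usage: a larger k costs k model calls but gives a steadier reward.
reward = self_ensemble_reward(make_noisy_scorer(0), "make it snowy",
                              "in.png", "out.png", k=8)
```

In a real pipeline the stand-in scorer would be replaced by a sampled generation from the fine-tuned VLM followed by parsing of the two scores.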

Modeling

Base Model: Qwen2.5-VL-7B and Qwen2.5-VL-72B

Training Method: Supervised Fine-Tuning (SFT) on reasoning-score pairs

Objective Functions:

Purpose: Train the model to generate correct reasoning and scores.

Formally: Standard autoregressive language modeling loss on the target tokens (reasoning + scores).
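For reference, the standard autoregressive loss can be written out as below. The notation is assumed here and is not taken from the paper: x denotes the input triplet, y the target token sequence (reasoning followed by scores), D the training set, and θ the model parameters.

```latex
\mathcal{L}_{\text{SFT}}(\theta)
  = -\,\mathbb{E}_{(x, y) \sim \mathcal{D}}
    \left[ \sum_{t=1}^{|y|} \log p_\theta\!\left(y_t \mid x, y_{<t}\right) \right]
```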

Adaptation: Full fine-tuning

Training Data:

70,000 samples for reward model training
Images selected via K-center greedy algorithm for diversity
Outputs generated by 5 random editing models
Scored by GPT-4.1 with filtering for low-discriminability cases
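The K-center greedy selection mentioned in the list above can be sketched as follows. This is the generic farthest-point heuristic, not the authors' code: the feature vectors, the Euclidean distance, and the choice of starting index are all assumptions made for illustration.

```python
import math
from typing import List, Sequence


def k_center_greedy(features: Sequence[Sequence[float]], k: int,
                    first: int = 0) -> List[int]:
    """Greedily pick k indices, each farthest from the ones already chosen."""
    n = len(features)
    if not 0 < k <= n:
        raise ValueError("k must be between 1 and the number of points")
    selected = [first]
    # Distance from every point to its nearest selected center so far.
    min_dist = [math.dist(f, features[first]) for f in features]
    while len(selected) < k:
        # The next center is the point least covered by current centers.
        # If fewer than k distinct points exist, indices may repeat.
        nxt = max(range(n), key=min_dist.__getitem__)
        selected.append(nxt)
        for i, f in enumerate(features):
            min_dist[i] = min(min_dist[i], math.dist(f, features[nxt]))
    return selected


# Usage with toy 2-D "embeddings"; real inputs would be image features.
points = [(0.0, 0.0), (0.1, 0.0), (10.0, 0.0), (5.0, 0.0)]
chosen = k_center_greedy(points, k=3)
```

On the toy points, the near-duplicate at index 1 is skipped in favor of the two far-away points, which is the diversity effect the selection step is after.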

Key Hyperparameters:

ensemble_size_K: Not explicitly reported in the paper (implied parameter for inference)
RL_algorithm: PPO (for the downstream policy optimization)

Compute: EditScore models are 7B and 72B parameters. Inference cost scales linearly with ensemble size K.

Comparison to Prior Work

vs. VIEScore: Open-source, significantly faster/cheaper, capable of online RL integration
vs. PickScore: Specialized for editing (checking source image consistency) rather than just caption alignment
vs. Qwen2.5-VL-72B (Base): Fine-tuned specifically for editing evaluation, significantly reducing variance and hallucination in scoring
vs. GPT-4o: Higher accuracy on EditReward-Bench via specialized training and self-ensembling

Limitations

Computational cost of inference-time ensembling (running the 72B model K times is expensive)
Dependence on synthetic training labels generated by GPT-4.1 (distillation bias)
RL training experiments limited to OmniGen2; generalization to other architectures not fully explored

Reproducibility

The authors state that code, models, data, and the benchmark will be released publicly. The repository links in the current paper are placeholders or missing. Planned artifacts include the EditScore model weights, the EditReward-Bench dataset, and the RL training framework.

📊 Experiments & Results

Evaluation Setup

Evaluation of reward model accuracy against human judgments and utility in downstream RL tasks.

Benchmarks:

EditReward-Bench (Reward Model Evaluation) [New]
GEdit-Bench (Image Editing Quality)

Metrics:

Preference Prediction Accuracy (human alignment)
Editing Success Rate (for RL policy)
CLIP Score / LPIPS (traditional metrics)
Statistical methodology: Two-annotator discussion protocol for ground truth creation to ensure high agreement.
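Preference prediction accuracy can be illustrated with a short sketch. It assumes a pairwise protocol in which each benchmark item holds the reward model's score for the human-preferred edit and for the rejected edit, and it counts ties as errors. Both choices are assumptions; the exact protocol of EditReward-Bench is not spelled out in this summary.

```python
from typing import Iterable, Tuple


def preference_accuracy(pairs: Iterable[Tuple[float, float]]) -> float:
    """Fraction of pairs where the human-preferred edit gets the higher score.

    Each pair is (score of preferred edit, score of rejected edit).
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("at least one pair is required")
    # Assumption: a tie counts as a wrong prediction.
    correct = sum(1 for preferred, rejected in pairs if preferred > rejected)
    return correct / len(pairs)


# Usage with made-up scores: two of four pairs are ranked correctly.
accuracy = preference_accuracy([(8.0, 5.0), (6.0, 7.0), (9.0, 2.0), (4.0, 4.0)])
```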

Key Results

EditScore outperforms both open-source and proprietary baselines on the newly proposed EditReward-Bench.
Using EditScore for Reinforcement Learning significantly improves the editing capabilities of the OmniGen2 model.

Benchmark	Metric	Baseline	This Paper	Δ
EditReward-Bench	Accuracy (%)	84.41 (GPT-4o)	86.36	+1.95
EditReward-Bench	Accuracy (%)	72.43	86.36	+13.93
GEdit-Bench	Success Rate	Not explicitly reported in the paper	Not explicitly reported in the paper	+14.6

Experiment Figures

Comparison of EditScore accuracy vs. other VLMs (GPT-4o, Gemini, etc.) on EditReward-Bench.

Main Takeaways

High-fidelity reward modeling is the unlock for RL in image editing; general VLMs are too noisy.
Inference-time ensembling (Self-Ensemble) provides a consistent performance boost, allowing open-source models to beat proprietary giants like GPT-5.
The proposed EditReward-Bench provides a much-needed standardized evaluation for this domain, covering diverse tasks from simple attribute changes to complex reasoning.
RL training with EditScore improves not just prompt following but also the preservation of unedited regions (consistency).

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (RL) basics (policy, reward, online vs. offline)
Visual Language Models (VLMs) and their fine-tuning
Image editing concepts (instruction following, consistency)
Chain-of-thought reasoning

Key Terms

VLM: Visual Language Model, a model that takes images and text as input and generates text.

PPO: Proximal Policy Optimization, an RL algorithm used to train the editing policy using the reward model's signal.

Best-of-N: A selection strategy where a model generates N candidates, and a reward model picks the best one.
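A minimal sketch of Best-of-N selection follows. The `generate` and `reward` callables are hypothetical placeholders for an image editor and a reward model; here they are exercised with plain numbers.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def best_of_n(generate: Callable[[], T], reward: Callable[[T], float],
              n: int) -> T:
    """Draw n candidates and return the one with the highest reward."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [generate() for _ in range(n)]
    return max(candidates, key=reward)


# Usage with toy candidates standing in for edited images.
toy_outputs = iter([3, 9, 4])
best = best_of_n(lambda: next(toy_outputs), float, n=3)
```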

Chain-of-thought: A prompting technique where the model generates intermediate reasoning steps before the final answer.

Semantic Consistency (SC): A metric evaluating whether the edit follows the instruction and preserves unedited regions.

Perceptual Quality (PQ): A metric evaluating the photorealism and lack of artifacts in the image.

Self-ensemble: Running the model multiple times on the same input with stochastic sampling and averaging the results to reduce variance.
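The variance reduction can be made concrete under an idealized assumption that is not stated in the paper: if the K sampled scores r_1, ..., r_K are independent and identically distributed with variance σ², then their mean has variance σ²/K, at the price of K forward passes.

```latex
\bar{r}_K = \frac{1}{K} \sum_{k=1}^{K} r_k,
\qquad
\operatorname{Var}\!\left(\bar{r}_K\right) = \frac{\sigma^2}{K}
```

In practice, samples drawn from the same model on the same input share systematic bias, so averaging reduces sampling noise but not bias.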

Online RL: Reinforcement learning where the model actively interacts with the environment (generation) and updates based on fresh feedback, as opposed to learning from a static dataset.