EditReward: A Human-Aligned Reward Model for Instruction-Guided Image Editing

📝 Paper Summary

Instruction-guided image editing
Reward modeling / Preference learning

EditReward is a multi-dimensional, uncertainty-aware reward model trained on 200K expert-annotated image editing pairs, enabling the selection of high-quality training data to improve open-source editing models.

Core Problem

Existing reward models for image editing (LPIPS, CLIP, general VLMs) align poorly with human preferences, failing to reliably distinguish high-quality edits for data filtering.

Why it matters:

Current open-source editing models lag behind closed-source ones (like GPT-Image-1) due to a lack of high-quality training data
Scaling up training data requires automatic filtering, but current rewards are too noisy or biased to serve as reliable filters
Crowd-sourced preference datasets often suffer from low inter-annotator agreement and inconsistency

Concrete Example: A tie in overall quality often masks trade-offs: one edit might follow the instruction perfectly but contain artifacts, while another looks clean but ignores the prompt. Standard models treat this as a generic tie, losing the per-dimension signal that the annotators provided.

Key Novelty

Multi-Dimensional Uncertainty-Aware Reward Modeling

Decomposes edit quality into 'Instruction Following' and 'Visual Quality', predicting Gaussian distributions for each to capture annotator uncertainty
Uses a novel tie-disentanglement strategy: splits 'tied' pairs into two conflicting training samples (A>B on dimension X, B>A on dimension Y) to learn granular trade-offs
Trains on a new dataset (EditReward-Data) of 200K expert-annotated pairs rather than noisy crowd-sourced labels
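
The tie-disentanglement idea can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the `AnnotatedPair` record, its field names, and the rule "emit one sample per dimension on which the two edits differ" are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class AnnotatedPair:
    """Hypothetical record: per-dimension ratings for two edits, A and B."""
    if_a: int  # Instruction Following rating of edit A
    vq_a: int  # Visual Quality rating of edit A
    if_b: int  # Instruction Following rating of edit B
    vq_b: int  # Visual Quality rating of edit B


def disentangle_tie(pair: AnnotatedPair) -> List[Tuple[str, str, str]]:
    """Split an overall tie into per-dimension preference samples.

    Returns (dimension, winner, loser) tuples; a dimension on which both
    edits received the same rating yields no sample.
    """
    samples = []
    for dim, a, b in (("IF", pair.if_a, pair.if_b), ("VQ", pair.vq_a, pair.vq_b)):
        if a > b:
            samples.append((dim, "A", "B"))
        elif b > a:
            samples.append((dim, "B", "A"))
    return samples


# Edit A follows the instruction but has artifacts; edit B looks clean
# but ignores the prompt. Summed ratings are equal, so the pair is a tie.
tied = AnnotatedPair(if_a=4, vq_a=2, if_b=2, vq_b=4)
samples = disentangle_tie(tied)
print(samples)
```

Here the single tied pair becomes two training samples with opposite winners, one per dimension, which is the conflicting signal described above.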

Architecture

Conceptual flow of the reward calculation: VLM backbone extracts features, which are fed into MLP heads predicting Gaussian distributions for two dimensions (IF and VQ).

Evaluation Highlights

Achieves 65.72% accuracy on GenAI-Bench, ahead of GPT-5 (59.61%) and GPT-4o
Scores 63.62% on AURORA-Bench, surpassing OpenAI-GPT-4o (50.81%) by a large margin
Fine-tuning Step1X-Edit on just the top 20K samples filtered by EditReward improves the GEdit-Bench Overall Score from 6.780 to 7.086, matching state-of-the-art Doubao-Edit

Breakthrough Assessment

9/10

Establishes a new SOTA for editing reward models, beating GPT-5. The release of 200K expert annotations and the methodology for disentangling ties/uncertainty are significant contributions to the field.

⚙️ Technical Details

Problem Definition

Setting: Reward modeling for instruction-guided image editing

Inputs: Source image I_s, Text prompt P, Edited image I_e

Outputs: Scalar reward score s reflecting human preference

Pipeline Flow

Input Processing (tri-modal input)
Feature Extraction (VLM backbone)
Reward Prediction (Multi-head MLP)
Aggregation (Final Score)

System Modules

Multimodal Backbone

Extracts latent representation of the edit quality from source image, prompt, and edited image

Model or implementation: Qwen2.5-VL-7B or MiMo-VL-7B

Reward Head (MLP)

Projects latent features into Gaussian parameters for two dimensions: Instruction Following and Visual Quality

Model or implementation: Multi-Layer Perceptron

Aggregator

Combines dimensional distributions into a single preference probability or score

Model or implementation: Weighted Sum / Integration
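
One possible reading of the "Weighted Sum" aggregation is sketched below. The summary does not give the exact rule, so this assumes the two per-dimension rewards are independent Gaussians combined with fixed weights; the `Gaussian` type, the function name, and the default weights of 0.5 are all hypothetical. Under that assumption the weighted sum is again Gaussian, with mean `w_if*mu_if + w_vq*mu_vq` and variance `w_if^2*sigma_if^2 + w_vq^2*sigma_vq^2`.

```python
import math
from typing import NamedTuple


class Gaussian(NamedTuple):
    mean: float
    std: float


def aggregate(if_reward: Gaussian, vq_reward: Gaussian,
              w_if: float = 0.5, w_vq: float = 0.5) -> Gaussian:
    """Weighted sum of two independent Gaussian rewards (weights are assumed)."""
    mean = w_if * if_reward.mean + w_vq * vq_reward.mean
    # Variances of independent terms add after scaling by the squared weights.
    var = (w_if * if_reward.std) ** 2 + (w_vq * vq_reward.std) ** 2
    return Gaussian(mean, math.sqrt(var))


overall = aggregate(Gaussian(2.0, 0.6), Gaussian(1.0, 0.8))
print(overall)
```

Keeping the result as a distribution rather than collapsing it to the mean lets the downstream preference probability reflect how uncertain each dimension was.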

Novel Architectural Elements

Multi-dimensional output head predicting separate Gaussian distributions for Instruction Following and Visual Quality
Tie-disentanglement mechanism in the data pipeline that splits tied pairs into conflicting preference samples based on dimensional strengths

Modeling

Base Model: Qwen2.5-VL-7B and MiMo-VL-7B

Training Method: Supervised Fine-Tuning with Ranking Loss

Objective Functions:

Purpose: Optimize the model to rank images according to human preference while accounting for uncertainty.

Formally: Minimize negative log-likelihood of preference P(I_h > I_l) computed by integrating over the aggregated Gaussian reward distributions.
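
As a sketch of this objective: if the aggregated rewards of the preferred edit I_h and the rejected edit I_l are treated as independent Gaussians, the probability that a reward drawn for I_h exceeds one drawn for I_l has a closed form, Phi((mu_h - mu_l) / sqrt(sigma_h^2 + sigma_l^2)), where Phi is the standard normal CDF. The paper's exact integrand may differ (for example, it could pass the reward difference through a sigmoid before integrating), so the functions below are an assumption-laden illustration rather than the authors' loss.

```python
import math


def preference_prob(mu_h: float, sigma_h: float,
                    mu_l: float, sigma_l: float) -> float:
    """P(r_h > r_l) for independent r_h ~ N(mu_h, sigma_h^2), r_l ~ N(mu_l, sigma_l^2)."""
    z = (mu_h - mu_l) / math.sqrt(sigma_h ** 2 + sigma_l ** 2)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # standard normal CDF


def preference_nll(mu_h: float, sigma_h: float,
                   mu_l: float, sigma_l: float, eps: float = 1e-12) -> float:
    """Negative log-likelihood of the human label 'I_h preferred over I_l'."""
    return -math.log(max(preference_prob(mu_h, sigma_h, mu_l, sigma_l), eps))
```

With equal means the probability is 0.5 regardless of the variances, and for a fixed mean gap, larger predicted variances pull the probability back toward 0.5, which is how annotator uncertainty softens the ranking signal.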

Training Data:

EditReward-Data: 200K manually annotated preference pairs
Source images from GEdit, ImgEdit, MagicBrush, AnyEdit, EmuEdit
Annotated on 4-point Likert scale for IF and VQ

Key Hyperparameters:

learning_rate: 2e-6
batch_size: 16 (effective)
epochs: 2
warmup_ratio: 0.05
schedule: cosine
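
The learning-rate settings can be read as a warmup-then-cosine schedule, sketched below. The summary gives only the peak rate, warmup ratio, and the word "cosine"; the linear warmup shape, decay to zero, and the `total_steps` argument are assumptions for illustration.

```python
import math

PEAK_LR = 2e-6       # learning_rate from the list above
WARMUP_RATIO = 0.05  # warmup_ratio from the list above


def lr_at(step: int, total_steps: int) -> float:
    """Learning rate at a 0-indexed optimizer step (linear warmup, cosine decay to 0)."""
    warmup_steps = max(1, int(WARMUP_RATIO * total_steps))
    if step < warmup_steps:
        # Ramp linearly so the last warmup step reaches the peak rate.
        return PEAK_LR * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * PEAK_LR * (1.0 + math.cos(math.pi * progress))
```

For a hypothetical run of 1,000 steps, the first 50 steps ramp up to 2e-6 and the remaining 950 follow half a cosine period down toward zero.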

Compute: Cluster of 8 NVIDIA A800 GPUs. Scoring 46K samples took 2.61 GPU hours (0.25 seconds/sample).

Comparison to Prior Work

vs. HPSv3: EditReward extends uncertainty modeling to multi-dimensional outputs (IF and VQ) specific to editing, whereas HPSv3 uses a single holistic score.
vs. GPT-4o/GPT-5: EditReward is a specialized smaller model (7B) fine-tuned on expert data, outperforming these larger proprietary models on editing benchmarks.
vs. Qwen2.5-VL-7B (Base): Fine-tuning improves GenAI-Bench accuracy by about 23.5 points (40.48% -> 63.97%).

Limitations

Relies on the quality of the VLM backbone feature extraction
Trained specifically for image editing; generalization to generation tasks not explored
No statistical significance tests reported for benchmark improvements

Reproducibility

EditReward-Data (200K pairs) and the trained EditReward model will be released. Code URL not provided in text. Hyperparameters are detailed. Datasets used for evaluation are public benchmarks.

📊 Experiments & Results

Evaluation Setup

Evaluated as a reward model (judge) predicting human preferences on held-out benchmarks

Benchmarks:

GenAI-Bench (Pairwise preference prediction)
AURORA-Bench (Pairwise preference prediction)
ImagenHub (Point-wise scoring correlation)
EditReward-Bench (Multi-way preference ranking, ternary/quaternary) [New]

Metrics:

Prediction Accuracy
Spearman Correlation
Overall Score (G_O)
Statistical methodology: Krippendorff’s Alpha reported for inter-annotator agreement. No significance tests reported for model comparisons.

Key Results

EditReward demonstrates state-of-the-art performance across multiple public benchmarks, outperforming both open-source and proprietary models.
Benchmark	Metric	Baseline	This Paper	Δ
GenAI-Bench	Accuracy	59.61	65.72	+6.11
AURORA-Bench	Accuracy	50.81	63.62	+12.81
EditReward-Bench	Accuracy	38.02	38.42	+0.40
ImagenHub	Spearman Correlation	36.56	36.18	-0.38

Data filtering experiments show that training on a high-quality subset selected by EditReward outperforms training on the full noisy dataset.
Benchmark	Metric	Baseline	This Paper	Δ
GEdit-Bench	Overall Score (G_O)	6.780	7.086	+0.306

Main Takeaways

Expert-annotated data is superior: Fine-tuning a 7B model on high-quality data beats GPT-5 on alignment benchmarks.
Multi-dimensional uncertainty modeling works: The disentangled reward heads effectively capture the nuanced trade-off between instruction following and visual quality.
Effective Data Filter: EditReward can filter noisy synthetic datasets (ShareGPT-4o-Image), enabling smaller subsets (20K) to train better models than the full dataset (46K).
Visual Quality is more subjective: Inter-annotator agreement (Krippendorff’s Alpha) is higher for Instruction Following (0.668) than for Visual Quality (0.597), justifying the multi-head approach.
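
The data-filtering use case reduces to scoring every candidate sample and keeping the highest-scoring subset. A minimal sketch follows; `score_fn` is a hypothetical stand-in for a call to the reward model, and the toy records and their `reward` field are invented for the example.

```python
import heapq
from typing import Callable, List, TypeVar

T = TypeVar("T")


def select_top_k(samples: List[T], score_fn: Callable[[T], float], k: int) -> List[T]:
    """Keep the k samples with the highest reward, best first."""
    return heapq.nlargest(k, samples, key=score_fn)


# Toy stand-in: each sample already carries a precomputed reward.
toy = [{"id": i, "reward": r} for i, r in enumerate([0.2, 0.9, 0.5, 0.7, 0.1])]
kept = select_top_k(toy, lambda s: s["reward"], k=2)
print([s["id"] for s in kept])
```

In the setting described above, `k` would be 20,000 and `samples` the roughly 46K-sample ShareGPT-4o-Image set.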

📚 Prerequisite Knowledge

Prerequisites

Vision-Language Models (VLMs) as feature extractors
Reinforcement Learning from Human Feedback (RLHF) concepts
Gaussian distributions and probability density functions

Key Terms

VLM: Vision-Language Model—a model that can process and reason about both images and text

LPIPS: Learned Perceptual Image Patch Similarity—a metric measuring perceptual similarity between two images

CLIP: Contrastive Language-Image Pre-training—a model aligned to match images with text descriptions, often used for semantic scoring

Likert scale: A rating scale used in questionnaires (e.g., 1 to 4) to quantify subjective opinions

Krippendorff’s Alpha: A statistical measure of the agreement (reliability) among annotators

SFT: Supervised Fine-Tuning—training a model on a labeled dataset

OOD: Out-of-Distribution—data that differs significantly from the data seen during training

Instruction Following (IF): How accurately the edited image reflects the text instructions

Visual Quality (VQ): How realistic and artifact-free the edited image looks

HPSv3: Human Preference Score v3—a prior text-to-image reward model that introduced uncertainty-aware ranking