Multi-Reward as Condition for Instruction-based Image Editing

📝 Paper Summary

Instruction-based image editing
Reward modeling for diffusion models
Automated data curation

The paper improves image editing models by conditioning them on multi-perspective rewards (instruction following, detail preserving, generation quality) derived from GPT-4o, rather than filtering or refining the noisy training data directly.

Core Problem

Predominant image editing datasets (like InsPix2Pix) are generated by models not designed for editing, leading to noisy triplets with inaccurate instruction following, poor detail preservation, and generation artifacts.

Why it matters:

Noisy training data limits the performance of state-of-the-art editing models, causing them to hallucinate changes or fail to execute instructions.
Existing methods that filter data or rely on human feedback (like MagicBrush) are hard to scale or do not distinguish specific failure modes (instruction following vs. detail preservation).
Standard reward mechanisms (like single-score rewards) are insufficient because image quality is multi-dimensional and CLIP encoders are insensitive to scalar reward values.

Concrete Example: In the InsPix2Pix dataset, an instruction 'make the glasses green' might result in a ground-truth image where the glasses remain unchanged but the background color shifts. Training on this teaches the model to ignore instructions and hallucinate background edits.

Key Novelty

Multi-Reward Condition (MRC) Framework

Instead of discarding noisy data, the model is trained with explicit 'quality labels' (rewards) as auxiliary inputs, teaching it to distinguish between good and bad editing behaviors.
Rewards are decomposed into three distinct perspectives: instruction following, detail preserving, and generation quality, each with a scalar score and text feedback.
At inference time, the model is guided to generate high-quality edits by manually setting these reward condition inputs to their maximum possible values (scores of 5).
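The difference between the training-time and inference-time reward condition can be sketched as follows. This is a minimal illustration, not the authors' code: the `RewardCondition` class, the function names, the dictionary layout, and the example scores and feedback strings are all hypothetical. Only the three perspectives, the score-plus-feedback pairing, and the "set everything to 5" inference rule come from the summary above.

```python
from dataclasses import dataclass

PERSPECTIVES = ("following", "preserving", "quality")
MAX_SCORE = 5.0  # top of the reward scale; used as the inference-time target


@dataclass(frozen=True)
class RewardCondition:
    score: float   # scalar reward for one perspective
    feedback: str  # free-text feedback; may be empty


def training_condition(annotation):
    """Condition on the annotated rewards, including low ones, during training."""
    return {
        p: RewardCondition(float(annotation[p]["score"]), annotation[p]["feedback"])
        for p in PERSPECTIVES
    }


def inference_condition():
    """Request the best behavior: maximum score and empty feedback everywhere."""
    return {p: RewardCondition(MAX_SCORE, "") for p in PERSPECTIVES}


# Illustrative annotation for a noisy triplet (values are made up, not from the dataset).
noisy = {
    "following": {"score": 1, "feedback": "The glasses were not recolored."},
    "preserving": {"score": 2, "feedback": "The background color changed."},
    "quality": {"score": 4, "feedback": "No visible artifacts."},
}
train_cond = training_condition(noisy)
test_cond = inference_condition()
```

The point of the design is that a bad triplet is kept but labeled as bad, so the model can associate low scores with the corresponding failure and high scores with the desired behavior.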

Architecture

The Multi-Reward Condition (MRC) framework integrated into a Stable Diffusion editing pipeline.

Evaluation Highlights

+9.4 percentage-point improvement in Instruction Following accuracy (47.7 → 57.1) on the Real-Edit benchmark when adding Multi-Reward to InsPix2Pix (GPT-4o evaluation).
Outperforms HIVE (a feedback-based baseline) by +6.2% in Instruction Following and +0.33 in Detail Preserving score on Real-Edit.
Achieves state-of-the-art results across all three metrics (Following, Preserving, Quality) in human evaluation, surpassing SmartEdit and HIVE.

Breakthrough Assessment

7/10

A clever, pragmatic solution to the noisy data problem. Instead of expensive cleaning, it effectively 're-labels' noise as low-reward examples. Strong empirical results, though the core architectural change is modest: an auxiliary conditioning branch.

⚙️ Technical Details

Problem Definition

Setting: Text-instruction-based image editing using diffusion models

Inputs: Original image x, text instruction t, and desired reward conditions (scores + feedback)

Outputs: Edited image y

Pipeline Flow

Input Processing: Image x and Instruction t encoded via VAE/Text Encoder
Multi-Reward Condition (MRC): Target scores (e.g., 5.0) and empty feedback text encoded into reward embeddings
Reward Integration: Reward embeddings injected into Latent Noise (via Attention) and U-Net blocks (via Addition)
Denoising: Diffusion model generates edited latent
Output Generation: VAE Decoder produces final image y

System Modules

MRC Module (Multi-Reward Condition)

Encodes separate reward signals (score + text) into a unified conditioning vector

Model or implementation: MLP for scores + Stable Diffusion Text Encoder for text feedback

Reward Encoder (Integration)

Integrates reward condition into the latent noise via cross-attention

Model or implementation: 11 Standard Transformer Encoder Blocks

U-Net Integrator (Integration)

Injects reward condition directly into U-Net intermediate layers

Model or implementation: Linear Projection Layers
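The two integration modules above can be illustrated with a toy sketch in plain Python. It makes several simplifying assumptions that are not from the paper: a single attention head, no learned query/key/value projections, reward tokens with the same width as the latent tokens, and a single pooled reward vector for the additive path. The function names are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(latent_tokens, reward_tokens):
    """Path 1: each latent token (query) attends over the reward tokens
    (used as both keys and values), with a residual connection."""
    d = len(reward_tokens[0])
    out = []
    for q in latent_tokens:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in reward_tokens]
        weights = softmax(scores)
        ctx = [sum(w * v[j] for w, v in zip(weights, reward_tokens)) for j in range(d)]
        out.append([qi + ci for qi, ci in zip(q, ctx)])
    return out


def add_projected(features, reward_vec, weight):
    """Path 2: linearly project the reward vector to the channel width and
    add it to the feature vector at every spatial position."""
    bias = [sum(w * r for w, r in zip(row, reward_vec)) for row in weight]
    return [[f + b for f, b in zip(pos, bias)] for pos in features]


latent = cross_attend([[1.0, 0.0]], [[0.5, 0.5]])
unet_features = add_projected([[1.0, 2.0]], [1.0, 1.0], [[1.0, 0.0], [0.0, 2.0]])
```

With a single reward token the attention weight is exactly 1, so `latent` is the query plus that token, and `unet_features` is the input shifted by the projected reward vector.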

Novel Architectural Elements

Dual-injection strategy: Reward embeddings are fed into BOTH the latent noise (via Transformer Encoder) and the U-Net blocks (via Linear Projection add-on)
Decomposed reward representation: Distinct encoding paths for quantitative scores (Positional Encoding + MLP) and qualitative feedback (Text Encoder) concatenated together
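The decomposed reward representation can be sketched as below. The summary only states that scores go through positional encoding plus an MLP and that text feedback goes through the text encoder before concatenation; the sinusoidal form of the encoding, the layer sizes, the ReLU activation, and concatenation along the token axis are assumptions made for illustration. The feedback embeddings are hard-coded stand-ins for text-encoder output, and the weights are random rather than trained.

```python
import math
import random


def sinusoidal_encoding(score, dim=8):
    """Map a scalar score to `dim` sin/cos features (dim must be even)."""
    half = dim // 2
    freqs = [1.0 / (10000.0 ** (i / half)) for i in range(half)]
    return [math.sin(score * f) for f in freqs] + [math.cos(score * f) for f in freqs]


def init_layer(n_in, n_out, rng):
    """Random weights standing in for trained parameters."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def linear(x, layer):
    weights, bias = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def score_token(score, layer1, layer2):
    """Positional encoding followed by a two-layer MLP with ReLU."""
    hidden = [max(0.0, h) for h in linear(sinusoidal_encoding(score), layer1)]
    return linear(hidden, layer2)


def encode_perspective(score, feedback_tokens, layer1, layer2):
    """Prepend the score token to the feedback tokens (sequence-axis concatenation)."""
    return [score_token(score, layer1, layer2)] + feedback_tokens


rng = random.Random(0)
EMBED_DIM = 4  # toy width; a real text encoder is far wider
layer1, layer2 = init_layer(8, 16, rng), init_layer(16, EMBED_DIM, rng)
feedback = [[0.0] * EMBED_DIM, [1.0] * EMBED_DIM]  # stand-in for text-encoder output
low = encode_perspective(1.0, feedback, layer1, layer2)
high = encode_perspective(5.0, feedback, layer1, layer2)
```

Giving the score its own numeric path is consistent with the motivation stated earlier: a text encoder such as CLIP is reported to be insensitive to scalar reward values, so writing "5" into the prompt would carry little signal.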

Modeling

Base Model: Stable Diffusion (v1.5) as the backbone, initialized from InsPix2Pix checkpoint

Training Method: Supervised Fine-Tuning with Auxiliary Reward Conditioning

Objective Functions:

Purpose: Minimize noise prediction error conditioned on image, text, and rewards.

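Formally: L = E[|| ε - ε_θ(z_t, t, c_I, c_T, c_R) ||^2], where ε is the sampled Gaussian noise, z_t is the noisy latent at diffusion timestep t (not the text instruction t from the problem definition), and c_I, c_T, c_R are the image, text, and reward conditions.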
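A single-sample estimate of this objective can be written out in a few lines. The sketch below is illustrative only: it uses the standard forward-diffusion step z_t = sqrt(ᾱ_t) z_0 + sqrt(1 - ᾱ_t) ε, which the summary does not spell out, and a toy denoiser that ignores its conditions. The names `add_noise`, `noise_prediction_loss`, and `zero_predictor` are hypothetical.

```python
import math
import random


def add_noise(z0, eps, alpha_bar_t):
    """Forward diffusion: z_t = sqrt(a) * z_0 + sqrt(1 - a) * eps."""
    a = math.sqrt(alpha_bar_t)
    b = math.sqrt(1.0 - alpha_bar_t)
    return [a * z + b * e for z, e in zip(z0, eps)]


def noise_prediction_loss(eps_theta, z0, t, alpha_bar_t, c_I, c_T, c_R, rng):
    """Single-sample estimate of || eps - eps_theta(z_t, t, c_I, c_T, c_R) ||^2."""
    eps = [rng.gauss(0.0, 1.0) for _ in z0]
    z_t = add_noise(z0, eps, alpha_bar_t)
    eps_pred = eps_theta(z_t, t, c_I, c_T, c_R)
    return sum((e - p) ** 2 for e, p in zip(eps, eps_pred))


def zero_predictor(z_t, t, c_I, c_T, c_R):
    """Toy denoiser that ignores every condition and predicts zero noise."""
    return [0.0 for _ in z_t]


z0 = [0.2, -0.1, 0.4]
loss = noise_prediction_loss(
    zero_predictor, z0, t=500, alpha_bar_t=0.5,
    c_I=None, c_T=None, c_R=None, rng=random.Random(0),
)
```

In actual training, `eps_theta` would be the U-Net with the reward modules attached, and the expectation would be approximated by averaging this quantity over a batch of triplets, timesteps, and noise samples.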

Adaptation: Full fine-tuning of U-Net with added reward modules

Trainable Parameters: U-Net + Reward Encoder + Projection Layers

Training Data:

RewardEdit-20K: 20,000 triplets from InsPix2Pix dataset annotated with GPT-4o scores (0-5) and text feedback

Key Hyperparameters:

learning_rate: 1e-5
batch_size: 256
image_resolution: 256x256
steps: 15,000 to 20,000

Compute: 8 NVIDIA A100 GPUs for roughly 16 hours
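As a rough sanity check on the scale of training, the listed hyperparameters can be combined as below. This is back-of-the-envelope arithmetic under two assumptions not confirmed by the summary: that 256 is the global batch size, and that fine-tuning draws only from the 20,000 RewardEdit-20K triplets.

```python
steps_range = (15_000, 20_000)
batch_size = 256          # assumed to be the global batch size
dataset_size = 20_000     # RewardEdit-20K
gpus, hours = 8, 16       # "roughly 16 hours" on 8 A100s

# Number of passes over the dataset implied by each end of the step range.
passes = {steps: steps * batch_size / dataset_size for steps in steps_range}
gpu_hours = gpus * hours

for steps, n in passes.items():
    print(f"{steps:,} steps x {batch_size} = {steps * batch_size:,} samples, about {n:.0f} passes")
print(f"about {gpu_hours} A100 GPU-hours")
```

Under these assumptions the run corresponds to roughly 192 to 256 passes over the data and about 128 A100 GPU-hours.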

Comparison to Prior Work

vs. InsPix2Pix: Adds multi-perspective reward conditioning to handle noisy data
vs. HIVE: Decomposes reward into 3 aspects (following, preserving, quality) and includes text feedback, whereas HIVE uses a single scalar score
vs. SmartEdit: Complementary rather than competing; the Multi-Reward module plugs into SmartEdit as an add-on and improves it in the compatibility experiments

Limitations

Relies on GPT-4o for reward annotation, which incurs API costs and inherits potential biases of the VLM.
Inference requires manually setting 'ideal' reward scores (e.g., 5), which is a heuristic control knob.
The approach increases model parameter count due to the additional reward encoder and projection layers.
Does not fix the underlying data quality, only teaches the model to navigate it.

Reproducibility

Code: https://github.com/bytedance/Multi-Reward-Editing

Code is publicly available. The RewardEdit-20K dataset (20k samples) is constructed using GPT-4o. Training uses 8 A100s. The paper details prompts used for GPT-4o annotation in the appendix/figures.

📊 Experiments & Results

Evaluation Setup

Editing real-world photos based on diverse text instructions

Benchmarks:

Real-Edit (Real-world Image Editing), a new benchmark introduced in this paper
MagicBrush Test Set (Local/Mask-based Editing)

Metrics:

Instruction Following (Accuracy/Score)
Detail Preserving (Score 0-5)
Generation Quality (Score 0-5)
L1/CLIP-I/DINO (Pixel/Feature distances)
Statistical methodology: Not explicitly reported in the paper
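The distance-based metrics listed above can be sketched in a few lines. L1 is a pixel-level distance (lower is better), while CLIP-I and DINO scores are typically computed as the cosine similarity between image embeddings from the respective encoder (higher is better). The helpers below are illustrative and operate on flat lists of floats; the exact normalization used in the paper's evaluation is not stated in this summary, and the embedding step itself is omitted.

```python
import math


def l1_distance(img_a, img_b):
    """Mean absolute difference between two equally sized flat pixel lists."""
    if len(img_a) != len(img_b):
        raise ValueError("images must have the same number of pixels")
    return sum(abs(a - b) for a, b in zip(img_a, img_b)) / len(img_a)


def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


pixel_gap = l1_distance([0.0, 1.0], [1.0, 1.0])
same_direction = cosine_similarity([1.0, 0.0], [1.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```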

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
GPT-4o evaluation on Real-Edit benchmark shows Multi-Reward conditioning significantly improves instruction following and detail preservation over the InsPix2Pix baseline.
Real-Edit	Following (Accuracy)	47.7	57.1	+9.4
Real-Edit	Preserving (Score 0-5)	4.21	4.50	+0.29
Real-Edit	Quality (Score 0-5)	4.33	4.41	+0.08
Human evaluation confirms the improvements seen in automated metrics, with the proposed method achieving the best scores across all three categories.
Real-Edit (Subset)	Following (Score 0-5)	3.37	4.08	+0.71
Real-Edit (Subset)	Preserving (Score 0-5)	3.67	4.00	+0.33
Compatibility experiments show the Multi-Reward framework also improves the stronger SmartEdit baseline.
Real-Edit	Following (Accuracy)	61.3	63.0	+1.7
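The Δ column can be recomputed directly from the baseline and proposed-method values in the table. The snippet below only re-derives numbers already reported above; the relative gain on the first row is an extra derived figure, not one quoted from the paper.

```python
rows = [
    # (benchmark, metric, baseline, this_paper, reported_delta)
    ("Real-Edit", "Following (Accuracy), InsPix2Pix base", 47.7, 57.1, 9.4),
    ("Real-Edit", "Preserving (Score 0-5)", 4.21, 4.50, 0.29),
    ("Real-Edit", "Quality (Score 0-5)", 4.33, 4.41, 0.08),
    ("Real-Edit (Subset)", "Following (Score 0-5), human", 3.37, 4.08, 0.71),
    ("Real-Edit (Subset)", "Preserving (Score 0-5), human", 3.67, 4.00, 0.33),
    ("Real-Edit", "Following (Accuracy), SmartEdit base", 61.3, 63.0, 1.7),
]

# Collect any row whose recomputed difference disagrees with the reported one.
mismatches = [
    (bench, metric)
    for bench, metric, base, ours, reported in rows
    if round(ours - base, 2) != reported
]

# Relative improvement in instruction-following accuracy over InsPix2Pix.
relative_gain_pct = round(100 * (57.1 - 47.7) / 47.7, 1)
```

Every reported delta matches its recomputed value, and the 9.4-point gain in instruction-following accuracy corresponds to a relative improvement of about 19.7% over the InsPix2Pix baseline.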

Experiment Figures

Visual comparison of editing results on real-world images.

Statistics of the RewardEdit-20K dataset.

Main Takeaways

Explicitly conditioning on 'quality' (rewards) allows models to learn from imperfect data by distinguishing good examples from bad ones.
Decomposing quality into 'Following', 'Preserving', and 'Quality' discourages the model from trading one off against another (e.g., ignoring instructions in order to preserve details).
Textual feedback in the reward condition provides granular supervision that scalar scores alone cannot capture.
The method is model-agnostic and improves both standard (InsPix2Pix) and VLM-enhanced (SmartEdit) editing pipelines.

📚 Prerequisite Knowledge

Prerequisites

Diffusion models (Stable Diffusion)
Instruction-based editing (InstructPix2Pix)
Vision-Language Models (GPT-4o/LLaVA) for evaluation

Key Terms

InsPix2Pix: InstructPix2Pix—a pioneering method and dataset for instruction-based image editing that uses a diffusion model fine-tuned on generated image pairs

MRC: Multi-Reward Condition—the proposed module that encodes reward scores and text descriptions into embeddings to guide the diffusion process

U-Net: The neural network architecture used in Stable Diffusion for denoising images

CLIP: Contrastive Language-Image Pre-training—a model used to encode text and images into a shared vector space, often used for conditioning diffusion models

LLaVA: Large Language-and-Vision Assistant—an open-source large multimodal model used by SmartEdit for better instruction understanding

GPT-4o: A large multimodal model from OpenAI used in this paper to generate reward scores and text feedback for training data

VAE: Variational Autoencoder—used to compress images into a latent space for efficient diffusion training