OneReward: Unified Mask-Guided Image Generation via Multi-Task Human Preference Learning

📝 Paper Summary

Mask-guided image generation
Reinforcement Learning from Human Feedback (RLHF) for vision
Multi-task learning

OneReward aligns a single image generation model across multiple tasks and conflicting evaluation metrics by using one Vision-Language Model as a reward judge that conditions its feedback on task-specific text queries.

Core Problem

Existing mask-guided generation models rely on task-specific fine-tuning that limits generalization, while current RLHF methods (like DPO) struggle to resolve conflicting preferences across different metrics (e.g., aesthetics vs. structure).

Why it matters:

Training separate models for inpainting, outpainting, and removal is inefficient and prevents knowledge transfer across related tasks
Standard preference optimization (DPO) fails when an image wins on one metric but loses on another, as it assumes a single global preference ordering
Previous reward-based methods (ReFL) require training distinct reward models for each metric (fidelity, safety, etc.), increasing complexity

Concrete Example: In image editing, a generated output might have perfect aesthetic quality but fail to preserve the structural lines of the background. A standard DPO approach cannot easily label this as a clear 'winner' or 'loser' without metric-specific granularity, leading to ambiguous training signals.
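
The conflict described above can be made concrete with a small sketch. The metric names follow the example, but the numeric scores and both helper functions are hypothetical illustrations, not part of the paper's method.

```python
from typing import Dict

def pairwise_verdicts(scores_a: Dict[str, float], scores_b: Dict[str, float]) -> Dict[str, str]:
    """For each metric, report which image is preferred: 'A', 'B', or 'tie'."""
    verdicts = {}
    for metric in scores_a:
        if scores_a[metric] > scores_b[metric]:
            verdicts[metric] = "A"
        elif scores_a[metric] < scores_b[metric]:
            verdicts[metric] = "B"
        else:
            verdicts[metric] = "tie"
    return verdicts

def has_global_winner(verdicts: Dict[str, str]) -> bool:
    """True only if no two metrics disagree about which image is better."""
    winners = {v for v in verdicts.values() if v != "tie"}
    return len(winners) <= 1

# Hypothetical scores: image A looks better, image B preserves structure better.
image_a = {"aesthetic": 0.9, "structure": 0.3}
image_b = {"aesthetic": 0.6, "structure": 0.8}

verdicts = pairwise_verdicts(image_a, image_b)
conflict = not has_global_winner(verdicts)
```

Here `conflict` is true, so a single winner/loser label for the pair would discard one of the two per-metric judgments.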

Key Novelty

OneReward: Generative VLM as a Multi-Task Reward Model

Uses a single pre-trained Vision-Language Model (VLM) to serve as the reward model for all tasks, rather than training separate scalar reward heads
Conditions the VLM with a text query encoding the specific task (e.g., 'object removal') and metric (e.g., 'consistency'), allowing it to output metric-specific 'Yes/No' preference probabilities
Eliminates Supervised Fine-Tuning (SFT) by applying Reinforcement Learning directly to the pre-trained base model using these unified reward signals
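
As a rough illustration of the query conditioning described above, the sketch below builds a task- and metric-specific query and turns a pair of 'Yes'/'No' logits into a preference probability. The query template, the function names, and the choice of a two-way softmax restricted to the two answer tokens are all assumptions; the summary does not give the paper's exact prompt or normalization.

```python
import math

def build_query(task: str, metric: str) -> str:
    """Compose a text query for the reward VLM (hypothetical template)."""
    return (
        f"Task: {task}. Metric: {metric}. "
        "Is the first image better than the second image on this metric? "
        "Answer Yes or No."
    )

def yes_probability(logit_yes: float, logit_no: float) -> float:
    """Two-way softmax over the 'Yes' and 'No' token logits (an assumption)."""
    m = max(logit_yes, logit_no)  # subtract the max for numerical stability
    e_yes = math.exp(logit_yes - m)
    e_no = math.exp(logit_no - m)
    return e_yes / (e_yes + e_no)

query = build_query("object removal", "consistency")
p_yes = yes_probability(2.0, -1.0)  # logits would come from the VLM in practice
```

The same VLM can then score any task and metric combination by changing only the query string, which is the property the framework relies on.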

Architecture

The OneReward training framework pipeline

Evaluation Highlights

Reward model achieves 84.93% accuracy in aligning with human judgments for object removal quality
Reward model exceeds 80% accuracy for text alignment evaluation in both image fill and image extend tasks
Proposed model (Seedream 3.0 Fill) is claimed to outperform commercial competitors such as Adobe Photoshop and Ideogram on alignment and aesthetics; exact win rates are not included in this summary

Breakthrough Assessment

8/10

Proposes a unified RL framework that successfully handles conflicting multi-objective optimization in vision generation without SFT, a significant methodological simplification over prior multi-model pipelines.

⚙️ Technical Details

Problem Definition

Setting: Multi-task mask-guided image generation (inpainting, outpainting, removal, text rendering)

Inputs: Source image I_src, Binary mask M, Text prompt P

Outputs: Edited image x_0 consistent with prompt and context

Pipeline Flow

Input Processing (Image, Mask, Prompt)
Policy Model Generation (Flow Matching)
Reward Evaluation (VLM-based)
Optimization (RL Update)

System Modules

Policy Model (Generation)

Generates the edited image content within the masked area

Model or implementation: Seedream 3.0 (Flow Matching) or FLUX Fill

Reference Model (Generation)

Provides a baseline generation to compare against for KL divergence constraints

Model or implementation: Frozen copy of pre-trained base model

Reward Model (OneReward)

Evaluates the quality of generated images relative to reference/loser images based on a specific metric

Model or implementation: Pre-trained VLM

Novel Architectural Elements

Integration of a generative VLM as a dynamic reward function that changes evaluation criteria based on input text queries, replacing static scalar reward heads

Modeling

Base Model: Seedream 3.0 (based on Flow Matching / Rectified Flow)

Training Method: Reinforcement Learning (ReFL-style) with VLM reward

Objective Functions:

Purpose: Maximize the probability that the generated image is preferred over the reference/loser image according to the VLM.

Formally: Cross-entropy loss on the VLM's 'Yes'/'No' token probability.
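
One way to write this objective down, as a sketch rather than the paper's exact formula: let x_0^θ be the policy output for inputs (I_src, M, P), x_0^ref the reference or loser image, and q the text query encoding task and metric. These superscripts and the symbol q are notation introduced here for illustration. With the target token fixed to 'Yes', the cross-entropy reduces to a negative log-likelihood:

```latex
\mathcal{L}_{\text{reward}}(\theta)
  = -\,\mathbb{E}_{(I_{src},\, M,\, P),\; q}
    \left[ \log p_{\text{VLM}}\!\left( \text{``Yes''} \,\middle|\, x_0^{\theta},\; x_0^{\text{ref}},\; q \right) \right]
```

Minimizing this loss with respect to the policy parameters θ, with the VLM held fixed, raises the probability that the VLM prefers the generated image on the queried metric.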

Adaptation: Full model update (Policy Model)

Training Data:

Human preference dataset containing triplets (Source, Mask, Prompt) and candidate images
Annotated with Best-of-N and Worst-of-N selections across dimensions: Structure, Consistency, Text Alignment, Aesthetic, Removal Quality
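
The sketch below shows one plausible way to turn Best-of-N and Worst-of-N selections into per-dimension (winner, loser) pairs. The pairing rule (best beats every other candidate, every remaining candidate beats the worst) and the data layout are assumptions; the summary does not specify how the paper derives pairs.

```python
from typing import Dict, List, Tuple

def pairs_from_annotation(candidates: List[str], best: str, worst: str) -> List[Tuple[str, str]]:
    """Return (winner, loser) pairs implied by one best/worst selection."""
    if best == worst:
        raise ValueError("best and worst must be different candidates")
    pairs = [(best, c) for c in candidates if c != best]
    pairs += [(c, worst) for c in candidates if c not in (best, worst)]
    return pairs

def pairs_by_dimension(
    candidates: List[str], annotations: Dict[str, Tuple[str, str]]
) -> Dict[str, List[Tuple[str, str]]]:
    """Map each dimension to its own pair list; annotations[dim] = (best, worst)."""
    return {
        dim: pairs_from_annotation(candidates, best, worst)
        for dim, (best, worst) in annotations.items()
    }

# Hypothetical annotation: candidate "a" is best on Structure but worst on Aesthetic.
candidates = ["a", "b", "c", "d"]
annotations = {"Structure": ("a", "d"), "Aesthetic": ("c", "a")}
pairs = pairs_by_dimension(candidates, annotations)
```

Keeping the pairs separate per dimension means the same candidate can be a winner under one query and a loser under another without contradiction.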

Compute: Not reported in the paper

Comparison to Prior Work

vs. DPO: OneReward handles multi-dimensional/conflicting metrics (e.g., aesthetic vs. structure) via explicit query conditioning, whereas DPO assumes a global ranking
vs. ReFL: Uses a single unified VLM for all metrics instead of training separate reward models for each metric
vs. FlowGRPO: Explicitly maximizes a reward signal during optimization rather than relying on policy-based advantage estimation

Limitations

VLM inference for reward calculation adds computational overhead compared to lightweight scalar reward heads
Depends on the quality of the underlying VLM's visual perception capabilities
Full quantitative generation results (e.g., win rates against Photoshop) are not included in this summary

Reproducibility

Code: https://one-reward.github.io

Code and model page available at https://one-reward.github.io. The paper uses Seedream 3.0 and FLUX Fill [dev] as base models. The dataset construction methodology involves randomizing inference parameters (steps, guidance scale) to create diverse candidates.
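
The candidate-diversification step can be sketched as random sampling of inference settings. The parameter ranges, the function name, and the dictionary keys below are hypothetical; the summary only states that steps and guidance scale are randomized.

```python
import random
from typing import Dict, List, Tuple

def sample_inference_configs(
    n: int,
    seed: int = 0,
    steps_range: Tuple[int, int] = (20, 50),             # assumed range
    guidance_range: Tuple[float, float] = (1.0, 10.0),   # assumed range
) -> List[Dict[str, float]]:
    """Draw n random (steps, guidance_scale) settings for candidate generation."""
    rng = random.Random(seed)  # local RNG so sampling is reproducible
    return [
        {
            "steps": rng.randint(*steps_range),
            "guidance_scale": rng.uniform(*guidance_range),
        }
        for _ in range(n)
    ]

configs = sample_inference_configs(4, seed=0)
```

Each configuration would be passed to the generator for the same (Source, Mask, Prompt) input, producing the N candidates that annotators then rank.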

📊 Experiments & Results

Evaluation Setup

Multi-task evaluation across Image Fill, Image Extend, Object Removal, and Text Rendering

Benchmarks:

Internal Human Preference Dataset (Pairwise comparison) [New]

Metrics:

Reward Model Accuracy (alignment with human labels)
Generation Quality (Structure, Consistency, Text Alignment, Aesthetic, Removal Quality)
Statistical methodology: Not explicitly reported in the paper
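
Reward Model Accuracy can be read as the fraction of test pairs on which the VLM's verdict matches the human label, computed separately per metric. The sketch below assumes a 0.5 decision threshold on the 'Yes' probability and a simple record layout; both are illustrative choices, and the toy records are not data from the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

def accuracy_by_metric(records: Iterable[Tuple[str, bool, float]]) -> Dict[str, float]:
    """records: (metric, human_prefers_first, p_yes) for each test pair."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for metric, human_prefers_first, p_yes in records:
        predicted_first = p_yes > 0.5  # assumed decision threshold
        total[metric] += 1
        correct[metric] += int(predicted_first == human_prefers_first)
    return {metric: correct[metric] / total[metric] for metric in total}

# Toy records for illustration only.
toy_records = [
    ("Removal Quality", True, 0.9),
    ("Removal Quality", False, 0.7),
    ("Text Alignment", True, 0.6),
    ("Text Alignment", False, 0.2),
]
accuracy = accuracy_by_metric(toy_records)
```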

Key Results

Benchmark	Metric	Baseline	This Paper (%)	Δ
Internal Test Set	Reward model accuracy (Removal Quality)	n/a	84.93	n/a
Internal Test Set	Reward model accuracy (Text Alignment, image fill and image extend)	n/a	>80	n/a

Experiment Figures

Data annotation protocol showing Best-of-N and Worst-of-N selection across multiple dimensions

Main Takeaways

The VLM-based reward model generalizes well across tasks, achieving highest accuracy in Text Alignment (>80%) and Object Removal (84.93%).
Intrinsic visual quality metrics like Structure and Consistency are harder to predict, with reward model accuracy in the low–mid 70% range.
The unified framework allows optimizing a single model for conflicting objectives (e.g., removal vs. fill) by conditioning the reward signal on the specific task definition.

📚 Prerequisite Knowledge

Prerequisites

Flow Matching / Diffusion Models
Reinforcement Learning from Human Feedback (RLHF)
Vision-Language Models (VLMs)

Key Terms

VLM: Vision-Language Model—a model capable of processing both image and text inputs to generate text outputs

RLHF: Reinforcement Learning from Human Feedback—fine-tuning models using rewards derived from human preferences

ReFL: Reward Feedback Learning—a method to fine-tune diffusion models by backpropagating reward gradients through the denoising process

DPO: Direct Preference Optimization—an alignment method that optimizes policies directly on preference pairs without an explicit reward model

Flow Matching: A generative modeling paradigm that learns a velocity field to transport a prior distribution to the data distribution, often more efficient than standard diffusion

SFT: Supervised Fine-Tuning—training on labeled input-output pairs

OneReward: The proposed framework using a VLM to generate task-aware reward signals via textual queries

Inpainting: Filling in a missing or masked region of an image

Outpainting: Extending an image beyond its original borders