Endogenous Reprompting: Self-Evolving Cognitive Alignment for Unified Multimodal Models

📝 Paper Summary

Unified Multimodal Models (UMMs) · Prompt Optimization / Reprompting · Visual Generation Alignment

SEER bridges the cognitive gap in unified multimodal models by turning their latent understanding into an active reasoning step that rewrites user prompts to align with the generator's internal priors.

Core Problem

Unified Multimodal Models (UMMs) understand visual instructions well but fail to follow them during image generation because they lack a mechanism to translate high-level understanding into generator-friendly descriptions.

Why it matters:

Despite sharing architectures, current models exhibit a 'cognitive gap': they can critique an image accurately yet fail to generate one that meets the same criteria
Prior reprompting methods use disjoint models (e.g., GPT-4 rewriter + Stable Diffusion), causing 'representation mismatch' where rewrites are linguistically valid but visually unrealizable by the specific generator
Standard RLHF for vision optimizes pixels directly, which fails to teach the model *how* to reason about the generation process itself

Concrete Example: Given the instruction 'Make the object look scary, yet undeniably cute', a standard model might generate a generic scary object. SEER's reasoning step explicitly translates the instruction into concrete descriptors like 'big eyes, fluffy texture, sharp teeth' that the specific generator knows how to render.

Key Novelty

Endogenous Reprompting via Self-Evolving Evaluator and Reprompter (SEER)

Transforms the model's passive understanding into an active 'reprompting' step within the *same* shared parameter space, ensuring the generated prompt aligns with what the generator can actually draw
Uses a two-stage loop: first training the model to be a verifiable evaluator (RLVR), then using that evaluator to train the model to think and rewrite prompts (RLMT), all using only 300 seed samples

Architecture

Figure: the SEER framework workflow, showing the transition from a passive UMM to an active reasoning agent.

Evaluation Highlights

Outperforms state-of-the-art baselines like Emu3 and Janus-Pro in instruction adherence and visual quality.
Achieves self-evolution using only 300 samples from a compact proxy task, compared to thousands used in standard fine-tuning.
Demonstrates that optimizing the reasoning (prompt) is more effective than optimizing the execution (pixels) for unlocking latent generative capabilities.

Breakthrough Assessment

8/10

Innovative use of 'endogenous' feedback loops (using the model to teach itself) to solve the alignment problem without massive external supervision or disjoint models. The low data requirement (300 samples) is particularly striking.

⚙️ Technical Details

Problem Definition

Setting: Visual Instruction Elaboration, i.e. optimizing a policy to generate a reprompt p that satisfies instruction a while preserving the content of the initial prompt p0

Inputs: Visual instruction a (e.g., edit request) and initial minimal prompt p0

Outputs: Reprompt p and generated image x_pol

Pipeline Flow

Reprompting Policy (generates reasoning z and reprompt p)
Generator (fixed, generates image from p)
Evaluator (judges image compliance)
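The three stages above can be sketched as a single function call chain. This is a minimal illustration, not the paper's implementation: `run_pipeline`, `RepromptOutput`, the toy stand-ins, and the example prompt `"a small monster"` are all hypothetical names introduced here, and the sketch assumes the baseline image is generated from the unmodified prompt p0.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class RepromptOutput:
    reasoning: str  # reasoning trace z
    reprompt: str   # rewritten prompt p


def run_pipeline(
    instruction: str,
    initial_prompt: str,
    policy: Callable[[str, str], RepromptOutput],
    generator: Callable[[str], str],
    evaluator: Callable[[str, str, str], bool],
) -> Tuple[RepromptOutput, str, bool]:
    """One pass: reprompt, generate, then judge against a baseline."""
    out = policy(instruction, initial_prompt)
    x_pol = generator(out.reprompt)
    # Assumption: the baseline image comes from the unmodified prompt p0.
    x_ref = generator(initial_prompt)
    preferred = evaluator(instruction, x_pol, x_ref)
    return out, x_pol, preferred


# Toy stand-ins so the sketch runs without a real model.
def toy_policy(instruction: str, p0: str) -> RepromptOutput:
    return RepromptOutput(
        reasoning=f"'{instruction}' needs concrete visual descriptors.",
        reprompt=f"{p0}, big eyes, fluffy texture, sharp teeth",
    )


def toy_generator(prompt: str) -> str:
    return f"<image of: {prompt}>"  # a string stands in for pixels


def toy_evaluator(instruction: str, x_pol: str, x_ref: str) -> bool:
    return len(x_pol) > len(x_ref)  # placeholder preference rule


result = run_pipeline(
    "Make the object look scary, yet undeniably cute",
    "a small monster",
    toy_policy, toy_generator, toy_evaluator,
)
```

In a real UMM the policy and evaluator would be the same network with shared parameters θ; passing them as separate callables here only keeps the data flow visible.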

System Modules

Reprompting Policy

Generates explicit reasoning trace and final reprompt based on instruction

Model or implementation: UMM (Understanding/Reasoning parameters θ optimized)

Generator

Maps textual prompts to pixels

Model or implementation: UMM (Generation parameters φ frozen)

Evaluator

Acts as an internal reward model, comparing the generated image against a baseline

Model or implementation: UMM (Understanding parameters θ, acting as judge)

Novel Architectural Elements

Endogenous Loop: The Evaluator and Reprompter share the *same* parameters (θ), allowing knowledge transfer between understanding and generation logic
Two-stage self-evolution: RLVR activates the evaluator → Evaluator provides reward for RLMT to train the reprompter

Modeling

Base Model: Unified Multimodal Model (the description suggests specific UMMs such as Emu3 or Janus, but the text refers only to a generic 'UMM')

Training Method: Two-stage RL: RLVR (Curriculum) then RLMT

Objective Functions:

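Purpose: Maximize expected reward while staying close to the reference policy.

Formally: GRPO objective maximizing E[A_i * log π_θ(u_i|q)] - β * D_KL(π_θ || π_ref), where A_i is the advantage of the i-th sampled output u_i for query q

Purpose: Define the endogenous reward by comparison with a naive baseline.

Formally: r_i = I(x_pol > x_ref) + I(valid format), where I(·) is the indicator function and x_pol > x_ref means the Evaluator prefers the policy image over the baseline image

Purpose: Penalize reprompts that drift from generator priors.

Formally: Implicit penalty via the shared parameter space, where the Evaluator penalizes any reprompt p not in Z_gen (presumably the set of descriptions the generator can render)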
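The reward and the group-relative advantage used by GRPO can be illustrated with a few lines of arithmetic. This is a sketch under assumptions: the function names are hypothetical, the advantage is taken as the reward standardized within one sampled group, and the population standard deviation is an arbitrary choice made here rather than something the text specifies.

```python
from statistics import mean, pstdev
from typing import List


def endogenous_reward(policy_preferred: bool, format_valid: bool) -> float:
    """r_i = I(x_pol > x_ref) + I(valid format)."""
    return float(policy_preferred) + float(format_valid)


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Standardize rewards within one sampled group (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; eps avoids division by zero
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled reprompts for the same instruction.
rewards = [
    endogenous_reward(True, True),    # preferred and well-formed
    endogenous_reward(False, True),   # well-formed only
    endogenous_reward(False, False),  # neither
    endogenous_reward(True, False),   # preferred only
]
advantages = group_relative_advantages(rewards)
```

Because advantages are computed relative to the group mean, a sample is only reinforced if it does better than its siblings, which is what removes the need for a separate value network.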

Training Data:

300 samples (Visual Instruction Elaboration dataset)
Split into Simple Instructions (Material, Perspective, Semantic) and Hard Instructions (Attribute, Constraint, Conceptual Reasoning)

Key Hyperparameters:

sample_size: 300 samples
RL_algorithm: GRPO

Comparison to Prior Work

vs. DALL-E 3: SEER is endogenous (same model rewrites and generates), avoiding representation mismatch
vs. ImageReward/DPO: SEER optimizes the *reasoning* (prompt) via RLMT, not the execution (pixels), teaching the model to 'think' before generating
vs. PromptEnhancer: Uses 300 samples vs large datasets; relies on internal knowledge activation rather than external supervision

Limitations

Relies on the base UMM having sufficient latent knowledge to be 'woken up'; cannot teach completely new concepts.
Generator parameters are frozen, so fundamental rendering failures cannot be fixed, only circumvented via better prompting.
Evaluation relies heavily on the model's own judgment capabilities (Endogenous Evaluator), which could suffer from self-bias if not carefully calibrated.

Reproducibility

Code: https://2kxx.github.io/SEER.github.io/

The project page above hosts the code, and the 300-sample dataset is provided. Training uses a compact proxy task rather than massive datasets.

📊 Experiments & Results

Evaluation Setup

Visual Instruction Elaboration task involving reprompting and generation.

Benchmarks:

Visual Instruction Elaboration Benchmark (Text-to-Image Generation with specific constraints) [New]

Metrics:

Evaluation Accuracy (of the internal evaluator)
Reprompting Efficiency
Generation Quality (Instruction Compliance)
Statistical methodology: Pairwise comparison against baselines (win-rate)
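A win rate from pairwise comparisons can be computed as follows. This is an illustrative sketch only: the `win_rate` helper, the outcome labels, and the convention of counting a tie as half a win are assumptions made here, not details taken from the paper.

```python
from typing import Sequence


def win_rate(outcomes: Sequence[str]) -> float:
    """Percentage of pairwise comparisons won; a tie counts as half a win."""
    if not outcomes:
        raise ValueError("need at least one comparison")
    score = sum(
        1.0 if o == "win" else 0.5 if o == "tie" else 0.0
        for o in outcomes
    )
    return 100.0 * score / len(outcomes)


# Hypothetical judgments of SEER output vs. a baseline on four prompts.
judgments = ["win", "win", "loss", "tie"]
rate = win_rate(judgments)
```

Under this convention, 50.0 corresponds to parity with the baseline, so any value above 50 indicates the method is preferred more often than not.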

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Visual Instruction Elaboration	Win rate vs. baseline (%)	50.0 (parity)	Above 50; exact value not stated in the text, but the figures show a clear advantage	Positive

Experiment Figures

Detailed training pipeline for RLVR (Stage 1) and RLMT (Stage 2).

Main Takeaways

SEER consistently outperforms state-of-the-art baselines in instruction compliance and generation quality.
The method requires minimal data (300 samples), suggesting that the capability is 'endogenous' (woken up) rather than learned from scratch.
Optimizing the reasoning process (reprompting) is more effective for UMMs than optimizing low-level execution pixels.
The endogenous loop ensures model-specific alignment, reducing the generation of impossible descriptions.

📚 Prerequisite Knowledge

Prerequisites

Unified Multimodal Models (UMMs)
Reinforcement Learning from Human Feedback (RLHF)
Prompt Engineering / Optimization

Key Terms

UMM: Unified Multimodal Model—a single model handling both understanding (image-to-text) and generation (text-to-image) within shared parameters

RLVR: Reinforcement Learning with Verifiable Rewards—training method where the reward is a binary correctness check (verifiable) rather than a learned scalar score

RLMT: Reinforcement Learning with Model-rewarded Thinking—training method where the model generates a 'reasoning trace' before the final answer, rewarded by a learned model

GRPO: Group Relative Policy Optimization—an RL algorithm that normalizes rewards within a group of sampled outputs to stabilize training without a separate value network

Endogenous Reprompting: A mechanism where the model uses its own internal understanding to rewrite prompts for itself, ensuring the new prompts match its own generative capabilities

Cognitive Gap: The discrepancy between a model's high performance in understanding tasks (e.g., VQA) and low performance in generation tasks using the same knowledge

Bradley-Terry model: A statistical model for estimating the probability that one item is preferred over another based on their scores

KL divergence: Kullback-Leibler divergence—a measure of how one probability distribution differs from a second, reference probability distribution
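The last two terms reduce to short formulas. The sketch below uses one common parametrization of the Bradley-Terry model (exponentiated scores) and the discrete form of KL divergence; the function names are illustrative and not taken from the paper.

```python
import math
from typing import Sequence


def bradley_terry_prob(score_i: float, score_j: float) -> float:
    """P(i preferred over j) = exp(s_i) / (exp(s_i) + exp(s_j))."""
    return 1.0 / (1.0 + math.exp(score_j - score_i))


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """D_KL(p || q) for discrete distributions on the same support.

    Assumes q[k] > 0 wherever p[k] > 0.
    """
    return sum(pk * math.log(pk / qk) for pk, qk in zip(p, q) if pk > 0)


equal_scores = bradley_terry_prob(0.0, 0.0)      # neither item is favored
higher_score = bradley_terry_prob(2.0, 0.0)      # item i is favored
same_dist = kl_divergence([0.5, 0.5], [0.5, 0.5])
shifted_dist = kl_divergence([0.5, 0.5], [0.25, 0.75])
```

Equal scores give a preference probability of one half, and the KL divergence is zero only when the two distributions coincide, which is why it serves as the "stay close to the reference policy" term in the RL objective.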