Scaling Image and Video Generation via Test-Time Evolutionary Search

📝 Paper Summary

Test-Time Scaling (TTS), Generative Models (Diffusion & Flow)

EvoSearch improves generative quality at inference time by treating the denoising trajectory as an evolutionary search, actively mutating intermediate latent states to discover high-reward samples without model retraining.

Core Problem

Existing test-time scaling methods for diffusion models fall short in two ways: Best-of-N spends compute inefficiently, and particle sampling is confined to a fixed initial candidate pool. Neither actively explores the high-dimensional latent space.

Why it matters:

Training-time scaling is hitting limits due to data depletion and soaring computational costs, making inference-time compute a critical new frontier
Current search methods (Best-of-N) waste compute on low-quality trajectories and lack the ability to correct course or discover novel modes during sampling
Standard fine-tuning methods (RL, backprop) often lead to reward over-optimization and mode collapse, sacrificing sample diversity

Concrete Example: In a standard diffusion process, if the initial noise candidates all lead to mediocre images, methods like Particle Sampling can only re-weight these bad options. EvoSearch instead mutates a promising intermediate state into a new, unexplored region of the latent space, potentially discovering a high-quality image that was not reachable from the original pool.

Key Novelty

Evolutionary Search on Denoising Trajectories

Reformulates the sequential denoising process as an evolving population where 'offspring' are generated by mutating the latent states of high-performing 'parents'
Transforms deterministic Flow-ODE sampling into a stochastic SDE process to enable exploration and variation in flow-based models
Leverages the insight that high-quality latent states are clustered, using specialized mutation operators to explore neighborhoods of best-performing particles

Architecture

Overview of the EvoSearch framework showing the evolution pipeline along the denoising trajectory.

Evaluation Highlights

Wan 1.3B with EvoSearch achieves performance competitive with the roughly 10x larger Wan 14B model (video generation)
Stable Diffusion 2.1 with EvoSearch surpasses GPT-4o on generation quality (the metric is not stated explicitly; human or reward-model preference is implied)
Consistently outperforms Best-of-N and Particle Sampling baselines in diversity and quality across image and video tasks

Breakthrough Assessment

8/10

Offers a unified, training-free framework for test-time scaling that works across both diffusion and flow architectures. The claim of bridging a 10x model size gap via inference search is significant.

⚙️ Technical Details

Problem Definition

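Setting: Sampling from a target distribution p_tar(x0) ∝ p_pre(x0) * exp(r(x0)/α), which favors high values of the reward function r while staying close to the pre-trained distribution p_pre. The temperature α > 0 sets the trade-off: a smaller α weights the reward more heavily.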

Inputs: Pre-trained diffusion or flow model, Reward function r(x), Initial Gaussian noise

Outputs: Optimized image or video sample x0
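
The reward-tilted target in the Setting above can be made concrete with self-normalized weights proportional to exp(r/α) over a batch of samples drawn from the pre-trained model. The following is a minimal sketch under that reading; the function name `tilted_weights` and the example rewards are hypothetical and do not come from the paper.

```python
import math

def tilted_weights(rewards, alpha):
    """Self-normalized weights proportional to exp(r / alpha).

    Applied to samples drawn from p_pre, these weights approximate
    sampling from p_tar(x0) ∝ p_pre(x0) * exp(r(x0) / alpha).
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    scaled = [r / alpha for r in rewards]
    shift = max(scaled)  # subtract the max for numerical stability
    unnormalized = [math.exp(s - shift) for s in scaled]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

# Hypothetical reward scores for three samples.
rewards = [0.2, 0.5, 0.9]
sharp = tilted_weights(rewards, alpha=0.1)   # small alpha: reward dominates
flat = tilted_weights(rewards, alpha=10.0)   # large alpha: close to p_pre
```

With a small α almost all weight lands on the highest-reward sample, while a large α leaves the weights nearly uniform, which is the "stay close to the pre-trained distribution" regime.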

Pipeline Flow

Population Initialization (Gaussian Noise)
Initial Noise Search (Evolution at t=T)
Iterative Denoising & Evolution (Loop from T to 0)
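
The three stages above can be sketched as a single loop on a toy one-dimensional latent. This is an illustration only: `denoise_step` and `reward` are hypothetical stand-ins for the generative model and the reward model, the reward is scored on the current state for simplicity, and plain truncation selection is used in place of the paper's tournament selection to keep the sketch short.

```python
import random

def denoise_step(x, t):
    # Hypothetical stand-in for one model step: pull the latent toward 0.5.
    return x + 0.1 * (0.5 - x)

def reward(x):
    # Hypothetical reward model: highest at x = 1.0.
    return -(x - 1.0) ** 2

def evolve(population, size, sigma, rng):
    """Keep the best state, then fill up with mutated copies of the top half."""
    ranked = sorted(population, key=reward, reverse=True)
    parents = ranked[: max(1, len(ranked) // 2)]
    children = [ranked[0]]
    while len(children) < size:
        children.append(rng.choice(parents) + sigma * rng.gauss(0.0, 1.0))
    return children

def evo_search(total_steps, evolution_schedule, population_schedule, seed=0):
    rng = random.Random(seed)
    sizes = dict(zip(evolution_schedule, population_schedule))
    # 1) Population initialization from Gaussian noise.
    population = [rng.gauss(0.0, 1.0) for _ in range(population_schedule[0])]
    # 2) and 3) Walk from t = T down to t = 1, evolving at scheduled timesteps.
    #    The first scheduled timestep is t = T, i.e. the initial noise search.
    for t in range(total_steps, 0, -1):
        if t in sizes:
            population = evolve(population, sizes[t], sigma=0.3, rng=rng)
        population = [denoise_step(x, t) for x in population]
    return population

final_population = evo_search(20, [20, 15, 10], [8, 8, 4])
best = max(final_population, key=reward)
```

Only the control flow is meant to carry over to a real model: search happens at a few scheduled timesteps, and every surviving candidate is denoised in between.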

System Modules

ODE-to-SDE Converter

Converts deterministic flow ODEs into stochastic SDEs to allow for exploration during sampling

Model or implementation: Mathematical transformation (Eq. 3)
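
To illustrate why a deterministic flow ODE can be swapped for a stochastic process without changing what is sampled, the sketch below uses a generic marginal-preserving construction: adding (σ²/2) times the score to the drift and injecting noise of scale σ leaves the marginals unchanged. It is not necessarily the paper's Eq. 3. The one-dimensional Gaussian data distribution (constants `M`, `D`) is a hypothetical toy chosen because its velocity and score are known in closed form.

```python
import math
import random
import statistics

M, D = 2.0, 0.5  # hypothetical 1-D data distribution N(M, D^2)

def std(s):
    # Std of the marginal of x_s = (1 - s) * noise + s * data.
    return math.sqrt((1.0 - s) ** 2 + (s * D) ** 2)

def velocity(x, s):
    """Exact marginal velocity of the Gaussian path (mean s * M, std std(s))."""
    sig = std(s)
    dsig = (-(1.0 - s) + s * D * D) / sig  # d std / d s
    return M + (dsig / sig) * (x - s * M)

def score(x, s):
    """Score (gradient of the log density) of N(s * M, std(s)^2)."""
    return -(x - s * M) / std(s) ** 2

def sample(n, steps, sigma, seed=0):
    """Euler-Maruyama from s = 0 (noise) to s = 1 (data).

    sigma = 0 gives the deterministic ODE sampler; sigma > 0 gives an SDE
    whose extra drift term keeps the marginals identical to the ODE's.
    """
    rng = random.Random(seed)
    dt = 1.0 / steps
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    for k in range(steps):
        s = k * dt
        xs = [
            x
            + (velocity(x, s) + 0.5 * sigma ** 2 * score(x, s)) * dt
            + sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
            for x in xs
        ]
    return xs

ode = sample(n=1000, steps=200, sigma=0.0)
sde = sample(n=1000, steps=200, sigma=1.0)
ode_mean, ode_std = statistics.mean(ode), statistics.pstdev(ode)
sde_mean, sde_std = statistics.mean(sde), statistics.pstdev(sde)
```

Both samplers should end near mean `M` and standard deviation `D`, but only the stochastic one produces different outputs from the same intermediate state, which is what a search procedure needs for exploration.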

Reward Evaluator (Evolution)

Calculates fitness scores for current population candidates

Model or implementation: Off-the-shelf reward model (e.g., Human Preference, VLM)

Evolution Engine (Evolution)

Performs selection and mutation on the population of latent states

Model or implementation: Tournament Selection + SDE-aware Mutation
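
One generation of selection and mutation could look like the following sketch. The tournament size `k`, the noise scale `sigma`, and the choice to carry the single best candidate over unchanged (`n_elites`) are assumptions made for illustration, and the plain Gaussian perturbation stands in for the paper's SDE-aware mutation.

```python
import random

def tournament_select(population, fitness, k, rng):
    """Draw k distinct contestants at random and return the fittest one."""
    contestants = rng.sample(range(len(population)), k)
    winner = max(contestants, key=lambda i: fitness[i])
    return population[winner]

def next_generation(population, fitness, size, sigma, rng, k=3, n_elites=1):
    """Build a new population from elites plus mutated tournament winners."""
    order = sorted(range(len(population)), key=lambda i: fitness[i], reverse=True)
    # Assumption: the top n_elites candidates survive unchanged.
    new_population = [list(population[i]) for i in order[:n_elites]]
    while len(new_population) < size:
        parent = tournament_select(population, fitness, k, rng)
        # Placeholder mutation: isotropic Gaussian noise around the parent.
        new_population.append([v + sigma * rng.gauss(0.0, 1.0) for v in parent])
    return new_population

rng = random.Random(0)
# Hypothetical 2-D latents with a toy fitness that prefers small norms.
population = [[rng.gauss(0.0, 1.0) for _ in range(2)] for _ in range(6)]
fitness = [-sum(v * v for v in x) for x in population]
offspring = next_generation(population, fitness, size=6, sigma=0.1, rng=rng)
```

Tournament selection only compares fitness values within a small random subset, so it needs no normalization of reward scores and still gives weaker candidates an occasional chance to reproduce.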

Novel Architectural Elements

Dynamic evolutionary search embedded directly into the denoising trajectory of generative models
Reverse-time SDE-inspired mutation operator that perturbs intermediate latent states while preserving manifold structure
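
A mutation that "preserves manifold structure" can be illustrated for the initial noise: mixing a parent with fresh Gaussian noise as sqrt(1 - β²)·x + β·ε keeps the result unit-variance Gaussian, so the mutated state still looks like valid starting noise. The exact operators are not given in this summary, so both functions below are assumed forms, and `beta` and `sigma_t` are illustrative parameters.

```python
import math
import random
import statistics

def mutate_initial_noise(x, beta, rng):
    """Variance-preserving mix of a parent noise vector with fresh noise."""
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    keep = math.sqrt(1.0 - beta * beta)
    return [keep * v + beta * rng.gauss(0.0, 1.0) for v in x]

def mutate_intermediate(x, sigma_t, rng):
    """Assumed form for intermediate states: add noise at the step's scale."""
    return [v + sigma_t * rng.gauss(0.0, 1.0) for v in x]

rng = random.Random(0)
parent = [rng.gauss(0.0, 1.0) for _ in range(20000)]
child = mutate_initial_noise(parent, beta=0.5, rng=rng)
parent_var = statistics.pvariance(parent)
child_var = statistics.pvariance(child)
```

Here `beta` controls how far the child moves from its parent: 0 returns the parent unchanged and 1 discards it entirely, while the variance stays close to 1 in both cases.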

Modeling

Base Model: Evaluated on Stable Diffusion 2.1 (Image) and Wan 1.3B (Video)

Training Method: Inference-time optimization only (no gradient updates to generative model)

Adaptation: None (Model weights are frozen)

Trainable Parameters: None

Key Hyperparameters:

evolution_schedule: Timesteps (from T down to t_n) at which evolutionary search is applied
population_schedule: List of population sizes k for each search step
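
The two schedules above might be bundled and validated as in the sketch below. The class name, the `total_steps` field, the concrete values, and the rule that both schedules have equal length (one population size per search step) are assumptions for illustration; the experimental values are not listed here.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class EvoSearchConfig:
    total_steps: int                       # number of denoising steps T
    evolution_schedule: Tuple[int, ...]    # timesteps where search occurs
    population_schedule: Tuple[int, ...]   # population size at each search step

    def __post_init__(self):
        steps = self.evolution_schedule
        if len(steps) != len(self.population_schedule):
            raise ValueError("need one population size per search step")
        if not steps or steps[0] != self.total_steps:
            raise ValueError("search must start at t = T (initial noise search)")
        if any(a <= b for a, b in zip(steps, steps[1:])):
            raise ValueError("timesteps must be strictly decreasing")
        if steps[-1] < 1:
            raise ValueError("timesteps must be at least 1")
        if any(k < 1 for k in self.population_schedule):
            raise ValueError("population sizes must be positive")

# Hypothetical values, not taken from the paper.
config = EvoSearchConfig(
    total_steps=50,
    evolution_schedule=(50, 40, 20),
    population_schedule=(8, 4, 2),
)
```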

Comparison to Prior Work

vs. Best-of-N: EvoSearch actively improves samples during generation via mutation rather than just filtering final outputs
vs. Particle Sampling: EvoSearch introduces novel diversity via mutation, whereas particle sampling suffers from degeneracy (diversity collapse)
vs. RL/Fine-tuning: EvoSearch requires no training/gradients and avoids reward hacking/mode collapse common in RL
vs. Video-T1: EvoSearch is a generalist framework for Diffusion/Flow models, not limited to autoregressive architectures
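
The degeneracy point in the particle sampling comparison can be seen in a toy example: reward-weighted resampling can only copy existing particles, whereas mutation creates new ones around the best candidate. The particle values, rewards, and both helper functions are hypothetical and chosen to make the effect obvious.

```python
import math
import random

def resample(particles, rewards, alpha, rng):
    """Multinomial resampling with weights proportional to exp(r / alpha)."""
    top = max(rewards)
    weights = [math.exp((r - top) / alpha) for r in rewards]
    return rng.choices(particles, weights=weights, k=len(particles))

def mutate_around_best(particles, rewards, sigma, rng):
    """Keep the best particle and fill the rest with perturbed copies of it."""
    best = particles[max(range(len(particles)), key=lambda i: rewards[i])]
    return [best] + [
        best + sigma * rng.gauss(0.0, 1.0) for _ in range(len(particles) - 1)
    ]

rng = random.Random(0)
particles = [-1.0, 0.0, 0.5, 2.0]
rewards = [0.0, 0.0, 0.0, 10.0]   # one particle dominates

resampled = resample(particles, rewards, alpha=0.1, rng=rng)
mutated = mutate_around_best(particles, rewards, sigma=0.2, rng=rng)

unique_after_resampling = len(set(resampled))
unique_after_mutation = len(set(mutated))
```

After resampling, the population collapses to copies of the single dominant particle, while the mutated population contains as many distinct candidates as it has members.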

Limitations

Computational cost increases with the number of search steps and population size
Requires an accurate and fast reward model to guide the search effectively
Flow model adaptation requires converting efficient ODE solvers to stochastic SDE solvers, potentially slowing down baseline inference
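
The first limitation can be quantified with a rough count of model calls. The sketch assumes that every population member is denoised at every step until the next search timestep, and that each search step scores one full population with the reward model. These accounting rules and the example schedules are assumptions, not figures from the paper.

```python
def denoiser_calls(total_steps, evolution_schedule, population_schedule):
    """Denoiser evaluations when population k_i is carried from t_i to t_(i+1)."""
    if evolution_schedule[0] != total_steps:
        raise ValueError("the first search step is assumed to be at t = T")
    boundaries = list(evolution_schedule) + [0]
    return sum(
        k * (boundaries[i] - boundaries[i + 1])
        for i, k in enumerate(population_schedule)
    )

def reward_calls(population_schedule):
    """Assumption: one reward evaluation per candidate per search step."""
    return sum(population_schedule)

def best_of_n_denoiser_calls(n, total_steps):
    """Best-of-N runs n full trajectories and scores only the final outputs."""
    return n * total_steps

# Hypothetical schedule: search at t = 50, 40, 20 with shrinking populations.
evo_cost = denoiser_calls(50, (50, 40, 20), (8, 4, 2))
evo_reward_cost = reward_calls((8, 4, 2))
bon_cost = best_of_n_denoiser_calls(4, 50)
```

Under these assumptions the example schedule costs the same number of denoiser calls as Best-of-4 but needs more reward evaluations, which is why a fast reward model matters.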

Reproducibility

Code: https://tinnerhrhe.github.io/evosearch

The link above is the project website. The paper describes the algorithms and math in detail, but specific hyperparameter values for the experiments (beta, population sizes) are not listed in the text summarized here.

📊 Experiments & Results

Evaluation Setup

Text-conditioned image and video generation optimizing for specific rewards

Benchmarks:

Text-to-Image Generation (Generation)
Text-to-Video Generation (Generation)

Metrics:

Sample Quality (Reward Score)
Diversity
Human-preference alignment
Statistical methodology: Not explicitly reported in the paper

Experiment Figures

Comparison of multimodal target distribution coverage between Retraining (RL), Search (Best-of-N), and EvoSearch.

t-SNE visualization of latent states colored by their generation quality.

Main Takeaways

EvoSearch effectively scales test-time compute to improve generation quality, bridging the gap between smaller and much larger models (e.g., 1.3B vs 14B)
The method prevents diversity collapse better than Particle Sampling by actively exploring new regions of the latent space via mutation
Optimizing intermediate states is effective because high-quality latent states cluster together in latent space, an observation supported by the t-SNE visualization

📚 Prerequisite Knowledge

Prerequisites

Diffusion Models (SDE vs ODE sampling)
Flow Matching / Flow-based models
Evolutionary Algorithms (Selection, Mutation, Population)
Importance Sampling

Key Terms

TTS: Test-Time Scaling, i.e. improving model performance by increasing computation during inference (e.g., search or verification) rather than training

SDE: Stochastic Differential Equation, a differential equation with a random noise term, used here to model the diffusion process

ODE: Ordinary Differential Equation, a deterministic differential equation, often used for sampling in flow models

Flow Models: Generative models that learn a velocity field to transform a simple distribution (noise) into data via an ODE

Best-of-N: A simple search strategy that generates N samples and selects the one with the highest reward

Particle Sampling: A sequential Monte Carlo method that resamples trajectories during the denoising process based on intermediate rewards

Tournament Selection: An evolutionary operator where a small subset of the population is chosen at random, and the best individual from that subset becomes a parent