
Scaling Image and Video Generation via Test-Time Evolutionary Search

Haoran He, Jiajun Liang, Xintao Wang, Pengfei Wan, Di Zhang, Kun Gai, Ling Pan
Hong Kong University of Science and Technology
arXiv.org (2025)
RL MM

📝 Paper Summary

Test-Time Scaling (TTS), Generative Models (Diffusion & Flow)
EvoSearch improves generative quality at inference time by treating the denoising trajectory as an evolutionary search, actively mutating intermediate latent states to discover high-reward samples without model retraining.
Core Problem
Existing test-time scaling methods for diffusion models, such as Best-of-N and Particle Sampling, either spend compute inefficiently or are confined to a fixed initial candidate pool, so they cannot actively explore the high-dimensional latent space.
Why it matters:
  • Training-time scaling is hitting limits due to data depletion and soaring computational costs, making inference-time compute a critical new frontier
  • Current search methods (Best-of-N) waste compute on low-quality trajectories and lack the ability to correct course or discover novel modes during sampling
  • Standard fine-tuning methods (RL, backprop) often lead to reward over-optimization and mode collapse, sacrificing sample diversity
Concrete Example: In a standard diffusion process, if the initial noise candidates all lead to mediocre images, methods like Particle Sampling can only re-weight these poor candidates. EvoSearch instead mutates a promising intermediate state, moving it into a previously unexplored region of the latent space and potentially discovering a high-quality image that was not reachable from the original pool.
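The contrast in this example can be illustrated with a toy one-dimensional search. This is a minimal sketch, not the paper's method: the reward function, the pool values, and the helper names `resample_only` and `resample_and_mutate` are all hypothetical, and additive Gaussian noise stands in for whatever mutation operator EvoSearch actually uses.

```python
import random

def reward(x):
    # Hypothetical 1-D reward with its peak at x = 3.0 (assumption for illustration).
    return -(x - 3.0) ** 2

def resample_only(pool, rounds):
    # Re-weighting only: survivors are duplicated, so no new values ever appear.
    for _ in range(rounds):
        ranked = sorted(pool, key=reward, reverse=True)
        top = ranked[: len(pool) // 2]
        pool = top + top
    return max(pool, key=reward)

def resample_and_mutate(pool, rounds, sigma, rng):
    # Selection plus mutation: survivors are kept, and each one also spawns
    # a noisy child that can land outside the original pool.
    for _ in range(rounds):
        ranked = sorted(pool, key=reward, reverse=True)
        top = ranked[: len(pool) // 2]
        children = [x + rng.gauss(0.0, sigma) for x in top]
        pool = top + children
    return max(pool, key=reward)

# A mediocre starting pool: every candidate is far from the reward peak.
initial_pool = [-1.0, 0.0, 0.5, 1.0]
rng = random.Random(0)

best_fixed = resample_only(initial_pool, rounds=20)
best_mutated = resample_and_mutate(initial_pool, rounds=20, sigma=0.5, rng=rng)

print("re-weighting only:", best_fixed, reward(best_fixed))
print("with mutation:    ", best_mutated, reward(best_mutated))
```

In this sketch, re-weighting can never return anything better than the best member of the starting pool, while the mutating variant keeps its survivors and so can only match or improve on that value.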
Key Novelty
Evolutionary Search on Denoising Trajectories
  • Reformulates the sequential denoising process as an evolving population where 'offspring' are generated by mutating the latent states of high-performing 'parents'
  • Converts deterministic flow-ODE sampling into an SDE process, giving flow-based models the stochasticity needed for exploration and variation
  • Leverages the insight that high-quality latent states are clustered, using specialized mutation operators to explore neighborhoods of best-performing particles
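The population view described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `denoise_step` is a toy stand-in for a real diffusion or flow model, `TARGET` and `reward` are hypothetical, the `evolve_at` schedule is made up, and plain additive Gaussian noise replaces the paper's specialized mutation operators.

```python
import random

DIM = 4
TARGET = [1.0, -1.0, 0.5, 0.0]  # hypothetical "ideal" sample for the toy reward

def denoise_step(x, t):
    # Toy stand-in for one deterministic denoising step (assumption).
    return [0.9 * v for v in x]

def rollout(x, t):
    # Denoise a latent at step t all the way to step 0 so it can be scored.
    for s in range(t, 0, -1):
        x = denoise_step(x, s)
    return x

def reward(sample):
    # Hypothetical reward: negative squared distance to TARGET.
    return -sum((a - b) ** 2 for a, b in zip(sample, TARGET))

def mutate(x, sigma, rng):
    # Simplified mutation: explore the neighborhood of a parent latent.
    return [v + rng.gauss(0.0, sigma) for v in x]

def evo_search(num_steps=20, pop_size=8, num_elites=2,
               evolve_at=(20, 15, 10, 5), sigma=0.3, seed=0):
    rng = random.Random(seed)
    # Initial population: independent Gaussian noise latents.
    population = [[rng.gauss(0.0, 1.0) for _ in range(DIM)]
                  for _ in range(pop_size)]
    for t in range(num_steps, 0, -1):
        if t in evolve_at:
            # Score each latent by the reward of its fully denoised sample.
            ranked = sorted(population,
                            key=lambda x: reward(rollout(x, t)),
                            reverse=True)
            elites = ranked[:num_elites]  # parents survive unchanged
            children = [mutate(rng.choice(elites), sigma, rng)
                        for _ in range(pop_size - num_elites)]
            population = elites + children
        # Every member advances one denoising step.
        population = [denoise_step(x, t) for x in population]
    return max(population, key=reward)

best = evo_search()
baseline = evo_search(evolve_at=())  # no evolution: plain Best-of-N on the same noise
print("EvoSearch-style:", reward(best))
print("Best-of-N:      ", reward(baseline))
```

Because the elites are carried forward unchanged and the toy denoiser is deterministic, this sketch can never score below Best-of-N on the same initial noise. A real model gives no such guarantee when rewards are estimated from intermediate states.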
Architecture
Figure 3: Overview of the EvoSearch framework, showing the evolution pipeline along the denoising trajectory.
Evaluation Highlights
  • Wan 1.3B model using EvoSearch achieves competitive performance with the 10x larger Wan 14B model (video generation)
  • Stable Diffusion 2.1 using EvoSearch is reported to surpass GPT-4o in generation quality (the comparison appears to rest on human or reward-model preference)
  • Consistently outperforms Best-of-N and Particle Sampling baselines in diversity and quality across image and video tasks
Breakthrough Assessment
8/10
Offers a unified, training-free framework for test-time scaling that works across both diffusion and flow architectures. The claim of bridging a 10x model size gap via inference search is significant.