
Advances in GRPO for Generation Models: A Survey

Zexiang Liu, Xianglong He, Yangguang Li
arXiv (2026)

📝 Paper Summary

Reinforcement Learning Alignment of Flow Matching Models
This survey reviews Flow-GRPO and its growing ecosystem: methods that adapt Group Relative Policy Optimization to continuous generative models by converting deterministic flow matching ODEs into stochastic SDEs, enabling stable reinforcement learning alignment.
Core Problem
Large-scale flow matching models generate high-quality outputs but struggle with alignment to human preferences: their deterministic sampling prevents exploration, and rewards are often sparse, arriving only at the end of the sampling trajectory.
Why it matters:
  • Standard flow matching uses deterministic ODE solvers, lacking the stochasticity required for policy gradient methods to explore and learn
  • Rewards in visual tasks (e.g., image aesthetics) are typically given only for the final image, creating a severe credit assignment problem where intermediate steps receive identical, noisy feedback
  • Optimizing for fixed reward models leads to 'reward hacking' and mode collapse, where models generate high-scoring but visually degraded or repetitive outputs
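The ODE-to-SDE conversion that unlocks exploration can be illustrated by contrasting one Euler step with one Euler-Maruyama step. Here `velocity`, `score`, and `sigma` are hypothetical stand-ins for a trained velocity field, its implied score function, and a noise schedule; the score-corrected drift form is the standard way to inject noise while preserving the marginals, sketched here rather than Flow-GRPO's exact parameterization.

```python
import numpy as np

def ode_step(x, t, dt, velocity):
    # Deterministic flow matching update (Euler): every rollout from the
    # same prompt and seed is identical, so there is nothing to explore.
    return x + velocity(x, t) * dt

def sde_step(x, t, dt, velocity, score, sigma, rng):
    # Stochastic update (Euler-Maruyama): injected noise enables policy
    # gradient exploration, and a score-based drift correction keeps the
    # per-step marginal distributions consistent with the original ODE.
    drift = velocity(x, t) + 0.5 * sigma(t) ** 2 * score(x, t)
    noise = sigma(t) * np.sqrt(dt) * rng.standard_normal(x.shape)
    return x + drift * dt + noise
```

Setting `sigma` to zero recovers the deterministic step exactly, which is a useful sanity check when wiring this into a sampler.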
Concrete Example: In standard Flow-GRPO, if a generated image gets a low score due to a bad detail added in the final steps, the negative feedback is unfairly applied to early steps that correctly established the global structure. Conversely, DenseGRPO predicts a clean image at every step to isolate exactly when the quality dropped.
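The contrast between terminal-only and per-step credit can be made concrete with a small sketch. The per-step rewards are assumed to come from scoring a predicted clean image at each denoising step; the difference-based credit rule below is an illustrative assumption in the spirit of dense-reward methods, not DenseGRPO's exact formulation.

```python
import numpy as np

def sparse_advantages(final_reward, num_steps):
    # Terminal-only reward: every step inherits the same noisy signal,
    # so a bad final detail unfairly penalizes good early steps.
    return np.full(num_steps, final_reward)

def dense_advantages(step_rewards):
    # Per-step rewards on predicted clean images: credit each step with
    # the change in predicted quality it caused, isolating exactly when
    # quality dropped.
    step_rewards = np.asarray(step_rewards, dtype=float)
    return np.diff(step_rewards, prepend=step_rewards[0])
```

In the sparse case all steps of a low-scoring trajectory receive identical blame; in the dense case only the step where the predicted-image score fell gets a negative credit.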
Key Novelty
Flow-GRPO Ecosystem and Taxonomy
  • Systematizes the rapid expansion of Flow-GRPO methods into categories like dense reward design, credit assignment, and training acceleration
  • Highlights the core innovation of Flow-GRPO: injecting stochasticity into deterministic flow matching via SDEs to enable 'critic-free' relative policy optimization
  • Contrasts different approaches to credit assignment, such as tree-search branching (TreeGRPO) versus process reward injection (Euphonium), to solve the sparse reward problem
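The 'critic-free' relative optimization above reduces to normalizing rewards within a group of samples generated for the same prompt, replacing a learned value baseline. A minimal sketch of that advantage computation:

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    # GRPO's critic-free baseline: standardize each sample's reward by
    # the mean and std of its own group (multiple generations for one
    # prompt), so no separate value network is needed.
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)
```

Because the advantages are relative, only the ranking of samples within a group matters; a uniformly shifted reward model changes nothing.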
Evaluation Highlights
  • Flow-GRPO improves GenEval compositional accuracy from 63% to 95% and visual text-rendering accuracy from 59% to 92% over baseline flow matching models
  • DiffusionNFT achieves a 25x speedup over standard Flow-GRPO by performing online RL on the forward noising process rather than reverse denoising
  • DisCo improves Unique Face Accuracy in multi-human generation to 98.6% (vs <50% baseline) by using a compositional reward to penalize facial similarity
Breakthrough Assessment
9/10
Comprehensive survey of a rapidly emerging field (200+ papers since mid-2025). Effectively categorizes crucial innovations in RL alignment for continuous generative models.