← Back to Paper List

InfLVG: Reinforce Inference-Time Consistent Long Video Generation with GRPO

Xueji Fang, Liyuan Ma, Zhiyang Chen, Mingyuan Zhou, Guo-Jun Qi
MAPLE Lab, Westlake University
arXiv.org (2025)
MM Memory RL Benchmark

📝 Paper Summary

Video Generation Autoregressive Modeling Reinforcement Learning for Generation
InfLVG enables coherent long video generation by using a reinforcement learning-trained policy to dynamically select the most relevant past frames during inference, ensuring consistency across scenes without retraining the base video model.
Core Problem
Naive autoregressive video extension fails to handle cross-scene transitions because accumulating history eventually dominates the model's attention, making it ignore new prompts or stick to old scene semantics.
Why it matters:
  • Generating long, multi-scene videos is computationally expensive due to quadratic scaling of attention with sequence length
  • Current models trained on short clips struggle to generalize to long narratives, often losing subject identity or failing to follow changing text prompts over time
  • Extending context windows naively introduces noise and irrelevant features that degrade generation quality
Concrete Example: When extending a video of a woman walking in a street to a new scene where she enters a cafe, standard autoregressive models often keep generating the street background despite the new prompt, or distort her face because they attend to too many irrelevant past tokens.
Key Novelty
Inference-time Context Selection Policy via GRPO (Group Relative Policy Optimization)
  • Instead of using all past frames or a fixed sliding window, a lightweight policy network predicts which specific past tokens are most relevant for the current generation step
  • This policy is trained using reinforcement learning (GRPO) to maximize rewards for identity preservation, text alignment, and visual quality, without altering the heavy base video generator
  • Uses a Top-K ranking mechanism to select a fixed budget of context tokens, keeping computational costs constant regardless of total video length
Architecture
Architecture Figure Figure 3
The inference-time pipeline showing how the Context Selection Policy interacts with the video generation process.
Evaluation Highlights
  • Extends video generation length by up to 9× compared to standard autoregressive baselines while maintaining higher consistency
  • Outperforms sliding-window approaches (e.g., FreeNoise) on the proposed CsVBench (Cross-scene Video Benchmark) in both subject consistency and prompt alignment
  • Reduces inference latency compared to full-context attention by maintaining a fixed kv-cache size via the top-K selection mechanism
Breakthrough Assessment
7/10
Offers a practical, inference-only solution to the context length problem in video generation. While relying on existing RL techniques (GRPO), applying them to dynamic KV-cache selection for video is a novel and effective application.
×