InfLVG: Reinforce Inference-Time Consistent Long Video Generation with GRPO

📝 Paper Summary

Video Generation, Autoregressive Modeling, Reinforcement Learning for Generation

InfLVG enables coherent long video generation by using a reinforcement learning-trained policy to dynamically select the most relevant past frames during inference, ensuring consistency across scenes without retraining the base video model.

Core Problem

Naive autoregressive video extension fails to handle cross-scene transitions because accumulating history eventually dominates the model's attention, making it ignore new prompts or stick to old scene semantics.

Why it matters:

Generating long, multi-scene videos is computationally expensive due to quadratic scaling of attention with sequence length
Current models trained on short clips struggle to generalize to long narratives, often losing subject identity or failing to follow changing text prompts over time
Extending context windows naively introduces noise and irrelevant features that degrade generation quality

Concrete Example: When extending a video of a woman walking in a street to a new scene where she enters a cafe, standard autoregressive models often keep generating the street background despite the new prompt, or distort her face because they attend to too many irrelevant past tokens.

Key Novelty

Inference-time Context Selection Policy via GRPO (Group Relative Policy Optimization)

Instead of using all past frames or a fixed sliding window, a lightweight policy network predicts which specific past tokens are most relevant for the current generation step
This policy is trained using reinforcement learning (GRPO) to maximize rewards for identity preservation, text alignment, and visual quality, without altering the heavy base video generator
Uses a Top-K ranking mechanism to select a fixed budget of context tokens, keeping computational costs constant regardless of total video length

Architecture

The inference-time pipeline showing how the Context Selection Policy interacts with the video generation process.

Evaluation Highlights

Extends video generation length by up to 9× compared to standard autoregressive baselines while maintaining higher consistency
Outperforms sliding-window approaches (e.g., FreeNoise) on the proposed CsVBench (Cross-scene Video Benchmark) in both subject consistency and prompt alignment
Reduces inference latency compared to full-context attention by keeping the KV cache at a fixed size via the Top-K selection mechanism

Breakthrough Assessment

7/10

Offers a practical solution to the context length problem in video generation that leaves the base generator untouched. Although it relies on an existing RL technique (GRPO), applying that technique to dynamic KV-cache selection for video is novel and effective.

⚙️ Technical Details

Problem Definition

Setting: Autoregressive text-to-video generation where a video is generated as a sequence of segments V_0, V_1, ... V_N conditioned on evolving text prompts P

Inputs: Sequence of text prompts P corresponding to different scenes; initial video segment V_0

Outputs: Extended video sequence V_1...V_N that maintains subject identity and adheres to prompt changes

Pipeline Flow

Initial Segment Generation
Context Scoring (Policy Network)
Top-K Selection & KV Cache Update
Next Segment Denoising (Base Generator)
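
The four steps above can be sketched as a simple loop. This is a minimal illustration under stated assumptions, not the paper's implementation: `score_context`, `select_top_k`, `denoise_segment`, and `generate_long_video` are hypothetical names, the policy is replaced by a dot-product score, and the base generator is replaced by a random-number stub. The point it shows is that the history grows with every segment while the context passed to the generator stays capped at K tokens.

```python
import numpy as np

def score_context(history, prompt_emb):
    """Hypothetical stand-in for the policy network: dot-product relevance."""
    return history @ prompt_emb

def select_top_k(history, scores, k):
    """Keep the k highest-scoring history tokens (all of them if fewer than k)."""
    k = min(k, len(scores))
    keep = np.sort(np.argsort(scores)[-k:])  # sort indices to restore temporal order
    return history[keep]

def denoise_segment(context, prompt_emb, tokens_per_segment, rng):
    """Placeholder for the frozen base generator; returns random 'tokens'."""
    return rng.normal(size=(tokens_per_segment, prompt_emb.shape[0]))

def generate_long_video(prompt_embs, k, tokens_per_segment=16, seed=0):
    rng = np.random.default_rng(seed)
    dim = prompt_embs[0].shape[0]
    # Step 1: initial segment, generated with an empty context.
    history = denoise_segment(np.empty((0, dim)), prompt_embs[0],
                              tokens_per_segment, rng)
    context_sizes = []
    for prompt_emb in prompt_embs[1:]:
        scores = score_context(history, prompt_emb)        # Step 2: context scoring
        context = select_top_k(history, scores, k)         # Step 3: Top-K selection
        context_sizes.append(len(context))
        segment = denoise_segment(context, prompt_emb,     # Step 4: next segment
                                  tokens_per_segment, rng)
        history = np.concatenate([history, segment])       # history grows, context does not
    return history, context_sizes
```

Calling `generate_long_video` with five prompt embeddings and `k=24` returns a history of five segments, while the recorded context sizes never exceed 24.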

System Modules

Base Video Generator

Generates video segments via iterative denoising

Model or implementation: CausVid (distilled from WanX)

Context Selection Policy (Retrieval & Selection)

Assigns relevance scores to all available past tokens to decide what to keep

Model or implementation: Lightweight Transformer (N1 cross-attention blocks + N2 linear layers)

Token Selector (Retrieval & Selection)

Samples indices based on scores and gathers specific KV pairs

Model or implementation: Top-K Sampling (Plackett-Luce)
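
One common way to draw a Top-K sample from a Plackett-Luce model is the Gumbel-top-k trick: perturb each score with independent Gumbel noise and keep the K largest. The sketch below assumes the policy outputs unnormalized log-scores (`logits`); the function names are illustrative and the paper's actual sampler may differ. The second function computes the log-probability of an ordered selection, which is the quantity a policy-gradient method would need.

```python
import numpy as np

def sample_top_k(logits, k, rng):
    """Sample k distinct indices from a Plackett-Luce model (Gumbel-top-k trick)."""
    logits = np.asarray(logits, dtype=float)
    gumbel = rng.gumbel(size=logits.shape)
    # Sorting the perturbed scores in descending order gives a Plackett-Luce ranking.
    return np.argsort(-(logits + gumbel))[:k]

def plackett_luce_log_prob(logits, ordered_idx):
    """Log-probability of drawing `ordered_idx` in that order, without replacement."""
    logits = np.asarray(logits, dtype=float)
    remaining = np.ones(len(logits), dtype=bool)
    log_prob = 0.0
    for i in ordered_idx:
        rest = logits[remaining]
        m = rest.max()
        log_norm = m + np.log(np.exp(rest - m).sum())  # stable log-sum-exp
        log_prob += logits[i] - log_norm
        remaining[i] = False  # item i can no longer be chosen
    return log_prob
```

The selected indices would then be used to gather the corresponding key-value pairs from the cache.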

Novel Architectural Elements

Inference-time learnable context selection module inserted before the attention mechanism of a pre-trained DiT
Integration of Top-K ranking directly into the KV-cache management of a video diffusion model

Modeling

Base Model: CausVid (14B parameters, derived from WanX)

Training Method: Group Relative Policy Optimization (GRPO)

Objective Functions:

Purpose: Maximize expected reward of generated videos relative to group baseline.

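Formally: $J(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G} \min\left(r_i(\theta)\,A_i,\ \mathrm{clip}\left(r_i(\theta),\,1-\epsilon,\,1+\epsilon\right)A_i\right)\right]$, where $r_i(\theta)$ is the new-to-old policy probability ratio for sample $i$, $A_i$ is its group-relative advantage, and $G$ is the group size (standard GRPO objective without KL penalty)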
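
As a rough numerical illustration of this objective, the sketch below computes group-relative advantages and the clipped surrogate for one group of samples. It assumes the usual GRPO convention of standardizing rewards by the group mean and standard deviation; `clip_eps=0.2` is a commonly used default, not a value reported in the paper, and the function names are hypothetical.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group (assumed GRPO convention)."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def grpo_objective(log_prob_new, log_prob_old, rewards, clip_eps=0.2):
    """Clipped surrogate averaged over the group, with no KL penalty term."""
    adv = group_relative_advantages(rewards)
    ratio = np.exp(np.asarray(log_prob_new, dtype=float)
                   - np.asarray(log_prob_old, dtype=float))
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    # Taking the element-wise minimum makes the update pessimistic in both directions.
    return float(np.mean(np.minimum(ratio * adv, clipped * adv)))
```

Here `log_prob_new` and `log_prob_old` would be the log-probabilities of each sampled context selection under the current and the sampling policy, and `rewards` the scalar reward of each resulting video.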

Adaptation: Context Selection Policy only (Base model frozen)

Trainable Parameters: Parameters of the context selection policy (small network)

Training Data:

Not explicitly detailed beyond using generated samples for RL updates

Key Hyperparameters:

group_size_G: Not explicitly reported in the paper
clip_epsilon: Standard PPO parameter (implied)
context_budget_K: Not explicitly reported in the paper
history_length_L: n * l * h * w (total tokens)
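
To make the history-length formula concrete, the sketch below counts history tokens and the number of query-key pairs one new segment forms with its context, with and without a fixed budget. The reading of the symbols is an assumption (n past segments, l latent frames per segment, h × w latent spatial positions), and `budget_k` is a hypothetical parameter standing in for the context budget K.

```python
def history_tokens(n, l, h, w):
    """Total history tokens, following the summary's formula L = n * l * h * w."""
    return n * l * h * w

def cross_context_pairs(n, l, h, w, budget_k=None):
    """Query-key pairs between one new segment and its context."""
    queries = l * h * w                  # tokens in the segment being generated
    keys = history_tokens(n, l, h, w)    # full history grows linearly with n
    if budget_k is not None:
        keys = min(keys, budget_k)       # Top-K selection caps the context size
    return queries * keys
```

With full context the pair count for each new segment grows with n, whereas with a budget it stops growing once the history exceeds `budget_k`.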

Compute: Not reported in the paper

Comparison to Prior Work

vs. FreeNoise: Adaptive selection vs. fixed sliding window; InfLVG allows attending to distant but relevant past frames
vs. CausVid: InfLVG adds a selection policy to manage context explosion, whereas CausVid attends to full or truncated history
vs. FIFO [not cited in paper]: InfLVG uses semantic relevance for selection, whereas FIFO typically uses a first-in-first-out heuristic

Limitations

Requires a pretrained reward model suite (ArcFace, CLIP, VLM), which adds complexity to the training pipeline
The base generator stays frozen and only the context selection is learned, so the method cannot correct fundamental defects in the base generator's capabilities
The top-K selection is a hard decision that might discard context if the budget K is too small (though K is tunable)

Reproducibility

Code: https://github.com/MAPLE-AIGC/InfLVG

The paper clearly defines the reward functions (ArcFace, CLIP, VLM artifact detection). Specific hyperparameters such as learning rate and group size are not detailed in the main text.

📊 Experiments & Results

Evaluation Setup

Long video generation extending base clips to multiple scenes

Benchmarks:

CsVBench (Cross-scene Video Benchmark) (Multi-scene video generation with shared subjects) [New]

Metrics:

Identity Consistency (ArcFace similarity)
Text Alignment (CLIP score)
Artifact Rate (VLM detection)
Statistical methodology: Not explicitly reported in the paper
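
Embedding-based metrics of this kind are typically cosine similarities, and a detection-based metric is typically a fraction of flagged samples. The sketch below illustrates that arithmetic only. It assumes embeddings have already been extracted by some face or image-text encoder (not shown), and the aggregation by simple averaging is an assumption rather than the paper's stated protocol. All function names are hypothetical.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def mean_identity_consistency(reference_emb, frame_embs):
    """Average similarity of per-frame face embeddings to a reference embedding."""
    return float(np.mean([cosine_similarity(reference_emb, e) for e in frame_embs]))

def artifact_rate(flags):
    """Fraction of videos flagged as containing artifacts by a detector."""
    flags = list(flags)
    return sum(bool(f) for f in flags) / len(flags)
```

The same `cosine_similarity` helper could be applied to image and text embeddings for a CLIP-style alignment score.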

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
CsVBench	Video Length Extension (× baseline length)	1.0	9.0	+8.0

Experiment Figures

Comparison of naive autoregressive generation vs. InfLVG.

Main Takeaways

InfLVG extends video generation length by up to 9× compared to naive autoregressive baselines.
The context selection policy effectively balances consistency (keeping subject features) and flexibility (adopting new prompts).
Visual artifacts (mosaic/blocking) are reduced by the artifact-suppression reward component.
The fixed context budget (Top-K) ensures inference costs do not explode quadratically with video length.

📚 Prerequisite Knowledge

Prerequisites

Autoregressive generation (Transformers)
Diffusion Models (Latent Diffusion)
Reinforcement Learning (Policy Gradients)

Key Terms

GRPO: Group Relative Policy Optimization—a reinforcement learning algorithm that optimizes a policy by comparing outcomes within a group of samples rather than using a learned value function critic

KV cache: Key-Value cache—stored intermediate representations in Transformer models used to avoid recomputing attention for past tokens during generation

Top-K ranking: A selection strategy that keeps only the K items with the highest assigned scores

DiT: Diffusion Transformer, a diffusion model architecture that replaces the U-Net backbone with a Transformer; used here as the backbone of the video generator

Autoregressive: Generating data sequentially, where each new piece depends on previously generated pieces

Plackett-Luce model: A probabilistic model for ranking items, used here to sample an ordered list of context tokens based on predicted relevance scores

CsVBench: Cross-scene Video Benchmark—a new benchmark proposed in this paper containing multi-scene prompts with shared subjects

ArcFace: A face recognition model used here to compute identity consistency rewards

CLIP: Contrastive Language-Image Pre-training—a model used to measure how well generated images match text prompts

VLM: Vision-Language Model—used here as an artifact detector to penalize low-quality generations