Reinforcement Learning with Inverse Rewards for World Model Post-training

📝 Paper Summary

World Models, Video Generation, Reinforcement Learning, Post-training

RLIR improves the action-following ability of video world models by using an Inverse Dynamics Model to derive verifiable rewards from generated videos, avoiding expensive human annotation.

Core Problem

Current video world models generate high-fidelity visuals but often fail to accurately follow specific human-specified actions (e.g., ignoring a 'jump' command).

Why it matters:

Accurate action-following is critical for world models to serve as reliable simulators for agents in gaming or robotics
Collecting human preference annotations for video is prohibitively expensive and hard to scale compared to text
Rule-based verifiers (used in coding/math LLMs) are generally infeasible for high-dimensional video outputs

Concrete Example: In a Minecraft simulation, if a user inputs a 'dig' action, a standard world model might generate a visually plausible frame in which the character merely stands still. The proposed method detects this mismatch by inferring the action 'stand' from the generated video and penalizes the generation.

Key Novelty

Reinforcement Learning with Inverse Rewards (RLIR)

Uses an Inverse Dynamics Model (IDM) to map generated high-dimensional video back to low-dimensional action space
Calculates reward by comparing the IDM-inferred action with the original ground-truth input action
Optimizes the world model using Group Relative Policy Optimization (GRPO) based on this objective action-consistency signal

Evaluation Highlights

+5-10% improvement in action-following metrics (F1, Precision, Recall) across autoregressive and diffusion world models
Up to +10% improvement in visual quality metrics (FVD, VBench) despite optimizing for action accuracy
Higher human preference scores for both action-following and visual quality compared to base models

Breakthrough Assessment

8/10

First post-training method specifically designed for action-following in video world models. Elegantly bypasses the need for video reward models or human labeling by leveraging inverse dynamics.

⚙️ Technical Details

Problem Definition

Setting: Post-training video world models to align generated outputs with conditioning actions

Inputs: Initial state x0 and a sequence of actions a1...an

Outputs: Generated video frames x^1...x^n that reflect the input actions

Pipeline Flow

World Model (generates video trajectory from actions)
Inverse Dynamics Model (infers actions from generated video)
Reward Calculation (compares inferred actions vs. input actions)
GRPO Update (optimizes World Model)
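
The four stages above can be sketched as a single rollout loop. This is a toy illustration under stated assumptions, not the paper's implementation: `rollout_group`, `toy_world_model`, and `toy_idm` are hypothetical names, frames are stand-in strings rather than images, and the GRPO update itself is left out.

```python
import random
from typing import Callable, List, Sequence

Frame = str  # stand-in for an image tensor; purely illustrative


def rollout_group(
    world_model: Callable[[Frame, Sequence[str]], List[Frame]],
    idm: Callable[[Frame, Frame], str],
    x0: Frame,
    actions: Sequence[str],
    group_size: int,
) -> List[float]:
    """Sample several videos for one action sequence and score each with the IDM."""
    rewards = []
    for _ in range(group_size):
        # Stage 1: the world model generates one frame per input action.
        frames = world_model(x0, actions)
        # Stage 2: the IDM infers an action for each consecutive frame pair.
        previous = [x0] + frames[:-1]
        inferred = [idm(a, b) for a, b in zip(previous, frames)]
        # Stage 3: the reward counts steps where inferred and input actions agree.
        rewards.append(float(sum(p == t for p, t in zip(inferred, actions))))
    # Stage 4 (not shown): a GRPO update would consume these group rewards.
    return rewards


_rng = random.Random(0)


def toy_world_model(x0: Frame, actions: Sequence[str]) -> List[Frame]:
    """Toy generator that ignores the requested action 30% of the time."""
    return [a if _rng.random() > 0.3 else "stand" for a in actions]


def toy_idm(prev_frame: Frame, frame: Frame) -> str:
    """Toy IDM: in this sketch the frame is simply the action label."""
    return frame


rewards = rollout_group(toy_world_model, toy_idm, "start", ["forward", "dig", "jump"], 4)
print(rewards)
```

Each entry of `rewards` scores one sampled video for the same action sequence, which is the grouping GRPO needs to compare rollouts against one another.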

System Modules

World Model

Generate video sequence conditioned on input actions

Model or implementation: MineWorld (Autoregressive) or NFD (Diffusion)

Inverse Dynamics Model (IDM) (Reward Calculation)

Infer the action that caused the transition between generated frames

Model or implementation: VPT-pretrained IDM (Transformer-based)

Reward Function (Reward Calculation)

Compute scalar reward based on action alignment

Model or implementation: Deterministic comparison
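
A minimal sketch of such a deterministic comparison, assuming actions are discrete labels (strings here) and the reward is a per-step match count. The function name `inverse_reward` and the optional `normalize` flag are illustrative and not taken from the paper.

```python
from typing import Sequence


def inverse_reward(
    inferred: Sequence[str],
    conditioned: Sequence[str],
    normalize: bool = False,
) -> float:
    """Count the steps where the IDM-inferred action equals the input action."""
    if len(inferred) != len(conditioned):
        raise ValueError("action sequences must have the same length")
    matches = sum(1 for a_hat, a in zip(inferred, conditioned) if a_hat == a)
    if normalize and conditioned:
        # Assumed variant: scale to [0, 1] so rewards are comparable across lengths.
        return matches / len(conditioned)
    return float(matches)


# The world model was asked to dig, but the IDM sees the character standing still.
conditioned = ["forward", "dig", "jump"]
inferred = ["forward", "stand", "jump"]
print(inverse_reward(inferred, conditioned))
```

Because the comparison involves no learned component beyond the IDM itself, the same video always receives the same reward, which is what makes the signal verifiable.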

Modeling

Base Model: MineWorld (Autoregressive LLaMA-based) and NFD (Diffusion Transformer-based)

Training Method: Group Relative Policy Optimization (GRPO)

Objective Functions:

Purpose: Maximize the likelihood that generated videos contain the input actions.

Formally: Reward R = sum(Indicator(predicted_action == ground_truth_action)) over the trajectory.

Purpose: Ensure training stability by limiting policy updates.

Formally: GRPO objective with KL divergence penalty.
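
To make the second objective concrete, here is a hedged sketch of the standard GRPO ingredients: group-standardized advantages and a clipped surrogate with a KL penalty toward a reference model. The summary does not give the paper's exact formulation, so the clip range, the KL coefficient, the KL estimator, and the use of population standard deviation are all assumptions, and the function names are hypothetical.

```python
import math
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Standardize rewards within one group of rollouts for the same input."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


def grpo_sample_objective(
    logp_new: float,
    logp_old: float,
    logp_ref: float,
    advantage: float,
    clip_eps: float = 0.2,  # assumed value, not reported in the summary
    kl_coef: float = 0.01,  # assumed value, not reported in the summary
) -> float:
    """Clipped policy-ratio surrogate minus a KL penalty (to be maximized)."""
    ratio = math.exp(logp_new - logp_old)
    clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    surrogate = min(ratio * advantage, clipped * advantage)
    # Non-negative KL estimate r - log(r) - 1 with r = pi_ref / pi_new.
    log_r = logp_ref - logp_new
    kl = math.exp(log_r) - log_r - 1.0
    return surrogate - kl_coef * kl


rewards = [16.0, 12.0, 12.0, 8.0]  # e.g. matched steps out of 16 frames
advantages = group_relative_advantages(rewards)
print(advantages)
```

Rollouts that follow the actions better than their group average receive a positive advantage and the rest a negative one, so no separate value network is required.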

Training Data:

VPT dataset (Minecraft gameplay)
Data filtered to remove GUI interactions and static scenes

Key Hyperparameters:

inference_length: 16 frames
sampling_top_p: 0.8 (MineWorld)
diffusion_steps: 18 (NFD)
batch_size: Not explicitly reported in the paper
learning_rate: Not explicitly reported in the paper

Compute: Training conducted on AMD MI300X GPUs. Converges with ~1,000 training samples.

Comparison to Prior Work

vs. RLVR: Applies verifiable rewards to video domain via IDM instead of code execution
vs. Standard RLHF: Uses objective action-consistency reward instead of learned human preference model
vs. Naive Video RL [not cited in paper]: Optimizes action fidelity specifically rather than only visual quality (aesthetics) or text alignment

Limitations

Relies on the accuracy of the Inverse Dynamics Model; if the IDM is flawed, the reward signal is noisy
Currently evaluated primarily on gaming environments (Minecraft) where actions are discrete and well-defined
Does not explicitly model long-horizon planning beyond the context window of the generation
Requires ground-truth actions, limiting applicability to unconditional video generation tasks

Reproducibility

IDM uses publicly available VPT weights. Base models (MineWorld, NFD) are existing pretrained models. The text does not state whether code is available. Evaluation relies on standard metrics (FVD, VBench) and the IDM itself.

📊 Experiments & Results

Evaluation Setup

Interactive game generation (Minecraft) using VPT dataset

Benchmarks:

Minecraft Action Following (Action-conditioned Video Generation)

Metrics:

Action F1
Action Precision
Action Recall
Fréchet Video Distance (FVD)
PSNR
VBench (visual quality)
Statistical methodology: Not explicitly reported in the paper
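
For reference, the three action-following metrics can be computed per action label from IDM-inferred and input actions. This is a sketch under assumptions: the summary does not say how the paper aggregates across action classes, so the function below (hypothetical name `action_prf`) scores a single label.

```python
from typing import Sequence, Tuple


def action_prf(
    inferred: Sequence[str], conditioned: Sequence[str], label: str
) -> Tuple[float, float, float]:
    """Precision, recall, and F1 for one action label over aligned steps."""
    pairs = list(zip(inferred, conditioned))
    tp = sum(1 for p, t in pairs if p == label and t == label)
    fp = sum(1 for p, t in pairs if p == label and t != label)
    fn = sum(1 for p, t in pairs if p != label and t == label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


conditioned = ["jump", "jump", "stand", "jump"]
inferred = ["jump", "stand", "jump", "jump"]
precision, recall, f1 = action_prf(inferred, conditioned, "jump")
print(precision, recall, f1)
```

Here precision penalizes jumps that appear in the video without being requested, while recall penalizes requested jumps that the generated video ignores.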

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
RLIR significantly improves action-following metrics across both autoregressive and diffusion architectures compared to base models.
Minecraft (MineWorld)	Action F1	0.78	0.85	+0.07
Minecraft (NFD)	Action F1	0.65	0.72	+0.07
Visual quality improves alongside action accuracy, likely because better action coherence reduces artifacts.
Minecraft	FVD	Not reported	Not reported	Improved (lower)

Experiment Figures

Sensitivity of IDM to visual artifacts. A manually retouched image (cracks on trunk) causes the IDM to misclassify the action.

Qualitative comparison of generated videos before and after RLIR

Main Takeaways

Consistent 5-10% gains in action-following (F1, Recall, Precision) across both autoregressive and diffusion paradigms
Visual quality (FVD, VBench) improves simultaneously, suggesting that enforcing action consistency helps reduce visual artifacts (e.g., blurring during rapid movement)
The method is data-efficient, converging with only ~1,000 training samples
Human evaluation confirms that quantitative gains align with user preferences for both control and visual fidelity

📚 Prerequisite Knowledge

Prerequisites

World Models (video generation conditioned on actions)
Reinforcement Learning (specifically PPO/GRPO)
Inverse Dynamics Models

Key Terms

IDM: Inverse Dynamics Model—a model that predicts the action taken between two consecutive video frames

World Model: A generative model that simulates an environment by predicting future states (video frames) based on past states and actions

GRPO: Group Relative Policy Optimization—a reinforcement learning algorithm that estimates advantages by comparing a group of outputs for the same input, removing the need for a value network

Autoregressive World Model: Generates video by predicting discrete tokens one by one (e.g., MineWorld)

Diffusion World Model: Generates video by iteratively denoising random noise, often using techniques like Diffusion Forcing (e.g., NFD)

FVD: Fréchet Video Distance—a metric for evaluating the quality and temporal coherence of generated videos

VBench: A comprehensive benchmark suite for evaluating video generation models

VQ-VAE: Vector Quantized Variational AutoEncoder—compresses images into discrete tokens

SDE: Stochastic Differential Equation—mathematical framework used to model the diffusion process