
Reinforcement Learning with Inverse Rewards for World Model Post-training

Yang Ye, Tianyu He, Shuo Yang, Jiang Bian
arXiv.org (2025)
RL MM

📝 Paper Summary

World Models Video Generation Reinforcement Learning Post-training
RLIR improves the action-following ability of video world models by using an Inverse Dynamics Model to derive verifiable rewards from generated videos, avoiding expensive human annotation.
Core Problem
Current video world models generate high-fidelity visuals but often fail to follow the actions a user specifies (e.g., ignoring a 'jump' command).
Why it matters:
  • Accurate action-following is critical for world models to serve as reliable simulators for agents in gaming or robotics
  • Collecting human preference annotations for video is prohibitively expensive and hard to scale compared to text
  • Rule-based verifiers (used in coding/math LLMs) are generally infeasible for high-dimensional video outputs
Concrete Example: In a Minecraft simulation, if a user inputs a 'dig' action, a standard world model might generate a visually plausible frame where the character merely stands still. The proposed method detects this mismatch: the IDM infers the action 'stand' from the generated video, which disagrees with the input 'dig', so the generation is penalized.
Key Novelty
Reinforcement Learning with Inverse Rewards (RLIR)
  • Uses an Inverse Dynamics Model (IDM) to map generated high-dimensional video back to low-dimensional action space
  • Calculates reward by comparing the IDM-inferred action with the original ground-truth input action
  • Optimizes the world model using Group Relative Policy Optimization (GRPO) based on this objective action-consistency signal
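The three steps above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the exact-match reward, the `toy_idm` stand-in, and the representation of a video as a list of per-step action labels are all assumptions made for the example. The group-relative advantage follows the usual GRPO normalization (reward minus group mean, divided by group standard deviation).

```python
from statistics import mean, pstdev
from typing import Callable, List, Sequence

Action = str
# Assumption: a "video" is stood in for by a list of per-step labels.
Video = Sequence[str]


def inverse_reward(
    video: Video,
    input_actions: Sequence[Action],
    idm: Callable[[Video], List[Action]],
) -> float:
    """Fraction of steps where the IDM-inferred action equals the input action.

    Exact-match scoring is an assumption; the summary only says the reward
    compares the inferred action with the input action.
    """
    if not input_actions:
        raise ValueError("input_actions must be non-empty")
    inferred = idm(video)
    if len(inferred) != len(input_actions):
        raise ValueError("IDM output and input actions differ in length")
    matches = sum(a == b for a, b in zip(inferred, input_actions))
    return matches / len(input_actions)


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """GRPO-style advantages: normalize each reward within its sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def toy_idm(video: Video) -> List[Action]:
    # Hypothetical IDM: the toy "frames" already name the action they depict.
    return list(video)


# One conditioning action sequence, four sampled generations (one group).
input_actions = ["dig", "dig", "jump", "dig"]
group = [
    ["dig", "dig", "jump", "dig"],         # follows every action
    ["stand", "stand", "jump", "dig"],     # ignores the first two 'dig' commands
    ["stand", "stand", "stand", "stand"],  # ignores everything
    ["dig", "dig", "stand", "dig"],        # ignores the 'jump' command
]

rewards = [inverse_reward(v, input_actions, toy_idm) for v in group]
advantages = group_relative_advantages(rewards)
```

Generations that follow the input actions more closely receive positive advantages and the rest receive negative ones, so the policy update favors action-consistent samples without any human labels or a learned video reward model.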
Evaluation Highlights
  • +5-10% improvement in action-following metrics (F1, Precision, Recall) across autoregressive and diffusion world models
  • Up to +10% improvement in visual quality metrics (FVD, VBench) despite optimizing for action accuracy
  • Higher human preference scores for both action-following and visual quality compared to base models
Breakthrough Assessment
8/10
First post-training method specifically designed for action-following in video world models. It elegantly bypasses the need for video reward models or human labeling by leveraging inverse dynamics.