Game-RL: Synthesizing Multimodal Verifiable Game Data to Boost VLMs' General Reasoning

📝 Paper Summary

Vision-Language Reinforcement Learning (RL), Synthetic Data Generation, Visual Reasoning

Game-RL utilizes video games to synthesize verifiable, multimodal reasoning tasks, enabling Vision Language Models to improve general reasoning capabilities through reinforcement learning on these synthetic game scenarios.

Core Problem

Vision Language Models (VLMs) struggle with complex multi-step reasoning, partially because current Vision-Language RL training is limited to narrow domains like geometry and charts.

Why it matters:

Existing RL datasets lack the diversity needed for broad generalization, limiting VLM exploration and learning.
Real-world reasoning tasks are complex and difficult to verify automatically, whereas games offer verifiable mechanics.
Simply evaluating VLMs on games (as done in prior work) misses the opportunity to use games as a rich training ground for reasoning skills.

Concrete Example: In a game like Sokoban, a model must predict where a player ends up after a specific sequence of moves (e.g., 'left, up, down'). Standard VLMs might hallucinate the final position because they lack grounded spatial reasoning training, whereas a game engine can deterministically verify the correct coordinate (2,2).
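The deterministic check described in this example can be sketched in a few lines of Python. This is an illustrative toy, not the paper's engine: the grid layout, the (row, column) coordinate convention, the start position (2, 3), and the helper name `simulate_moves` are all assumptions, chosen so that the move sequence 'left, up, down' ends at (2, 2). Box pushing is omitted for brevity.

```python
# Toy Sokoban-style movement check (hypothetical layout, no box pushing).
GRID = [
    "#####",
    "#...#",
    "#...#",
    "#...#",
    "#####",
]

# (row, col) deltas for each move name.
DELTAS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def simulate_moves(grid, start, moves):
    """Return the player's final (row, col); moves into walls are ignored."""
    row, col = start
    for move in moves:
        d_row, d_col = DELTAS[move]
        new_row, new_col = row + d_row, col + d_col
        if grid[new_row][new_col] != "#":
            row, col = new_row, new_col
    return (row, col)


final_position = simulate_moves(GRID, start=(2, 3), moves=["left", "up", "down"])
print(final_position)  # (2, 2)
```

Because the transition rule is ordinary code, the ground-truth answer is obtained by execution rather than by annotation, which is the property that makes such tasks usable as RL rewards.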

Key Novelty

Code2Logic: Transforming Game Code into Reasoning Logic

Uses LLMs to adapt actual game source code into a 'data engine' that procedurally generates game states, reasoning questions, and verifiable ground-truth answers.
Constructs a massive dataset (GameQA) of 30 games and 158 tasks with controllable difficulty, covering spatial perception, planning, and pattern recognition.
Demonstrates that RL training on these synthetic game tasks transfers to improvements on unrelated general vision-language benchmarks.

Architecture

The Code2Logic pipeline: Steps to synthesize game data. (1) Construct Game Code, (2) Design Task Templates, (3) Build Data Engine.

Evaluation Highlights

+2.33 percentage-point average improvement for Qwen2.5-VL-7B across 7 diverse vision-language benchmarks (e.g., MMMU, MathVista) after training solely on GameQA.
GameQA-trained models outperform geometry-trained baselines (MAVIS, Multimodal-Open-R1) on general benchmarks despite using fewer training samples (5k vs 8k).
Scaling game diversity from 4 to 20 games correlates positively with generalization performance on downstream tasks.

Breakthrough Assessment

8/10

The paper gives strong evidence that synthetic game data improves out-of-domain general reasoning, a significant finding for scaling VLM post-training data. The automated Code2Logic pipeline addresses the data verification bottleneck by deriving ground-truth answers from executable game code.

⚙️ Technical Details

Problem Definition

Setting: Visual Question Answering (VQA) where inputs are game states (images) and text questions, and outputs are reasoning-based answers.

Inputs: Game screenshot image I and text question q

Outputs: Answer text a (e.g., coordinates, next move, object count)

Pipeline Flow

Game Code Construction (LLM generates/retrieves game logic)
Task Template Design (LLM designs QA patterns based on game mechanics)
Data Engine Construction (Program generates batch samples)
RL Training (VLMs trained on generated GameQA data)
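The first three steps of this flow can be illustrated with a toy data engine. The sketch below is a miniature built on assumptions, not the authors' generated code: the "game" is just a random grid of symbols, the only task template is an object-count question, and the names `generate_state`, `render_text`, `make_sample`, and `build_dataset` are hypothetical. A real engine would render an image of the state; here a text rendering stands in for it. The `size` parameter plays the role of a controllable difficulty knob.

```python
import random

SYMBOLS = ["A", "B", "C", "."]  # hypothetical tile types; "." is empty


def generate_state(size, rng):
    """Step 1 stand-in: sample a random size x size game state."""
    return [[rng.choice(SYMBOLS) for _ in range(size)] for _ in range(size)]


def render_text(state):
    """Placeholder for image rendering: one text row per grid row."""
    return "\n".join("".join(row) for row in state)


def make_sample(state, target):
    """Step 2 stand-in: fill a count template and compute the answer from the state."""
    question = f"How many '{target}' tiles are on the board?"
    answer = sum(row.count(target) for row in state)
    return {"state": render_text(state), "question": question, "answer": str(answer)}


def build_dataset(num_samples, size, seed=0):
    """Step 3 stand-in: batch-generate QA samples with verifiable answers."""
    rng = random.Random(seed)
    samples = []
    for _ in range(num_samples):
        state = generate_state(size, rng)
        target = rng.choice(SYMBOLS[:-1])  # never ask about empty tiles
        samples.append(make_sample(state, target))
    return samples


dataset = build_dataset(num_samples=3, size=4)
for sample in dataset:
    print(sample["question"], "->", sample["answer"])
```

Seeding the generator keeps the dataset reproducible, and every answer is computed from the same state that produced the question, so no manual labeling is needed.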

System Modules

Game Code Generator (Data Generation)

Generate Python code for video games defining state space and transition rules

Model or implementation: GPT-4o or Claude 3.5 (used as tool)

Data Engine (Data Generation)

Procedurally generate game states, simulate transitions, and produce QA pairs

Model or implementation: Python Program (synthesized by LLM)

Policy Model

Vision Language Model being trained to solve reasoning tasks

Model or implementation: Qwen2.5-VL-7B / InternVL2.5-8B / Qwen2.5-VL-3B

Reward Evaluator

Verify correctness of the model's answer against ground truth

Model or implementation: Qwen2.5-32B-AWQ (LLM-as-a-judge)

Novel Architectural Elements

Code2Logic pipeline: A systematic method to map executable game code directly to verifiable reasoning logic tasks without manual annotation.

Modeling

Base Model: Qwen2.5-VL-7B (primary), InternVL2.5-8B, Qwen2.5-VL-3B

Training Method: Group Relative Policy Optimization (GRPO)

Objective Functions:

Purpose: Maximize expected reward while keeping the policy close to the reference model.

Formally: Standard GRPO loss with KL-divergence penalty.

Purpose: Assign binary rewards based on answer correctness.

Formally: Reward = 1 if match(prediction, ground_truth) else 0.
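The binary reward and the group-relative normalization that gives GRPO its name can be sketched as follows. This is a simplified illustration: the source reports an LLM-as-a-judge reward evaluator, whereas `binary_reward` here substitutes a normalized string match, and the function names and the `eps` constant are assumptions. The population standard deviation is used; implementations differ on this detail.

```python
from statistics import mean, pstdev


def _normalize(text):
    """Lowercase and drop all whitespace so '(2, 2)' and '(2,2)' compare equal."""
    return "".join(text.lower().split())


def binary_reward(prediction, ground_truth):
    """Return 1.0 on a normalized match, else 0.0 (stand-in for the LLM judge)."""
    return 1.0 if _normalize(prediction) == _normalize(ground_truth) else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of rollouts for the same question."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical rollouts for one question whose answer is "(2, 2)".
rollouts = ["(2, 2)", "(2,2)", "(1, 2)", "(3, 2)"]
rewards = [binary_reward(r, "(2, 2)") for r in rollouts]
advantages = group_relative_advantages(rewards)
print(rewards)     # [1.0, 1.0, 0.0, 0.0]
print(advantages)  # positive for correct rollouts, negative for incorrect ones
```

When every rollout in a group receives the same reward, all advantages are zero and that question contributes no learning signal, which is one reason task difficulty matters for this kind of training.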

Training Data:

GameQA Dataset: 30 games, 158 tasks, ~140K questions.
Split: 20 games for training (In-Domain), 10 games for testing (Out-of-Domain).

Key Hyperparameters:

learning_rate: 2e-7
batch_size: Not reported in the paper (12 rollout samples per question)
epochs: 1
warmup_ratio: 5%
kl_coefficient_beta: 0.04
clip_epsilon: 0.2
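To show where `clip_epsilon` and `kl_coefficient_beta` enter, here is a per-token sketch of the clipped GRPO objective in plain Python. It follows the commonly published GRPO formulation: a PPO-style clipped probability ratio plus a KL penalty toward the reference model, estimated as `ref/new - log(ref/new) - 1`. Whether the paper's implementation matches this in every detail is an assumption, and `grpo_token_objective` is a hypothetical name.

```python
import math


def grpo_token_objective(logp_new, logp_old, logp_ref, advantage,
                         clip_epsilon=0.2, beta=0.04):
    """Per-token objective to maximize: clipped surrogate minus beta * KL estimate."""
    ratio = math.exp(logp_new - logp_old)  # pi_theta / pi_old
    clipped_ratio = max(1.0 - clip_epsilon, min(ratio, 1.0 + clip_epsilon))
    surrogate = min(ratio * advantage, clipped_ratio * advantage)

    # Nonnegative KL estimator between the current policy and the reference.
    log_ref_over_new = logp_ref - logp_new
    kl_estimate = math.exp(log_ref_over_new) - log_ref_over_new - 1.0

    return surrogate - beta * kl_estimate


# When all three policies agree, the objective equals the advantage.
print(grpo_token_objective(-1.0, -1.0, -1.0, advantage=1.0))  # 1.0

# A large ratio with a positive advantage is capped at (1 + clip_epsilon) * advantage.
capped = grpo_token_objective(math.log(2.0), 0.0, math.log(2.0), advantage=1.0)
print(capped)  # 1.2
```

The clip bounds how far a single update can move the policy on any one token, while the beta term penalizes drift from the reference model.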

Compute: Not reported in the paper

Comparison to Prior Work

vs. MAVIS/Multimodal-Open-R1/MultiMath: GameQA consists of synthetic game data rather than math/geometry problems, yet achieves competitive or better generalization on general benchmarks.
vs. SIMA [not cited in paper]: SIMA trains generalist agents to *play* games; Game-RL trains VLMs to *reason* about games via VQA to boost general reasoning capabilities.
vs. O1/DeepSeek-R1 [not cited in paper]: Game-RL applies reasoning-focused RL (similar to O1/R1) but specifically leverages the verifiable nature of video games for the reward signal.

Limitations

Current VLMs perform significantly worse than humans on GameQA, indicating high difficulty.
Reliance on LLMs for code generation and quality checks might introduce biases or errors if not manually verified (though manual verification was part of the process).
The approach focuses on 2D/3D rendered games; applicability to photorealistic or highly noisy real-world video data is not explicitly tested.

Reproducibility

Code: https://github.com/tongjingqi/Game-RL

The repository includes both the code and the dataset. The paper details the prompts used for Code2Logic and provides the GRPO hyperparameters.

📊 Experiments & Results

Evaluation Setup

Zero-shot evaluation on diverse vision-language benchmarks after RL training on GameQA.

Benchmarks:

GameQA Test Set (In-domain and Out-of-domain Game VQA) [New]
MMMU / MMMU-Pro (General Multimodal Understanding)
MathVista / MathVerse / MathVision (Mathematical Reasoning in Visual Contexts)
MMBench (General Visual Understanding)
CharXiv (Chart-based Reasoning)

Metrics:

Accuracy (%)
Statistical methodology: Not explicitly reported in the paper

Key Results

Main generalization results showing performance of Qwen2.5-VL-7B trained on GameQA compared to the base model across 7 benchmarks.
Benchmark	Metric	Baseline (%)	This Paper (%)	Δ (points)
Average (7 General Benchmarks)	Accuracy	57.75	60.08	+2.33
MMMU (Val)	Accuracy	56.44	57.67	+1.23
MathVista	Accuracy	70.30	73.20	+2.90

Comparison with other training datasets showing GameQA's efficiency.
Benchmark	Metric	Baseline (%)	This Paper (%)	Δ (points)
Average (7 General Benchmarks)	Accuracy	58.45	60.08	+1.63

Scaling analysis showing impact of data quantity and diversity.
Benchmark	Metric	Baseline (%)	This Paper (%)	Δ (points)
Average (7 General Benchmarks)	Accuracy	60.08	61.34	+1.26
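
The Δ column is an absolute difference in accuracy points, which is easy to confuse with a relative gain. The short check below recomputes the reported deltas from the baseline and result columns and also derives the relative improvement for each row. The row labels are shortened descriptions of the table rows above, and the relative figures are computed here rather than reported in the source.

```python
# (label, baseline accuracy, accuracy in this paper, reported delta), all in points
ROWS = [
    ("Average vs. base model", 57.75, 60.08, 2.33),
    ("MMMU (Val)", 56.44, 57.67, 1.23),
    ("MathVista", 70.30, 73.20, 2.90),
    ("Average vs. other training datasets", 58.45, 60.08, 1.63),
    ("Average, scaling analysis", 60.08, 61.34, 1.26),
]


def absolute_gain(baseline, result):
    """Difference in accuracy points, rounded to two decimals."""
    return round(result - baseline, 2)


def relative_gain_percent(baseline, result):
    """Gain expressed as a percentage of the baseline accuracy."""
    return round(100.0 * (result - baseline) / baseline, 2)


for label, baseline, result, reported in ROWS:
    # Each recomputed delta should agree with the value listed in the table.
    assert absolute_gain(baseline, result) == reported, label
    print(f"{label}: +{absolute_gain(baseline, result):.2f} points "
          f"({relative_gain_percent(baseline, result):.2f}% relative)")
```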

Experiment Figures

Line chart showing scaling trends: General benchmark performance vs. Number of Training Samples (0 to 20k).

Comparison of generalization ability between training on 4 games vs 20 games (controlling for total sample size).

Main Takeaways

RL training solely on synthetic GameQA data transfers to an average gain of about 2.3 accuracy points on diverse general benchmarks (Math, Charts, General Vision).
GameQA enables stronger out-of-domain generalization than math/geometry-specific datasets (MAVIS, MultiMath) even with fewer samples.
Increasing the diversity of games (from 4 to 20) in the training set positively correlates with generalization performance.
Qualitative analysis shows GRPO training improves the model's ability to recognize visual elements and perform precise multi-step reasoning.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (specifically Policy Optimization)
Vision Language Models (VLMs)
Procedural Content Generation

Key Terms

GRPO: Group Relative Policy Optimization—a reinforcement learning algorithm that normalizes rewards within a group of generated outputs to stabilize training.

VQA: Visual Question Answering—a task where a model must answer a natural language question about an input image.

Sokoban: A classic puzzle game where a player pushes boxes to target locations; used here as a primary example for spatial reasoning tasks.

Code2Logic: The authors' proposed pipeline that converts game source code into a data generation engine for reasoning tasks.

In-Domain vs. Out-of-Domain: In-Domain refers to games seen during training; Out-of-Domain refers to held-out games or completely different benchmarks (like MathVista) used to test generalization.

LLM-as-a-judge: Using a strong Large Language Model to evaluate the correctness of a model's output when simple rule-based matching is insufficient.

Chain-of-Thought: A prompting technique where the model generates intermediate reasoning steps before the final answer.

QA Template: A structured format defining a specific type of question and answer pattern derived from game logic (e.g., 'State Prediction').