GRPO: Group Relative Policy Optimization, a reinforcement learning algorithm that normalizes each output's reward against the other outputs sampled for the same prompt (the group) to stabilize training.
VQA: Visual Question Answering, a task where a model must answer a natural-language question about an input image.
Sokoban: A classic puzzle game where a player pushes boxes to target locations; used here as a primary example for spatial reasoning tasks.
Code2Logic: The authors' proposed pipeline that converts game source code into a data generation engine for reasoning tasks.
In-Domain vs. Out-of-Domain: In-Domain refers to games seen during training; Out-of-Domain refers to held-out games or completely different benchmarks (like MathVista) used to test generalization.
LLM-as-a-judge: Using a strong Large Language Model to evaluate the correctness of a model's output when simple rule-based matching is insufficient.
Chain-of-Thought: A prompting technique where the model generates intermediate reasoning steps before the final answer.
QA Template: A structured format defining a specific type of question and answer pattern derived from game logic (e.g., 'State Prediction').
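
To make the GRPO entry above concrete, the group-wise normalization can be sketched as below. This is a minimal illustration and not the authors' training code: the function name `group_relative_advantages`, the `eps` stabilizer, and the choice of population (rather than sample) standard deviation are assumptions made for the example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of outputs sampled for the same prompt.

    Hypothetical helper: each reward is centered on the group mean and
    scaled by the group standard deviation. `eps` (an assumption) avoids
    division by zero when every output in the group gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an illustrative choice
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one question: two judged correct (1.0), two wrong (0.0).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct answers receive positive advantages and incorrect ones negative, relative only to their own group. When every output in a group earns the same reward, all advantages are zero, so that group contributes no learning signal.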