Mastering Board Games by External and Internal Planning with Language Models

📝 Paper Summary

System 2 Reasoning in LLMs, Neurosymbolic Search, Game-Playing Agents

The paper introduces Multi-Action-Value (MAV) models that integrate world modeling, value functions, and policies to perform search-based planning in board games, either by guiding external MCTS or by internalizing search trees into the model's context.

Core Problem

LLMs struggle with reliable multi-step planning and reasoning in complex domains like board games because they lack consistent world models and the ability to reason over possible futures.

Why it matters:

Games clearly expose the inability of current LLMs to reason consistently over future states, which makes them a critical testbed for general planning
Standard LLMs rely on associative System 1 inference, which is prone to hallucinations, rather than the deliberate System 2 planning that reliability requires
Existing game-playing LLMs often rely on external engines for legal move generation and state tracking, limiting their standalone reasoning capabilities

Concrete Example: In a winning chess position, a standard LLM might play aimlessly or hallucinate an illegal move because it cannot look ahead to see the checkmate. MAV uses %top_k to evaluate moves and %best_action to force a decisive win, effectively 'seeing' the win via search.

Key Novelty

Multi-Action-Value (MAV) Model for Unified Planning

Trains a single Transformer to act simultaneously as a world model (state tracking), policy, and value function for board games.
Introduces 'External Search' where the MAV guides MCTS without any external game engine, using its own predictions for state transitions and legality.
Introduces 'Internal Search' by training the model on linearized text representations of search trees, effectively distilling the search process into the model's own token generation.

Architecture

The input/output format for the Multi-Action-Value (MAV) model.

Evaluation Highlights

MAV (2.7B) reaches Grandmaster-level performance in Blitz chess (Elo ~2905 against bots), significantly outperforming raw policy networks.
Internal Search (MAV-IS) achieves an internal Elo of 2673 in Chess, outperforming the base MAV model (2568) without external search.
External Search with MAV beats State-of-the-Art (SOTA) Chess LLMs like Grandmaster-Pro and typical engines like Stockfish 16 (at very low node counts).

Breakthrough Assessment

8/10

Demonstrates that LLMs can internalize the entire search loop (world model + search) without external engines, reaching SOTA chess performance with relatively small models (2.7B).

⚙️ Technical Details

Problem Definition

Setting: Perfect-information board games (Chess, Chess960, Connect Four, Hex) treated as sequence modeling tasks.

Inputs: Textual command header + current board state representation (e.g., FEN or token grid) + optional history.

Outputs: Predicted next state, list of legal moves, win-probability buckets for moves, or best action token.

Pipeline Flow

Input Processing (State formatting)
MAV Inference (Generates candidates/values OR Internal Search Trace)
Search Execution (External MCTS or Internal Generation)
Action Selection (Best move choice)

System Modules

MAV Base Model

Predicts legal moves, state transitions, and value buckets for moves.

Model or implementation: Gemini-based Decoder-only Transformer (2.7B parameters)

External MCTS Engine

Orchestrates search using MAV predictions. Manages the tree structure.

Model or implementation: Algorithm (Async MCTS with Dynamic Virtual Counts)

Internal Search Generator

Generates a text-based search tree (depth-first traversal) directly in context.

Model or implementation: MAV-IS (Fine-tuned MAV)

Novel Architectural Elements

Integration of world model (state tracking), policy, and value estimation into a single Transformer forward pass via specific command tokens.
Dynamic Virtual Counts mechanism for Async MCTS to balance exploration/exploitation when using expensive LLM evaluations.

Modeling

Base Model: Gemini architecture (Decoder-only Transformer), 2.7B parameters (MAV) and 1B parameters (MAV-small).

Training Method: Supervised Fine-Tuning (SFT) on game data and search trees.

Objective Functions:

Purpose: Minimize prediction error for tokens (standard language modeling).

Formally: Cross-entropy loss on masked inputs.
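As a rough illustration of the objective above, the following is a minimal pure-Python sketch of a masked cross-entropy loss. It assumes the mask flags which positions contribute to the loss (for example, only the tokens the model is expected to produce); the summary does not say exactly which tokens are masked, and the function name and argument layout are hypothetical.

```python
import math

def masked_cross_entropy(logits, targets, mask):
    """Mean negative log-likelihood over positions where mask is 1.

    logits:  list of per-position score lists (one score per vocabulary token)
    targets: list of target token ids, one per position
    mask:    list of 0/1 flags; 0 excludes the position from the loss
             (assumption: the paper's exact masking scheme is not given here)
    """
    total, count = 0.0, 0
    for scores, target, keep in zip(logits, targets, mask):
        if not keep:
            continue
        # Numerically stable log-sum-exp: subtract the max before exponentiating.
        peak = max(scores)
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
        total += log_norm - scores[target]
        count += 1
    return total / count if count else 0.0

# Two positions, vocabulary of size 2; the second position is masked out.
loss = masked_cross_entropy([[0.0, 0.0], [5.0, 0.0]], [0, 1], [1, 0])
print(loss)
```

With uniform scores at the only unmasked position, the loss equals the natural log of the vocabulary size, which is a quick sanity check for any implementation.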

Training Data:

Chess: 18M games (Lichess), annotated with Stockfish 16.
Chess960: 250k games, annotated with Stockfish 16.
Connect Four: Solved dataset (Tromp).
Hex: 200k self-play games (MoHex), annotated with neurobenzene.

Key Hyperparameters:

batch_size: 512 (Internal Search Fine-tuning)
training_steps: 20,000 (Internal Search Fine-tuning)
epochs: 1.9 (MAV pre-training)
value_buckets: 64 discrete tokens
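To make the 64-bucket value representation concrete, here is a small sketch that maps a win probability to a bucket index and back. It assumes uniform-width buckets over [0, 1] and decodes to the bucket midpoint; the summary does not state how the paper spaces or names its buckets, so both function names and the binning scheme are illustrative only.

```python
NUM_BUCKETS = 64  # from the hyperparameter list; uniform spacing is an assumption

def win_prob_to_bucket(p, num_buckets=NUM_BUCKETS):
    """Map a win probability in [0, 1] to a bucket index in [0, num_buckets - 1]."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("win probability must lie in [0, 1]")
    # p == 1.0 would land on index num_buckets, so clamp to the last bucket.
    return min(int(p * num_buckets), num_buckets - 1)

def bucket_to_win_prob(bucket, num_buckets=NUM_BUCKETS):
    """Decode a bucket index to the midpoint of its probability interval."""
    return (bucket + 0.5) / num_buckets

print(win_prob_to_bucket(0.5), bucket_to_win_prob(32))
```

Discretizing values this way lets a language model emit a value estimate as a single token, at the cost of a quantization error of at most half a bucket width.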

Compute: Not reported in the paper

Comparison to Prior Work

vs. Grandmaster-Pro: MAV integrates world modeling and search directly, whereas GM-Pro typically relies on raw policy or external engine calls.
vs. Stockfish: MAV uses a neural world model and is a general-purpose sequence model, whereas Stockfish is a highly optimized hand-crafted search engine.
vs. AlphaZero: MAV is trained via supervised learning on annotated games/values rather than self-play RL, and handles state transitions textually.

Limitations

Heavy inference cost of LLMs makes MCTS slower than specialized engines.
Internal search context length limits the depth and breadth of the search tree.
Reliance on Stockfish/solver annotations for training data (supervised distillation) rather than pure self-play discovery.

Reproducibility

MAV-small (1B) is playable online. Datasets are described (Lichess, etc.), but the specific processed training files are not explicitly linked. The code URL points to a demo/landing page, not a repository.

📊 Experiments & Results

Evaluation Setup

Head-to-head matches between agents (Games League) to compute Elo ratings.

Benchmarks:

Chess (Blitz & Standard) (Board Game)
Chess960 (Board Game)
Connect Four (Board Game)
Hex (Board Game)

Metrics:

Elo Rating (Internal and External)
Win Rate
Statistical methodology: Not explicitly reported in the paper

Key Results

Performance of MAV models in Chess compared to baselines and ablations.
Benchmark	Metric	Baseline	This Paper	Δ
Chess (Lichess Blitz)	External Elo	2572	2905	+333
Chess	Internal Elo	2568	2830	+262
Chess (MAV-IS, Internal Search)	Internal Elo	2568	2673	+105

Generalization to other games (Connect Four, Hex, Chess960).
Benchmark	Metric	Baseline	This Paper	Δ
Connect Four	Win Rate vs Perfect Solver	13	39	+26
Hex (9x9)	Internal Elo	73	314	+241

Experiment Figures

An example of Internal Search trace generated by MAV-IS.

Elo ratings of MAV with External Search as a function of simulation budget.

Main Takeaways

Search-based planning (both internal and external) consistently improves performance across all tested games (Chess, Connect Four, Hex).
MAV successfully acts as a standalone world model, removing the need for external game engines during search.
Internal Search (generating the tree in-context) yields significant gains over the base policy but currently underperforms full external MCTS due to context length constraints.
The model creates strong baselines for System 2 reasoning in LLMs using board games as a testbed.

📚 Prerequisite Knowledge

Prerequisites

Monte Carlo Tree Search (MCTS)
Transformer architecture (Decoder-only)
Reinforcement Learning (Policy/Value functions)
Chess/Board Game terminology (Elo, FEN)

Key Terms

MAV: Multi-Action-Value model—a Transformer trained to be a policy, value function, and world model simultaneously.

External Search: Using the LLM to guide a traditional search algorithm (like MCTS), where the LLM provides priors and values.

Internal Search: Training the LLM to generate a linearized text representation of a search tree in its context window to select the best move.

PUCT: Predictor + Upper Confidence Bound applied to Trees—a standard algorithm for selecting nodes during MCTS.
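For reference, a widely used AlphaZero-style form of the PUCT selection rule is shown below. Here Q(s,a) is the mean value of action a in state s, P(s,a) is the policy prior, N(s,a) is the visit count, and c_puct is an exploration constant. This is the common textbook form; the exact variant and constants used in the paper are not given in this summary.

```latex
a^{*} = \arg\max_{a} \left[ Q(s,a) + c_{\text{puct}} \, P(s,a) \, \frac{\sqrt{\sum_{b} N(s,b)}}{1 + N(s,a)} \right]
```

The second term shrinks as an action accumulates visits, so search shifts from prior-driven exploration toward value-driven exploitation.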

Virtual Counts: A technique in parallel MCTS to temporarily increase visit counts of nodes being evaluated to encourage diversity in simultaneous simulations.
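The effect of virtual counts can be sketched in a few lines. The example below uses a fixed virtual count added to a node each time it is picked within a batch, combined with a standard PUCT score. It is a simplified illustration: the paper's Dynamic Virtual Counts mechanism is not specified in this summary, and the function names, the constant `c_puct=1.5`, and the example numbers are all assumptions.

```python
import math

def puct_select(priors, values, visits, virtual, c_puct=1.5):
    """Return the child index with the highest PUCT score.

    Virtual counts are added to the real visit counts, which shrinks the
    exploration bonus of children that already have evaluations in flight.
    """
    effective = [n + v for n, v in zip(visits, virtual)]
    total = sum(effective)

    def score(i):
        return values[i] + c_puct * priors[i] * math.sqrt(total) / (1 + effective[i])

    return max(range(len(priors)), key=score)

def select_batch(priors, values, visits, batch_size, virtual_count):
    """Pick batch_size children for parallel evaluation (fixed virtual count)."""
    virtual = [0] * len(priors)
    picked = []
    for _ in range(batch_size):
        i = puct_select(priors, values, visits, virtual)
        picked.append(i)
        virtual[i] += virtual_count  # temporary; removed once the evaluation returns
    return picked

priors, values, visits = [0.5, 0.4, 0.1], [0.5, 0.5, 0.5], [4, 4, 4]
print(select_batch(priors, values, visits, 2, virtual_count=0))
print(select_batch(priors, values, visits, 2, virtual_count=3))
```

Without virtual counts, both slots in the batch go to the same child; with them, the second slot is diverted to the next most promising child, which is the diversity the technique is meant to provide when each evaluation is an expensive model call.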

Centipawn: A unit of measure used in chess engines to evaluate the advantage of one side (100 centipawns = 1 pawn).

FEN: Forsyth-Edwards Notation—a standard text format for describing a particular board position of a chess game.
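As a quick illustration of the format, the sketch below expands the piece-placement field of a FEN string into eight 8-character rows, using "." for empty squares. The helper name is hypothetical; the FEN string is the standard chess starting position.

```python
def fen_to_rows(fen):
    """Expand the piece-placement field of a FEN string into 8 board rows."""
    placement = fen.split()[0]  # first field: ranks 8 down to 1, separated by "/"
    rows = []
    for rank in placement.split("/"):
        row = ""
        for ch in rank:
            # A digit encodes that many consecutive empty squares.
            row += "." * int(ch) if ch.isdigit() else ch
        rows.append(row)
    return rows

START = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1"
for row in fen_to_rows(START):
    print(row)
```

The remaining FEN fields (side to move, castling rights, en passant square, and move counters) are ignored here but are part of the full state a world model has to track.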

Elo: A rating system used to calculate the relative skill levels of players in zero-sum games.
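To give a feel for what a rating gap means, the standard logistic Elo formula converts a rating difference into an expected score (win = 1, draw = 0.5, loss = 0). The sketch below applies it to the two external Elo figures from the Key Results table, 2905 and 2572; whether the paper's internal Elo computation uses exactly this formula is an assumption.

```python
def expected_score(rating_a, rating_b):
    """Expected score of player A against player B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

# A 333-point gap corresponds to an expected score of roughly 0.87 for the stronger side.
print(round(expected_score(2905, 2572), 2))
```

Equal ratings give an expected score of 0.5, and the two players' expected scores always sum to 1.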

Linearized Tree: Representing the branching structure of a search tree as a flat sequence of tokens (e.g., depth-first traversal) for LLM training.
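A depth-first linearization can be sketched as follows. The tree is a nested tuple of (move, value, children), and the output uses parentheses to mark where each subtree starts and ends. The token format, moves, and values here are purely illustrative and are not the MAV-IS trace format, which this summary does not reproduce.

```python
def linearize(node):
    """Depth-first, pre-order flattening of a (move, value, children) tree."""
    move, value, children = node
    tokens = ["(", move, f"v={value}"]
    for child in children:
        tokens.extend(linearize(child))
    tokens.append(")")  # closing marker lets a reader recover the branching
    return tokens

# Hypothetical two-ply tree: values are made-up win probabilities.
tree = ("root", 0.55, [
    ("e2e4", 0.58, [("e7e5", 0.52, [])]),
    ("d2d4", 0.54, []),
])
print(" ".join(linearize(tree)))
```

Because every node and its delimiters become tokens, the sequence length grows with the number of nodes visited, which is why context length bounds the size of a search tree expressed this way.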