LLM-WikiRace: Benchmarking Long-term Planning and Reasoning over Real-World Knowledge Graphs

📝 Paper Summary

Web agents | Planning benchmarks | Knowledge graph navigation

LLM-WikiRace forces agents to navigate Wikipedia step-by-step, revealing that even frontier models struggle with replanning and loop avoidance on hard tasks despite possessing sufficient world knowledge.

Core Problem

Existing planning benchmarks are typically synthetic or highly structured (e.g., Blocksworld), failing to test how LLMs leverage world knowledge to reason and replan in large, partially observable real-world environments.

Why it matters:

Synthetic environments do not capture the semantic richness and uncertainty of real-world information spaces
Current evaluations mask the distinction between having knowledge (memorization) and using it operationally for multi-step navigation
Frontier models show a 'planning gap' where strong knowledge does not translate to success in long-horizon tasks requiring error recovery

Concrete Example: In a Hard difficulty game (path length 7-8), Gemini 3 often identifies a 'hub-seeking' strategy but gets stuck in a loop visiting the same pages repeatedly, failing to revise its plan despite explicitly noting the repetition in its reasoning trace.

Key Novelty

LLM-WikiRace Benchmark

Adapts the 'WikiRace' game into an interactive agent benchmark where models must navigate Wikipedia hyperlinks to reach a target page without seeing the full graph
Decouples 'world knowledge' from 'planning ability' by comparing navigation success against a direct graph-connectivity classification task
Identifies a 'planning gap' where models with comparable knowledge fail due to specific behavioral flaws like inability to recover from loops

Architecture

Illustration of the WikiRace task structure and the agent-environment interaction loop.

Evaluation Highlights

Gemini 3 achieves <23% success on the Hard split (path length 7-8) compared to >90% on the Easy split, showing current models fail at long horizons
Looping frequency is strongly and negatively associated with success rate (linear regression coefficient -1.02), identifying replanning failure as a primary bottleneck
Fine-tuning with DAPO improves success on Easy tasks (+45.0 percentage points for Qwen-2.5-7B, from 22.5% to 67.5%) but yields no improvement on Hard tasks, where success stays at 0%

Breakthrough Assessment

8/10

Provides a realistic, non-synthetic planning benchmark that exposes clear failures (loops) in frontier models (GPT-5, Gemini 3) that standard benchmarks miss.

⚙️ Technical Details

Problem Definition

Setting: Interactive goal-directed navigation over the Wikipedia hyperlink graph (Snapshot: June 2025)

Inputs: Current page title, Target page title, Traversal history, List of candidate outgoing links (filtered)

Outputs: Selection of one outgoing link to transition to the next page

Pipeline Flow

Environment Initialization (Source/Target pair)
Agent Observation (Current Page + Links)
Agent Action Selection (Choose Link)
State Transition & Budget Check
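
The four pipeline steps above can be sketched as a single episode loop. This is a minimal illustration, not the benchmark's actual harness: the names `run_episode` and `choose_link`, the dictionary graph representation, and the rule that an invalid choice ends the game are assumptions made for the example. Only the 30-step budget is taken from the benchmark description.

```python
from typing import Callable, Dict, List, Tuple

Graph = Dict[str, List[str]]  # page title -> titles of outgoing links
Policy = Callable[[str, str, List[str], List[str]], str]


def run_episode(graph: Graph, source: str, target: str,
                choose_link: Policy, max_steps: int = 30) -> Tuple[bool, List[str]]:
    """Play one game and return (success, list of visited pages)."""
    current = source
    history = [source]
    for _ in range(max_steps):
        if current == target:
            return True, history
        candidates = graph.get(current, [])
        if not candidates:
            return False, history  # dead end: no outgoing links
        choice = choose_link(current, target, history, candidates)
        if choice not in candidates:
            return False, history  # assumption: an invalid action ends the game
        current = choice
        history.append(current)
    # Budget exhausted: succeed only if the last move landed on the target.
    return current == target, history


# Hypothetical toy graph and two stand-in policies (no LLM involved).
toy_graph = {"A": ["B", "C"], "B": ["A", "D"], "C": ["A"], "D": []}


def prefer_unvisited(current, target, history, candidates):
    if target in candidates:
        return target
    for link in candidates:
        if link not in history:
            return link
    return candidates[0]


def always_first(current, target, history, candidates):
    return candidates[0]  # bounces between A and B forever


success, path = run_episode(toy_graph, "A", "D", prefer_unvisited)
looped_success, looped_path = run_episode(toy_graph, "A", "D", always_first)
```

The second policy never leaves the A-B cycle, so it exhausts the step budget without reaching the target, which is the looping failure mode the benchmark is designed to surface.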

System Modules

Environment Filter

Restrict available actions to manage prompt size

Model or implementation: Oracle Graph Search (Dijkstra/BFS)

Agent

Select the next link to click based on current state and history

Model or implementation: Evaluated LLM (e.g., Gemini 3, GPT-5)

Novel Architectural Elements

Evaluation protocol uses oracle-filtered action spaces (top-50 by distance) to allow long-horizon evaluation within context limits while preserving planning difficulty
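
One way to read "top-50 by distance" is sketched below: run a breadth-first search backwards from the target to get every page's distance to it, then keep the 50 outgoing links with the smallest distance. The function names, the alphabetical tie-breaking, and the handling of unreachable links are assumptions for illustration; the summary does not specify these details.

```python
from collections import deque
from typing import Dict, List

Graph = Dict[str, List[str]]  # page title -> titles of outgoing links


def distances_to_target(graph: Graph, target: str) -> Dict[str, int]:
    """BFS over reversed edges: number of clicks from each page to target."""
    reverse: Dict[str, List[str]] = {}
    for page, links in graph.items():
        for link in links:
            reverse.setdefault(link, []).append(page)
    dist = {target: 0}
    queue = deque([target])
    while queue:
        node = queue.popleft()
        for pred in reverse.get(node, []):
            if pred not in dist:
                dist[pred] = dist[node] + 1
                queue.append(pred)
    return dist


def filter_links(links: List[str], dist: Dict[str, int], k: int = 50) -> List[str]:
    """Keep the k links closest to the target (ties broken alphabetically)."""
    unreachable = float("inf")
    ranked = sorted(links, key=lambda link: (dist.get(link, unreachable), link))
    return ranked[:k]


# Hypothetical toy graph.
toy_graph = {"A": ["B", "C"], "B": ["D"], "C": ["B"], "D": ["A"]}
toy_dist = distances_to_target(toy_graph, "D")
closest = filter_links(["C", "B"], toy_dist, k=1)
```

Note that this sketch returns links in distance order. A harness that showed the list in that order would leak the oracle's ranking to the agent, so the kept links would presumably need to be shuffled or re-sorted before being placed in the prompt.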

Modeling

Base Model: Qwen-2.5-7B-Instruct (for fine-tuning experiments)

Training Method: Decoupled Clip and Dynamic Sampling Policy Optimization (DAPO)

Objective Functions:

Purpose: Optimize policy based on rollout correctness.

Formally: Policy update using advantages computed from correct/optimal link choices vs. incorrect ones.
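
As a rough illustration of how a DAPO-style update turns rollout correctness into a training signal, the sketch below shows three ingredients generally associated with the method: group-normalized advantages, dropping prompts whose rollouts all received the same reward (dynamic sampling), and an asymmetric clip range (decoupled clip). The function names, the use of population standard deviation, and the clip values 0.2 and 0.28 are assumptions for illustration; the summary does not report the exact objective or hyperparameters used.

```python
import statistics
from typing import List, Optional


def group_advantages(rewards: List[float]) -> Optional[List[float]]:
    """Normalize rewards within one prompt's group of rollouts.

    Returns None when every rollout got the same reward: such a group
    carries no learning signal and is skipped (dynamic sampling).
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0.0:
        return None
    return [(r - mean) / std for r in rewards]


def clipped_term(ratio: float, advantage: float,
                 eps_low: float = 0.2, eps_high: float = 0.28) -> float:
    """Per-token surrogate with separate lower and upper clip bounds."""
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped * advantage)


# 16 rollouts for one prompt, 4 of which reached the target (reward 1).
rewards = [1.0] * 4 + [0.0] * 12
advantages = group_advantages(rewards)
skipped = group_advantages([0.0] * 16)  # all rollouts failed
```

If every rollout on a Hard pair fails, the group is skipped and contributes no gradient, which is one plausible reading of why fine-tuning on short-distance pairs does not transfer to the Hard split.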

Training Data:

1,000 source-target pairs (shortest-path distance 2-6), disjoint from the evaluation set
16 rollouts generated per prompt during training

Key Hyperparameters:

fine_tuning_steps: 300
rollouts_per_prompt: 16

Compute: Not reported in the paper

Comparison to Prior Work

vs. Blocksworld: Requires extensive semantic world knowledge and operates over a massive state space (549k nodes)
vs. QA Benchmarks (MMLU): Tests operational use of knowledge for navigation/planning rather than static fact retrieval
vs. WebShop [not cited in paper]: Focuses specifically on navigating Wikipedia's hyperlink graph rather than on general e-commerce interaction

Limitations

Action space filtering (top-50 oracle links) simplifies the task compared to raw web navigation
Benchmark relies on a static Wikipedia snapshot, potentially drifting from live web knowledge
Hard split success rates are extremely low (<25%), limiting granular comparison of top models in that regime

Reproducibility

Code: https://llmwikirace.github.io

Benchmark code and leaderboard available at https://llmwikirace.github.io. Wikipedia snapshot dated 23 June 2025. Proprietary models (GPT-5, Gemini 3, Claude Opus 4.5) are closed-source. Fine-tuning experiments use open-source Qwen-2.5-7B.

📊 Experiments & Results

Evaluation Setup

The agent navigates the Wikipedia graph from a source page to a target page within a budget of 30 steps. Three difficulty levels are defined by shortest-path (SP) length.

Benchmarks:

LLM-WikiRace Easy (navigation, SP length 3-4) [New]
LLM-WikiRace Medium (navigation, SP length 5-6) [New]
LLM-WikiRace Hard (navigation, SP length 7-8) [New]
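
The mapping from a source-target pair to one of the three splits can be sketched as a shortest-path computation followed by a range check. The function names and the toy chain graph are hypothetical; the SP ranges are the ones listed above.

```python
from collections import deque
from typing import Dict, List, Optional

Graph = Dict[str, List[str]]  # page title -> titles of outgoing links


def shortest_path_length(graph: Graph, source: str, target: str) -> Optional[int]:
    """Number of clicks on a shortest path, or None if target is unreachable."""
    if source == target:
        return 0
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                if nxt == target:
                    return dist[nxt]
                queue.append(nxt)
    return None


def difficulty_split(sp_len: Optional[int]) -> Optional[str]:
    """Map a shortest-path length to a split name; None if outside all ranges."""
    if sp_len is None:
        return None
    if 3 <= sp_len <= 4:
        return "easy"
    if 5 <= sp_len <= 6:
        return "medium"
    if 7 <= sp_len <= 8:
        return "hard"
    return None


# Hypothetical chain p0 -> p1 -> ... -> p8.
chain = {f"p{i}": [f"p{i + 1}"] for i in range(8)}
easy_split = difficulty_split(shortest_path_length(chain, "p0", "p3"))
hard_split = difficulty_split(shortest_path_length(chain, "p0", "p7"))
```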

Metrics:

Success Rate
Suboptimal Steps (steps taken - shortest path)
Cost per game
Statistical methodology: Linear regression of success rate on looping frequency (coefficient -1.02, 95% CI reported).
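
The metrics above can be computed from a logged trajectory roughly as follows. The definition of looping frequency used here (share of moves that land on an already-visited page) and the choice of data points for the regression are assumptions; the summary does not define either precisely.

```python
from typing import List, Sequence


def suboptimal_steps(path: List[str], shortest_path_len: int) -> int:
    """Moves taken beyond the shortest path (path includes the source page)."""
    return (len(path) - 1) - shortest_path_len


def loop_frequency(path: List[str]) -> float:
    """Assumed definition: fraction of moves that revisit an earlier page."""
    seen = set()
    revisits = 0
    for page in path:
        if page in seen:
            revisits += 1
        seen.add(page)
    moves = len(path) - 1
    return revisits / moves if moves else 0.0


def ols_slope(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Slope of the least-squares line y = a + b * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Made-up numbers purely to exercise the functions, not results from the paper.
extra = suboptimal_steps(["A", "B", "D"], shortest_path_len=2)
loops = loop_frequency(["A", "B", "A", "B", "C"])
slope = ols_slope([0.0, 0.5, 1.0], [1.0, 0.5, 0.0])
```

A slope near -1 means that each additional unit of looping frequency is matched by roughly one unit less success rate, which is how the reported coefficient of -1.02 can be read.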

Key Results

Benchmark	Metric	Baseline	This Paper	Δ (percentage points for rates)
Performance drops sharply as difficulty (path length) increases, with even the strongest models failing on Hard tasks.
LLM-WikiRace Hard	Success Rate	7%	23%	+16 pp
LLM-WikiRace Easy	Success Rate	Not reported in the paper	90%	Not reported in the paper
Fine-tuning improves performance on short horizons but fails to solve long-horizon planning.
LLM-WikiRace Easy	Success Rate	22.5%	67.5%	+45.0 pp
LLM-WikiRace Hard	Success Rate	0%	0%	0 pp
Models outperform the human baseline in terms of path optimality on easy tasks.
Human Gameplay Corpus	Suboptimal Steps	1.0	0.0	-1.0

Experiment Figures

Scatter plot of 'World Knowledge F1 Score' vs. 'WikiRace Success Rate' for various models.
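
The x-axis of this plot comes from the direct graph-connectivity classification task. Assuming that task asks the model for a yes/no judgment on whether one page links to another, the F1 score could be computed as in the sketch below. The function name and the binary label encoding are assumptions; the summary does not describe how pairs are sampled or prompted.

```python
from typing import Sequence


def f1_score(labels: Sequence[int], predictions: Sequence[int]) -> float:
    """Binary F1 where 1 means 'linked' and 0 means 'not linked'."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up labels and predictions, for illustration only.
knowledge_f1 = f1_score([1, 1, 0, 0], [1, 0, 1, 0])
```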

Main Takeaways

The 'Planning Gap': Models with similar world knowledge (graph connectivity F1) exhibit vastly different navigation success, indicating that knowledge alone is insufficient for planning.
Failure Mode: The primary cause of failure on Hard tasks is looping; models recognize they are in a loop but fail to replan effectively to escape.
Difficulty Stratification: Shortest path length is a robust proxy for difficulty; Easy is solved (>90%), Hard is unsolved (<25%).
Hub-Seeking: Successful agents employ a human-like strategy of navigating to high-degree nodes (hubs) to broaden reachable topics.

📚 Prerequisite Knowledge

Prerequisites

Understanding of graph traversal (BFS/DFS) vs. agentic navigation
Knowledge of LLM fine-tuning methods (RL/DAPO)
Familiarity with partial observability in RL

Key Terms

WikiRace: A game where players traverse from a source Wikipedia page to a target page using only internal hyperlinks

planning gap: The discrepancy between a model's static world knowledge (knowing concepts are connected) and its ability to operationally use that knowledge for multi-step planning

hub-seeking: A strategy of navigating to broad, highly connected pages (hubs) to increase the likelihood of finding a link to the target

DAPO: Decoupled Clip and Dynamic Sampling Policy Optimization—a reinforcement learning fine-tuning method used here to train smaller models

strongly connected component: A subgraph where every node is reachable from every other node; the benchmark uses the largest such component of Wikipedia

suboptimal steps: The number of steps taken in excess of the theoretical shortest path between the source and target

SFT: Supervised Fine-Tuning—training on demonstrated correct paths before applying RL
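
To make the strongly connected component term concrete, the sketch below extracts the largest such component from a small directed graph using Kosaraju's two-pass algorithm. This is a generic textbook technique chosen for illustration; the summary does not say how the benchmark computes the component, and the function name and toy graph are hypothetical.

```python
from typing import Dict, List, Set

Graph = Dict[str, List[str]]  # page title -> titles of outgoing links


def largest_scc(graph: Graph) -> Set[str]:
    """Return the node set of the largest strongly connected component."""
    nodes = set(graph)
    for links in graph.values():
        nodes.update(links)

    # Pass 1: iterative DFS on the original graph, recording finish order.
    visited: Set[str] = set()
    order: List[str] = []
    for start in nodes:
        if start in visited:
            continue
        visited.add(start)
        stack = [(start, iter(graph.get(start, [])))]
        while stack:
            node, neighbours = stack[-1]
            advanced = False
            for nxt in neighbours:
                if nxt not in visited:
                    visited.add(nxt)
                    stack.append((nxt, iter(graph.get(nxt, []))))
                    advanced = True
                    break
            if not advanced:
                order.append(node)
                stack.pop()

    # Build the reversed graph.
    reverse: Dict[str, List[str]] = {}
    for page, links in graph.items():
        for link in links:
            reverse.setdefault(link, []).append(page)

    # Pass 2: DFS on the reversed graph in decreasing finish order.
    assigned: Set[str] = set()
    best: Set[str] = set()
    for start in reversed(order):
        if start in assigned:
            continue
        component = {start}
        assigned.add(start)
        stack2 = [start]
        while stack2:
            node = stack2.pop()
            for pred in reverse.get(node, []):
                if pred not in assigned:
                    assigned.add(pred)
                    component.add(pred)
                    stack2.append(pred)
        if len(component) > len(best):
            best = component
    return best


# Hypothetical toy graph: A, B, C form a cycle; D is a dead end.
core = largest_scc({"A": ["B"], "B": ["C"], "C": ["A", "D"], "D": []})
```

Restricting source and target pages to one strongly connected component guarantees that a path exists in both directions between any pair, so no game is unsolvable by construction.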