Evaluation Setup
An agent navigates the Wikipedia link graph from a source page to a target page, with a maximum of 30 steps per game. Tasks are split into three difficulty levels based on shortest path (SP) length.
Benchmarks:
- LLM-WikiRace Easy (Navigation (SP length 3-4)) [New]
- LLM-WikiRace Medium (Navigation (SP length 5-6)) [New]
- LLM-WikiRace Hard (Navigation (SP length 7-8)) [New]
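The difficulty tiers above can be illustrated with a short sketch. It assumes the link graph is available as a dictionary mapping each page to its outgoing links and uses a plain breadth-first search to get the SP length; the function names and the toy graph are hypothetical and are not taken from the paper.

```python
from collections import deque


def shortest_path_length(graph, source, target):
    """Breadth-first search over a directed link graph.

    Returns the number of hops from source to target, or None if unreachable.
    """
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt == target:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def difficulty(sp_length):
    """Map an SP length to the tiers listed above; None if outside 3-8."""
    if sp_length is None:
        return None
    if 3 <= sp_length <= 4:
        return "Easy"
    if 5 <= sp_length <= 6:
        return "Medium"
    if 7 <= sp_length <= 8:
        return "Hard"
    return None


# Toy chain graph with hypothetical page names: A -> B -> C -> D -> E -> F
toy_graph = {"A": ["B"], "B": ["C"], "C": ["D"], "D": ["E"], "E": ["F"], "F": []}
```

For example, `difficulty(shortest_path_length(toy_graph, "A", "D"))` falls in the Easy tier because the pair is three hops apart.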
Metrics:
- Success Rate
- Suboptimal Steps (steps taken minus shortest path length)
- Cost per game
- Statistical methodology: Linear regression relating looping frequency to success rate (coefficient -1.02; 95% CI reported).
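As a rough illustration of how these metrics might be computed, the sketch below uses a hypothetical `Game` record and made-up numbers. Averaging suboptimal steps over successful games only is an assumption, and the hand-rolled least-squares slope stands in for whatever regression tooling the authors used.

```python
from dataclasses import dataclass


@dataclass
class Game:
    """Hypothetical per-game record; field names are illustrative."""
    reached_target: bool
    steps_taken: int
    shortest_path: int


def success_rate(games):
    """Fraction of games in which the agent reached the target."""
    return sum(g.reached_target for g in games) / len(games)


def mean_suboptimal_steps(games):
    """Average of (steps taken - shortest path) over successful games.

    Restricting to successful games is an assumption of this sketch.
    """
    solved = [g for g in games if g.reached_target]
    if not solved:
        return None
    return sum(g.steps_taken - g.shortest_path for g in solved) / len(solved)


def ols_slope(xs, ys):
    """Slope of a simple least-squares line fitted to (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Made-up games, for illustration only.
games = [Game(True, 4, 3), Game(True, 3, 3), Game(False, 30, 4), Game(True, 6, 4)]
```

A negative value from `ols_slope(looping_frequencies, success_rates)` would correspond to the direction of the reported coefficient: more looping, lower success.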
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Performance drops sharply as difficulty (path length) increases, with even the strongest models failing on Hard tasks. | | | | |
| LLM-WikiRace Hard | Success Rate | 7% | 23% | +16% |
| LLM-WikiRace Easy | Success Rate | Not reported in the paper | 90% | Not reported in the paper |
| Fine-tuning improves performance on short horizons but fails to solve long-horizon planning. | | | | |
| LLM-WikiRace Easy | Success Rate | 22.5% | 67.5% | +45.0% |
| LLM-WikiRace Hard | Success Rate | 0% | 0% | 0% |
| Models outperform the human baseline in terms of path optimality on easy tasks. | | | | |
| Human Gameplay Corpus | Suboptimal Steps | 1.0 | 0.0 | -1.0 |
Main Takeaways
- The 'Planning Gap': Models with similar world knowledge (measured by graph connectivity F1) differ widely in navigation success, indicating that knowledge alone is insufficient for planning.
- Failure Mode: The primary cause of failure on Hard tasks is looping; models recognize they are in a loop but fail to replan effectively to escape.
- Difficulty Stratification: Shortest path length is a robust proxy for difficulty; Easy is solved (>90%), Hard is unsolved (<25%).
- Hub-Seeking: Successful agents employ a human-like strategy of navigating to high-degree nodes (hubs) to broaden reachable topics.
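The looping and hub-seeking behaviours described above can be made concrete with a small sketch. Both functions are hypothetical: the summary does not give the paper's exact definition of looping frequency, so here it is assumed to be the fraction of moves that return to an already visited page, and hub-seeking is approximated by picking the unvisited neighbour with the most outgoing links.

```python
def loop_frequency(path):
    """Fraction of moves that land on a page already visited earlier in the path."""
    if len(path) < 2:
        return 0.0
    seen = {path[0]}
    revisits = 0
    for page in path[1:]:
        if page in seen:
            revisits += 1
        seen.add(page)
    return revisits / (len(path) - 1)


def pick_hub(graph, current, visited):
    """Among unvisited neighbours of current, choose the one with the most out-links.

    Returns None when every neighbour has already been visited.
    """
    candidates = [n for n in graph.get(current, ()) if n not in visited]
    if not candidates:
        return None
    return max(candidates, key=lambda n: len(graph.get(n, ())))


# Toy graph with hypothetical page names.
toy_links = {
    "Start": ["Niche", "Hub"],
    "Niche": ["X"],
    "Hub": ["X", "Y", "Z"],
}
```

Excluding visited pages in `pick_hub` is one simple way to avoid the revisits that `loop_frequency` counts; it is a design choice of this sketch, not a mechanism described in the source.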