
Ghost in the Minecraft: Generally Capable Agents for Open-World Environments via Large Language Models with Text-based Knowledge and Memory

Xizhou Zhu, Yuntao Chen, Hao Tian, Chenxin Tao, Weijie Su, Chenyu Yang, Gao Huang, Bin Li, Lewei Lu, Xiaogang Wang, Yu Qiao, Zhaoxiang Zhang, Jifeng Dai
Tsinghua University, SenseTime Research, Shanghai Artificial Intelligence Laboratory, Institute of Automation, Chinese Academy of Sciences
arXiv.org (2023)
Agent · Memory · Reasoning

📝 Paper Summary

LLM-based Agents · Open-World Planning · Minecraft Agents
GITM replaces low-level reinforcement learning controls with an LLM-based hierarchical planner using structured actions and text-based memory to unlock the entire Minecraft technology tree.
Core Problem
Reinforcement learning agents in Minecraft struggle to map long-horizon goals directly to low-level keyboard/mouse inputs, resulting in extreme sample inefficiency and poor generalization beyond specific tasks like ObtainDiamond.
Why it matters:
  • Current RL agents (e.g., VPT) require massive compute (thousands of GPU days) and millions of steps to learn single tasks
  • Existing agents cannot generalize to the broader open world, unlocking only ~30% of the Minecraft technology tree
  • Direct mapping from complex goals to low-level control signals creates a difficult credit assignment problem for long-horizon tasks
Concrete Example: In the 'ObtainDiamond' task, RL agents must execute ~30 million low-level steps (clicks/movement) to succeed. If an RL agent is trained only for diamonds, it cannot adapt to obtain a 'stone sword' without expensive retraining, whereas humans generalize skills immediately.
Key Novelty
Ghost in the Minecraft (GITM) Framework
  • Decomposes goals hierarchically: Goal → Sub-goals → Structured Actions (e.g., 'mine', 'craft') → Low-level execution, mirroring human cognitive processes
  • Utilizes external text-based knowledge (Wiki recipes) for decomposition and text-based memory to store and retrieve successful plans for future use
  • Abandons end-to-end RL for low-level control, instead using an LLM to parameterize hand-written script interfaces for robust interaction
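The hierarchical decomposition described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the `RECIPES` table is a hypothetical stand-in for the Wiki-derived text knowledge, its item names and prerequisites are simplified placeholders rather than exact game data, and `decompose` is a hypothetical helper. In GITM the decomposition is produced by an LLM prompted with text knowledge; a dictionary lookup replaces it here so the example runs offline.

```python
from typing import Dict, List, Tuple

# Hypothetical stand-in for Wiki-derived recipe knowledge.
# Maps item -> (structured action that yields it, prerequisite items).
# Simplified for illustration; not exact Minecraft recipes.
RECIPES: Dict[str, Tuple[str, List[str]]] = {
    "stone_sword": ("craft", ["cobblestone", "stick"]),
    "cobblestone": ("mine", ["wooden_pickaxe"]),
    "wooden_pickaxe": ("craft", ["planks", "stick"]),
    "stick": ("craft", ["planks"]),
    "planks": ("craft", ["log"]),
    "log": ("mine", []),
}


def decompose(goal: str, recipes: Dict[str, Tuple[str, List[str]]] = RECIPES) -> List[Tuple[str, str]]:
    """Expand a goal into an ordered list of (action, item) sub-goals.

    Prerequisites are visited depth-first and emitted before the item
    that needs them, so the list can be executed front to back.
    """
    plan: List[Tuple[str, str]] = []
    seen = set()

    def visit(item: str) -> None:
        if item in seen:  # each sub-goal appears once in this simplified sketch
            return
        seen.add(item)
        action, prerequisites = recipes[item]
        for prerequisite in prerequisites:
            visit(prerequisite)
        plan.append((action, item))

    visit(goal)
    return plan


plan = decompose("stone_sword")
for action, item in plan:
    print(f"{action}({item})")
```

Because the goal is expanded from a knowledge table rather than learned end to end, switching the target from one item to another only changes the argument to `decompose`; no retraining step is involved in this sketch.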
Architecture
Architecture Figure: Figure 3
Overview of the GITM framework showing the interaction between Decomposer, Planner, Interface, and Environment
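As a rough sketch of how the Planner, Interface, and text-based memory might interact for a single sub-goal, consider the loop below. Everything named here is hypothetical: `TextMemory`, `run_sub_goal`, the stub `propose_plan` (standing in for an LLM call), and the stub `execute` (standing in for the scripted interface to the environment). The retry-with-feedback behavior is an assumption made for illustration, not a detail confirmed by this summary.

```python
from typing import Callable, Dict, List, Optional, Tuple

Action = str  # a structured action rendered as text, e.g. "mine(cobblestone)"


class TextMemory:
    """Hypothetical text memory: stores successful plans as plain text per sub-goal."""

    def __init__(self) -> None:
        self._plans: Dict[str, str] = {}

    def store(self, sub_goal: str, plan: List[Action]) -> None:
        self._plans[sub_goal] = "\n".join(plan)

    def retrieve(self, sub_goal: str) -> Optional[List[Action]]:
        text = self._plans.get(sub_goal)
        return text.split("\n") if text is not None else None


def run_sub_goal(
    sub_goal: str,
    propose_plan: Callable[[str, str], List[Action]],
    execute: Callable[[Action], Tuple[bool, str]],
    memory: TextMemory,
    max_attempts: int = 3,
) -> bool:
    """Try a remembered plan first, otherwise ask the planner; retry using feedback."""
    feedback = ""
    for attempt in range(max_attempts):
        plan = memory.retrieve(sub_goal) if attempt == 0 else None
        if plan is None:
            plan = propose_plan(sub_goal, feedback)
        succeeded = True
        for action in plan:
            ok, feedback = execute(action)
            if not ok:
                succeeded = False
                break
        if succeeded:
            memory.store(sub_goal, plan)  # keep the working plan for later reuse
            return True
    return False


# --- Toy stubs so the sketch runs without an LLM or a game ---
inventory = set()
planner_calls: List[str] = []


def execute(action: Action) -> Tuple[bool, str]:
    if action == "mine(cobblestone)" and "wooden_pickaxe" not in inventory:
        return False, "mine(cobblestone) failed: no pickaxe in inventory"
    inventory.add(action[action.index("(") + 1 : -1])
    return True, f"{action} succeeded"


def propose_plan(sub_goal: str, feedback: str) -> List[Action]:
    planner_calls.append(feedback)
    if "no pickaxe" in feedback:
        return ["craft(wooden_pickaxe)", "mine(cobblestone)"]
    return ["mine(cobblestone)"]


memory = TextMemory()
first = run_sub_goal("cobblestone", propose_plan, execute, memory)
second = run_sub_goal("cobblestone", propose_plan, execute, memory)
```

In this toy run the first call needs two planner queries (the initial plan fails and the feedback prompts a revised one), while the second call reuses the stored plan and does not query the planner at all.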
Evaluation Highlights
  • +47.5% success rate improvement on the 'ObtainDiamond' task compared to the previous state-of-the-art (VPT)
  • First agent to unlock 100% of the Minecraft Overworld technology tree (262 items), whereas prior methods covered only ~30%
  • Achieves these results on a single CPU node (32 cores) in 2 days, with no GPU training, and reduces environment interaction steps by >10,000x compared to RL baselines; VPT, by comparison, required 6,480 GPU days
Breakthrough Assessment
9/10
Achieving 100% coverage of the technology tree and outperforming massive RL baselines using only CPU compute and LLM planning is a paradigm shift for open-world agents.