Ghost in the Minecraft: Generally Capable Agents for Open-World Environments via Large Language Models with Text-based Knowledge and Memory

📝 Paper Summary

LLM-based Agents | Open-World Planning | Minecraft Agents

GITM replaces low-level reinforcement learning controls with an LLM-based hierarchical planner using structured actions and text-based memory to unlock the entire Minecraft technology tree.

Core Problem

Reinforcement learning agents in Minecraft struggle to map long-horizon goals directly to low-level keyboard/mouse inputs, resulting in extreme sample inefficiency and poor generalization beyond specific tasks like ObtainDiamond.

Why it matters:

Current RL agents (e.g., VPT) require massive compute (thousands of GPU days) and millions of steps to learn single tasks
Existing agents cannot generalize to the broader open world, unlocking only ~30% of the Minecraft technology tree
Direct mapping from complex goals to low-level control signals creates a difficult credit assignment problem for long-horizon tasks

Concrete Example: In the 'ObtainDiamond' task, RL agents must execute ~30 million low-level steps (clicks/movement) to succeed. If an RL agent is trained only for diamonds, it cannot adapt to obtain a 'stone sword' without expensive retraining, whereas humans generalize skills immediately.

Key Novelty

Ghost in the Minecraft (GITM) Framework

Decomposes goals hierarchically: Goal → Sub-goals → Structured Actions (e.g., 'mine', 'craft') → Low-level execution, mirroring human cognitive processes
Utilizes external text-based knowledge (Wiki recipes) for decomposition and text-based memory to store and retrieve successful plans for future use
Abandons end-to-end RL for low-level control, instead using an LLM to parameterize hand-written script interfaces for robust interaction

Architecture

Overview of the GITM framework showing the interaction between Decomposer, Planner, Interface, and Environment

Evaluation Highlights

+47.5 percentage points in success rate on the 'ObtainDiamond' task over the previous state of the art (VPT): 67.5% vs. 20.0%
First agent to unlock 100% of the Minecraft Overworld technology tree (262 items), whereas prior methods covered only ~30%
Achieves these results on a single CPU node (32 cores) in 2 days, whereas VPT required 6,480 GPU days; environment interaction steps are reduced by >10,000x compared to RL baselines

Breakthrough Assessment

9/10

Achieving 100% coverage of the technology tree and outperforming massive RL baselines using only CPU compute and LLM planning is a paradigm shift for open-world agents.

⚙️ Technical Details

Problem Definition

Setting: Open-world task completion in Minecraft, specifically collecting items defined in the Overworld technology tree.

Inputs: High-level goal string (e.g., 'Obtain 1 diamond')

Outputs: Sequence of keyboard and mouse operations to execute the goal

Pipeline Flow

User Goal -> LLM Decomposer -> Sub-goal Tree
Sub-goal -> LLM Planner (consulting Memory/Feedback) -> Structured Actions
Structured Actions -> LLM Interface -> Keyboard/Mouse Events

System Modules

LLM Decomposer (Planning)

Decompose the main goal into a dependency tree of sub-goals based on external knowledge

Model or implementation: gpt-3.5-turbo
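
As a rough illustration of goal decomposition, the sketch below expands a goal into a sub-goal tree by recursive lookup and then lists the sub-goals in the order they would need to be completed. The `RECIPES` table and both function names are hypothetical stand-ins: GITM prompts gpt-3.5-turbo with retrieved Wiki and recipe text rather than walking a hard-coded dictionary, and the prerequisite chain here is deliberately simplified.

```python
from typing import Dict, List

# Hypothetical, simplified prerequisite table standing in for external text knowledge.
RECIPES: Dict[str, List[str]] = {
    "diamond": ["iron_pickaxe"],
    "iron_pickaxe": ["stone_pickaxe"],
    "stone_pickaxe": ["wooden_pickaxe"],
    "wooden_pickaxe": ["log"],
}


def decompose(goal: str, recipes: Dict[str, List[str]]) -> dict:
    """Build a sub-goal tree: each node lists the prerequisites of its goal."""
    return {
        "goal": goal,
        "sub_goals": [decompose(p, recipes) for p in recipes.get(goal, [])],
    }


def execution_order(tree: dict) -> List[str]:
    """Post-order walk: prerequisites come before the goals that need them."""
    order: List[str] = []
    for child in tree["sub_goals"]:
        for item in execution_order(child):
            if item not in order:
                order.append(item)
    if tree["goal"] not in order:
        order.append(tree["goal"])
    return order


tree = decompose("diamond", RECIPES)
print(execution_order(tree))
# ['log', 'wooden_pickaxe', 'stone_pickaxe', 'iron_pickaxe', 'diamond']
```

Items without an entry in the table are treated as leaves that can be gathered directly.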

LLM Planner (Planning)

Generate a sequence of structured actions for a given sub-goal, utilizing memory and feedback

Model or implementation: gpt-3.5-turbo
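
The planner's use of feedback can be pictured as a replanning loop. The following is a minimal sketch under stated assumptions, not the authors' implementation: `propose` stands in for the LLM call, `execute` stands in for the environment, and the two `stub_*` functions are toy placeholders invented for this example.

```python
from typing import Callable, Dict, List, Tuple

Inventory = Dict[str, int]


def plan_with_feedback(
    sub_goal: str,
    propose: Callable[[str, Inventory, str], List[str]],
    execute: Callable[[str, Inventory], Tuple[bool, str]],
    max_rounds: int = 3,
) -> bool:
    """Request a plan, run it, and replan with the latest feedback if a step fails."""
    inventory: Inventory = {}
    feedback = ""
    for _ in range(max_rounds):
        plan = propose(sub_goal, inventory, feedback)
        for action in plan:
            ok, feedback = execute(action, inventory)
            if not ok:
                break  # abandon the rest of this plan and replan
        if inventory.get(sub_goal, 0) > 0:
            return True
    return False


def stub_propose(sub_goal: str, inventory: Inventory, feedback: str) -> List[str]:
    # Hypothetical stand-in for an LLM: only reacts to one kind of failure message.
    if "missing log" in feedback:
        return ["mine(log)", "craft(planks)"]
    return ["craft(planks)"]


def stub_execute(action: str, inventory: Inventory) -> Tuple[bool, str]:
    # Hypothetical stand-in for the environment: updates a toy inventory in place.
    if action == "mine(log)":
        inventory["log"] = inventory.get("log", 0) + 1
        return True, "mined log"
    if action == "craft(planks)":
        if inventory.get("log", 0) < 1:
            return False, "craft(planks) failed: missing log"
        inventory["log"] -= 1
        inventory["planks"] = inventory.get("planks", 0) + 4
        return True, "crafted planks"
    return False, f"unknown action: {action}"


print(plan_with_feedback("planks", stub_propose, stub_execute))
```

In this toy run the first plan fails because no log is available, and the failure message steers the second plan toward mining a log first.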

LLM Interface

Execute structured actions by converting them into low-level keyboard/mouse control

Model or implementation: Hand-written scripts (based on MineDojo API)
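
One way to picture the Interface layer is as a parser plus a dispatch table from action names to scripts. The sketch below assumes the `name(argument)` action syntax used in this summary; the script functions and the low-level event strings they return are invented for illustration and do not correspond to the MineDojo API.

```python
import re
from typing import Callable, Dict, List, Tuple

# Matches strings such as "mine(iron_ore)" or "craft(stick)".
ACTION_PATTERN = re.compile(r"^(\w+)\((\w*)\)$")


def parse_action(text: str) -> Tuple[str, str]:
    """Split a structured action into its name and argument."""
    match = ACTION_PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"not a structured action: {text!r}")
    return match.group(1), match.group(2)


def mine_script(target: str) -> List[str]:
    # Hypothetical event names; a real script would emit keyboard/mouse commands.
    return [f"look_at:{target}", "attack:hold", "attack:release"]


def craft_script(target: str) -> List[str]:
    return ["open:inventory", f"select_recipe:{target}", "close:inventory"]


SCRIPTS: Dict[str, Callable[[str], List[str]]] = {
    "mine": mine_script,
    "craft": craft_script,
}


def to_low_level(text: str) -> List[str]:
    """Translate one structured action into a list of low-level events."""
    name, argument = parse_action(text)
    if name not in SCRIPTS:
        raise KeyError(f"no script registered for action {name!r}")
    return SCRIPTS[name](argument)


print(to_low_level("mine(iron_ore)"))
# ['look_at:iron_ore', 'attack:hold', 'attack:release']
```

Keeping the scripts behind a fixed table means the planner can only emit actions that have a known implementation; anything else is rejected before it reaches the environment.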

Novel Architectural Elements

Three-stage hierarchy: Decomposer (Goal->Subgoal) -> Planner (Subgoal->Structured Action) -> Interface (Structured Action->Control)
Closed-loop feedback mechanism injecting execution status and inventory state back into the LLM prompt for replanning
Explicit text-based memory bank that summarizes successful action sequences for reuse across episodes
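
The text-based memory can be sketched as a store of successful action sequences keyed by sub-goal, returned as a line of text that could be placed in a later prompt. The class below is an assumption-laden illustration: the `TextMemory` name and its methods are hypothetical, and choosing the shortest recorded plan is only a placeholder for the summarization step described above.

```python
from collections import defaultdict
from typing import Dict, List, Optional


class TextMemory:
    """Stores successful plans per sub-goal and returns one as prompt text."""

    def __init__(self) -> None:
        self._plans: Dict[str, List[List[str]]] = defaultdict(list)

    def record_success(self, sub_goal: str, actions: List[str]) -> None:
        """Remember an action sequence that completed the sub-goal."""
        self._plans[sub_goal].append(list(actions))

    def reference_plan(self, sub_goal: str) -> Optional[str]:
        """Return a text reference plan, or None if nothing has succeeded yet."""
        plans = self._plans.get(sub_goal)
        if not plans:
            return None
        # Placeholder "summary": the shortest plan recorded so far.
        best = min(plans, key=len)
        return f"Reference plan for {sub_goal}: " + " -> ".join(best)


memory = TextMemory()
memory.record_success("stick", ["mine(log)", "craft(planks)", "craft(planks)", "craft(stick)"])
memory.record_success("stick", ["mine(log)", "craft(planks)", "craft(stick)"])
print(memory.reference_plan("stick"))
# Reference plan for stick: mine(log) -> craft(planks) -> craft(stick)
```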

Modeling

Base Model: gpt-3.5-turbo (OpenAI API)

Compute: A single CPU node with 32 cores (no GPU required for the agent logic). Total training time ~2 days.

Comparison to Prior Work

vs. VPT: GITM uses hierarchical LLM planning + scripted execution vs. end-to-end RL/BC policy
vs. DEPS: GITM uses scripted structured actions vs. DEPS using a pre-trained RL controller for sub-tasks
vs. DreamerV3: GITM leverages external text knowledge and memory vs. learning purely from environmental reward

Limitations

Depends on the availability and quality of external text knowledge (Wiki/Recipes)
Reliance on hand-written scripts for the Interface layer may limit dexterity compared to learned low-level policies
Cost and latency associated with querying the OpenAI API (GPT-3.5) during gameplay

Reproducibility

Code: https://github.com/OpenGVLab/GITM

The external knowledge base is built from the Minecraft Wiki and MineDojo recipes. Structured actions are extracted from MineDojo tasks. Training relies on the OpenAI API (gpt-3.5-turbo).

📊 Experiments & Results

Evaluation Setup

Minecraft Overworld environment via MineDojo. Tasks involve collecting specific items.

Benchmarks:

ObtainDiamond (Long-horizon item collection)
Overworld Technology Tree Coverage (multi-task generalization, 262 items) [New]

Metrics:

Success Rate (%)
Coverage of Technology Tree (Count)
Statistical methodology: 40 games are run per setting in the ablation studies. Baselines are compared under different time limits; GITM uses the strictest (10 minutes).

Key Results

GITM significantly outperforms RL baselines on the ObtainDiamond benchmark.

Benchmark	Metric	Baseline	This Paper	Δ
ObtainDiamond	Success Rate (%)	20.0	67.5	+47.5
ObtainDiamond	Success Rate (%)	0.01	67.5	+67.49
ObtainDiamond	Success Rate (%)	0.6	67.5	+66.9

Ablation studies demonstrate the critical role of external knowledge and memory.

Benchmark	Metric	Baseline	This Paper	Δ
ObtainDiamond	Success Rate (%)	0.0	67.5	+67.5
ObtainDiamond	Success Rate (%)	0.0	67.5	+67.5
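
The Δ column in the table above is an absolute difference in percentage points, not a relative improvement. The short check below recomputes it from the listed values. It also shows how many successful games a rate of 67.5% would imply if it were measured over 40 games; the summary states 40 games per setting only for the ablation studies, so applying that count to the headline result is an assumption.

```python
# (baseline, this_paper) success rates in percent, copied from the table above.
ROWS = [(20.0, 67.5), (0.01, 67.5), (0.6, 67.5), (0.0, 67.5), (0.0, 67.5)]


def delta_pp(baseline: float, ours: float) -> float:
    """Absolute difference in percentage points, rounded to 2 decimals."""
    return round(ours - baseline, 2)


def successes(rate_percent: float, games: int) -> int:
    """Number of successful games implied by a success rate."""
    return round(rate_percent / 100 * games)


print([delta_pp(b, o) for b, o in ROWS])
# [47.5, 67.49, 66.9, 67.5, 67.5]
print(successes(67.5, 40))
# 27
```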

Experiment Figures

Success rates for all items in the Overworld Technology Tree, comparing GITM, DEPS, DreamerV3, and VPT

Learning efficiency comparison (Success Rate vs. Steps)

Main Takeaways

GITM unlocks the entire Overworld technology tree (262 items), a capability previously unachieved by RL methods (max ~30%).
Text-based memory increases success rate significantly (e.g., +32.5% on Diamond) by summarizing successful experiences into reusable plans.
Learning efficiency is orders of magnitude higher: GITM reaches high performance in ~5k steps with CPU only, compared to millions of steps and GPU clusters for RL.
Goal decomposition is essential for long-horizon tasks; without it, the planner cannot look far enough ahead to obtain items like Diamonds.

📚 Prerequisite Knowledge

Prerequisites

Large Language Models (LLMs) and prompting strategies
Hierarchical planning concepts
Minecraft game mechanics (crafting recipes, biomes)

Key Terms

ObtainDiamond: A standard Minecraft benchmark task requiring the agent to survive, gather resources, craft tools, and mine a diamond from scratch

Structured Actions: High-level, semantic actions (e.g., 'mine(iron_ore)', 'craft(stick)') used by the LLM planner, which are then translated into low-level controls by scripts

VPT: Video PreTraining—a baseline method that trains a foundation model for Minecraft using massive unlabeled video data

DreamerV3: A model-based reinforcement learning algorithm used as a baseline

Technology Tree: The hierarchy of craftable items in Minecraft, where obtaining advanced items requires possessing specific prerequisite tools and materials

MineDojo: A simulation platform and benchmark suite for developing open-ended embodied agents in Minecraft

LLM Decomposer: A module that breaks a high-level goal into a tree of prerequisite sub-goals (e.g., Diamond -> Iron Pickaxe -> Stone Pickaxe)

Overworld: The main dimension in Minecraft where standard gameplay takes place