Recursive Language Models

📝 Paper Summary

Context management, Agentic reasoning, Inference-time scaling

RLMs treat long prompts as external variables in a REPL environment, allowing LLMs to programmatically decompose and recursively process inputs of unbounded length.

Core Problem

Frontier LLMs suffer from 'context rot' where performance degrades on long inputs, and existing compaction methods lose critical details needed for information-dense tasks.

Why it matters:

LLMs are increasingly adopted for long-horizon tasks (processing 10M+ tokens) where current context windows are insufficient
Context compaction (summarization) fails on tasks requiring dense access to specific details throughout the prompt
Directly feeding long prompts into Transformers is computationally expensive and hits hard architectural limits

Concrete Example: In the OOLONG-Pairs task (aggregating pairs of chunks), standard GPT-5 achieves <0.1% F1 because it cannot maintain attention over the full sequence. An RLM solves this by programmatically looping over the data and aggregating partial results into a variable, achieving 58.0% F1.
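
The looping strategy in this example can be sketched roughly as follows. This is a toy illustration, not the paper's code: `label_chunk` is a hypothetical stand-in for a semantic sub-call (a real RLM would invoke a language model there), and the pairing rule is invented for the demo.

```python
from itertools import combinations
from typing import List, Tuple


def label_chunk(chunk: str) -> str:
    """Hypothetical stand-in for a sub-call; a real RLM would query an LLM here."""
    return "question" if chunk.strip().endswith("?") else "statement"


def matching_pairs(chunks: List[str], wanted: str = "question") -> List[Tuple[int, int]]:
    """Return index pairs (i, j), i < j, where both chunks carry the wanted label."""
    # One sub-call per chunk; the labels live in a variable, not in model context.
    labels = [label_chunk(c) for c in chunks]
    pairs: List[Tuple[int, int]] = []
    # The quadratic pairing step is ordinary code, so it needs no further model calls.
    for i, j in combinations(range(len(chunks)), 2):
        if labels[i] == wanted and labels[j] == wanted:
            pairs.append((i, j))
    return pairs


chunks = ["Where is it?", "It is here.", "Why?", "Who?"]
print(matching_pairs(chunks))  # [(0, 2), (0, 3), (2, 3)]
```

The point of the sketch is the division of labor: semantic judgments go to sub-calls, while the pairwise bookkeeping stays in plain code and a plain list.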

Key Novelty

Recursive Language Models (RLMs)

Treats the user prompt as a variable in an external REPL environment rather than feeding it directly into the LLM context window
Enables symbolic recursion: the model writes code to slice the prompt and invoke itself (sub-calls) on those slices to perform semantic work
Uses a persistent REPL loop to store intermediate results in variables, effectively decoupling memory capacity from the model's context window
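
As a rough illustration of the prompt-as-variable idea, the sketch below shows the kind of compact description a root model might receive in place of the full prompt. The function name `prompt_metadata` and the fields it returns are assumptions made for this example; the summary does not specify what the actual implementation exposes.

```python
from typing import Dict, Union


def prompt_metadata(prompt: str, preview_chars: int = 40) -> Dict[str, Union[int, str]]:
    """Describe a prompt without placing its full text in the model context."""
    return {
        "num_chars": len(prompt),
        "num_lines": len(prompt.splitlines()),
        "preview": prompt[:preview_chars],
    }


# The full text stays in the environment as the variable P.
P = "header line\n" + "filler\n" * 100_000
meta = prompt_metadata(P)
print(meta["num_chars"], meta["num_lines"])  # 700012 100001
```

Only the small dictionary would need to enter the context window; the 700,012-character string itself remains available to code through `P`.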

Architecture

Comparison of Standard LLM, Code Agent, and Recursive Language Model (RLM) workflows.

Evaluation Highlights

Outperforms vanilla GPT-5 by 28.4% on the linear-complexity OOLONG task while maintaining comparable cost
Achieves 58.0% F1 on the quadratic-complexity OOLONG-Pairs task where vanilla GPT-5 fails completely (<0.1% F1)
Fine-tuned RLM-Qwen3-8B outperforms the base Qwen3-8B model by a median of 28.3% across four long-context tasks

Breakthrough Assessment

9/10

Offers a paradigm shift for long-context processing by moving the prompt out of the context window and into an executable environment. Demonstrates scaling to inputs orders of magnitude longer than the context window on hard tasks where frontier models fail.

⚙️ Technical Details

Problem Definition

Setting: Processing an arbitrary-length prompt P >> K (model context size) to produce response Y

Inputs: String prompt P of arbitrary structure and length

Outputs: String response Y generated via symbolic interaction with P

Pipeline Flow

Initialization (Prompt P stored as variable in REPL)
Root LLM Loop (Generates code to inspect P, sub-call, or compute)
Execution (Code runs in REPL; sub-calls invoke new RLM instances)
Termination (Final variable set, loop ends)
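
The four stages above can be sketched as a small loop. This is a minimal sketch under stated assumptions, not the paper's implementation: the variable names `P` and `FINAL`, the `run_rlm` helper, and the scripted stand-in for the root model are all hypothetical, and sub-calls are omitted to keep the loop visible.

```python
from typing import Callable, Dict, List


def run_rlm(prompt: str, root_model: Callable[[List[str]], str], max_steps: int = 8) -> str:
    # Initialization: the prompt is stored as a variable in the REPL namespace.
    env: Dict[str, object] = {"P": prompt, "FINAL": None}
    # The model sees metadata and a log of executed code, not the prompt itself.
    history: List[str] = [f"P is a str of length {len(prompt)}"]
    for _ in range(max_steps):
        code = root_model(history)   # Root LLM loop: the model emits code.
        exec(code, env)              # Execution: unsafe without a sandbox in real use.
        if env["FINAL"] is not None:  # Termination: the final variable is set.
            return str(env["FINAL"])
        history.append(f"ran: {code!r}")
    raise RuntimeError("step budget exhausted without a final answer")


def scripted_model(history: List[str]) -> str:
    """Stand-in for an LLM: replays a fixed two-step program."""
    steps = [
        "hits = [line for line in P.splitlines() if 'needle' in line]",
        "FINAL = hits[0]",
    ]
    return steps[len(history) - 1]


print(run_rlm("hay\nthe needle is 42\nhay", scripted_model))  # the needle is 42
```

Because `env` persists across iterations, the `hits` list created in the first step is still available in the second, which is what lets intermediate results accumulate outside the context window.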

System Modules

Environment Initializer

Sets up the REPL with the user prompt P as a variable and defines the sub_RLM function

Model or implementation: Deterministic code

Root Agent

Generates Python code to inspect prompt metadata, slice prompts, and invoke sub-agents

Model or implementation: LLM (e.g., GPT-5, Qwen3-Coder)

Executor

Executes Python code, manages variables, and handles sub_RLM calls

Model or implementation: Python Interpreter

Novel Architectural Elements

Prompt-as-Variable: The user prompt is stored strictly in the external environment, not the context window
Symbolic Recursion Interface: Explicit API allowing the model to invoke itself programmatically on data slices
Recursive Inference Loop: A control flow in which the depth of processing is not fixed in advance but is set at run time by how many levels of sub-calls the model chooses to make
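
To make the recursion concrete, here is a toy sketch in which a function invokes itself on slices until each slice fits a size limit. Two things are assumptions for illustration: `base_model` is a hypothetical leaf call that stands in for an LLM, and the fixed halving rule replaces the slicing decisions that, in an RLM, the model itself would write as code.

```python
from typing import Tuple


def base_model(text: str) -> int:
    """Hypothetical leaf call: counts lines containing 'error'."""
    return sum(1 for line in text.splitlines() if "error" in line)


def sub_rlm(text: str, window: int, depth: int = 0, max_depth: int = 10) -> Tuple[int, int]:
    """Return (answer, deepest recursion level reached)."""
    lines = text.splitlines()
    # Leaf case: the slice fits the window, cannot be split, or the depth cap is hit.
    if len(text) <= window or len(lines) < 2 or depth >= max_depth:
        return base_model(text), depth
    mid = len(lines) // 2
    left, d_left = sub_rlm("\n".join(lines[:mid]), window, depth + 1, max_depth)
    right, d_right = sub_rlm("\n".join(lines[mid:]), window, depth + 1, max_depth)
    # Partial answers are combined symbolically, outside any model call.
    return left + right, max(d_left, d_right)


log = "\n".join(["ok", "error a", "ok", "error b", "ok", "error c", "ok", "ok"])
print(sub_rlm(log, window=20))    # (3, 2)
print(sub_rlm(log, window=1000))  # (3, 0)
```

The same input yields the same answer at different depths: the depth reached depends on the input and the window, not on a predetermined tree.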

Modeling

Base Model: GPT-5 (medium reasoning), Qwen3-Coder-480B-A35B, Qwen3-8B

Training Method: Supervised Fine-Tuning (SFT) on RLM trajectories

Adaptation: Full fine-tuning (implied for small model)

Training Data:

1,000 filtered trajectories from Qwen3-Coder-480B-A35B operating as an RLM
Tasks: LongBenchPro (unrelated to eval domains)

Key Hyperparameters:

training_samples: 1000

Compute: Not reported in the paper

Comparison to Prior Work

vs. MemWalker: RLM uses dynamic programmatic recursion rather than a static tree structure
vs. CodeAct: RLM offloads the prompt to the environment; CodeAct puts it in the context window
vs. ReSum: RLM preserves dense information via recursion; ReSum relies on lossy compression
vs. AgentFold [not cited in paper]: AgentFold focuses on task decomposition for known schemas; RLM focuses on arbitrary long-context data manipulation via recursion

Limitations

Inference cost can be high and has high variance due to potentially infinite loops or deep recursion
Performance on very short prompts is slightly worse than base models due to overhead
Requires a capable base model that can write correct Python code and manage control flow
Latency is higher than single-call models due to sequential execution of the REPL loop
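
One common way to bound the cost variance described above is an explicit budget on sub-calls and recursion depth. The sketch below is a generic guard, not something the summary attributes to the paper; the `CallBudget` and `BudgetExceeded` names and the specific limits are hypothetical.

```python
class BudgetExceeded(RuntimeError):
    """Raised when a run uses more sub-calls or recursion depth than allowed."""


class CallBudget:
    def __init__(self, max_calls: int, max_depth: int) -> None:
        self.max_calls = max_calls
        self.max_depth = max_depth
        self.calls = 0

    def charge(self, depth: int) -> None:
        """Record one sub-call at the given depth, or raise if a limit is hit."""
        if depth > self.max_depth:
            raise BudgetExceeded(f"depth {depth} exceeds limit {self.max_depth}")
        if self.calls >= self.max_calls:
            raise BudgetExceeded(f"call limit {self.max_calls} reached")
        self.calls += 1


budget = CallBudget(max_calls=3, max_depth=2)
for _ in range(3):
    budget.charge(depth=1)
try:
    budget.charge(depth=1)
except BudgetExceeded as exc:
    print("stopped:", exc)  # stopped: call limit 3 reached
```

A guard like this would be called at the top of every sub-call, turning a runaway loop into a bounded failure that the caller can handle.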

Reproducibility

Code: https://github.com/alexzhang13/rlm

The paper accesses the closed-source GPT-5 and the open-weight Qwen3-Coder-480B-A35B via API. Fine-tuning used Qwen3-8B.

📊 Experiments & Results

Evaluation Setup

Task-agnostic inference on long-context benchmarks

Benchmarks:

S-NIAH: Single Needle-in-a-Haystack (finding a specific phrase)
BrowseComp-Plus (1K): Multi-hop QA over 1,000 documents
OOLONG: Long reasoning (linear-complexity aggregation)
OOLONG-Pairs: Long reasoning (quadratic-complexity aggregation) [New]
LongBench-v2 CodeQA: Code repository understanding

Metrics:

Accuracy
F1 Score
Percentage of correct answers
Cost ($)
Statistical methodology: Not explicitly reported in the paper
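
For readers unfamiliar with the F1 metric listed above, the sketch below computes the standard set-based F1 (the harmonic mean of precision and recall). The summary does not describe the benchmark's exact scoring script, so treat this as the textbook definition rather than the paper's evaluator; the `set_f1` name and the convention that two empty sets score 1.0 are assumptions.

```python
from typing import Hashable, Iterable


def set_f1(predicted: Iterable[Hashable], gold: Iterable[Hashable]) -> float:
    """Standard F1 between a predicted set and a gold set."""
    pred, ref = set(predicted), set(gold)
    if not pred and not ref:
        return 1.0  # Assumed convention: nothing to find, nothing predicted.
    true_positives = len(pred & ref)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(ref)
    return 2 * precision * recall / (precision + recall)


predicted_pairs = {(0, 1), (0, 2), (1, 2)}
gold_pairs = {(0, 2), (1, 2), (2, 3)}
print(round(set_f1(predicted_pairs, gold_pairs), 3))  # 0.667
```

Here two of three predictions are correct and two of three gold pairs are found, so precision and recall are both 2/3 and F1 is 2/3 as well.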

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
RLM outperforms base models and baselines on complex reasoning tasks (OOLONG variants), where standard models fail completely or degrade significantly.
OOLONG	Score	71.6	100.0	+28.4
OOLONG-Pairs	F1 Score	0.1	58.0	+57.9
OOLONG	Score	46.1	79.4	+33.3
On retrieval-heavy tasks (BrowseComp-Plus), RLMs outperform summarization and retrieval agents while maintaining lower costs.
BrowseComp-Plus (1K)	Accuracy (%)	69.3	98.7	+29.4
Fine-tuning a small model (8B) on RLM trajectories yields massive improvements.
Average across 4 tasks	Relative Improvement	0.0	28.3	+28.3
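
The Δ column in the table above is the plain difference between the "This Paper" and "Baseline" columns, in whatever unit each row uses. The short check below recomputes it from the rows exactly as listed; the `delta` helper is just a convenience for this example.

```python
# (benchmark, baseline, this paper, reported delta), copied from the table rows.
rows = [
    ("OOLONG", 71.6, 100.0, 28.4),
    ("OOLONG-Pairs", 0.1, 58.0, 57.9),
    ("OOLONG", 46.1, 79.4, 33.3),
    ("BrowseComp-Plus (1K)", 69.3, 98.7, 29.4),
    ("Average across 4 tasks", 0.0, 28.3, 28.3),
]


def delta(baseline: float, this_paper: float) -> float:
    """Absolute difference, rounded to the table's one-decimal precision."""
    return round(this_paper - baseline, 1)


for name, baseline, this_paper, reported in rows:
    status = "ok" if delta(baseline, this_paper) == reported else "mismatch"
    print(f"{name}: {delta(baseline, this_paper):+.1f} ({status})")
```

Every row reproduces its reported Δ, which confirms the column is an absolute difference rather than a relative change.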

Experiment Figures

Performance degradation of GPT-5 vs. RLM(GPT-5) as context length increases across three tasks (S-NIAH, OOLONG, OOLONG-Pairs).

Cost distribution (box plots) for different methods.

Main Takeaways

RLMs prevent 'context rot' by keeping the main prompt outside the neural network, allowing performance to degrade much more slowly than vanilla LLMs as length increases.
Symbolic recursion is essential for high-complexity tasks: on OOLONG-Pairs, sub-calls allow the model to manage quadratic complexity that overwhelms standard attention mechanisms.
RLMs are cost-competitive: by selectively peeking at data rather than ingesting it all, RLM(GPT-5) is cheaper than summarization agents on massive inputs.
The paradigm is learnable: a small 8B model can be fine-tuned to become a natively recursive model with just 1,000 examples.

📚 Prerequisite Knowledge

Prerequisites

Familiarity with LLM context window limitations
Understanding of REPL (Read-Eval-Print Loop) environments
Basic knowledge of agentic tool use and code execution

Key Terms

REPL: Read-Eval-Print Loop—an interactive programming environment where the model can execute code and see results immediately

Context Rot: The phenomenon where LLM reasoning quality degrades steeply as the input prompt length increases, even within the allowed window

Symbolic Recursion: The ability of the model to write code that invokes the model itself on programmatically defined subsets of data

Context Compaction: A baseline strategy where context is repeatedly summarized (compressed) to fit within a model's window, often losing detail

S-NIAH: Single Needle-In-A-Haystack—a benchmark task requiring the retrieval of a specific small piece of information from a large text

OOLONG: A long-context reasoning benchmark requiring linear processing of the input (using nearly all entries)

OOLONG-Pairs: A variation of OOLONG requiring quadratic processing (aggregating pairs of chunks), extremely difficult for standard Transformers

CodeAct: An agent framework that allows LLMs to execute code actions; used here as a baseline

ReAct: Reasoning and Acting—a prompting paradigm where models generate reasoning traces before taking actions
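
To illustrate why the Context Compaction entry above mentions losing detail, the toy sketch below contrasts a running summary with a fixed size limit against simply storing each fact in a variable. It is deliberately simplistic: `summarize` is a hypothetical stand-in that truncates text, whereas a real LLM summarizer would drop information less predictably.

```python
from typing import Dict, List


def summarize(text: str, limit: int) -> str:
    """Hypothetical lossy summarizer: keeps only the first `limit` characters."""
    return text[:limit]


def compact(chunks: List[str], limit: int) -> str:
    """Fold chunks into one running summary that never exceeds `limit`."""
    summary = ""
    for chunk in chunks:
        summary = summarize((summary + " " + chunk).strip(), limit)
    return summary


def store(chunks: List[str]) -> Dict[str, str]:
    """Keep every fact in a variable instead of compressing it."""
    return dict(chunk.split("=") for chunk in chunks)


facts = ["alpha=1", "beta=2", "gamma=3"]
print(compact(facts, limit=14))  # alpha=1 beta=2
print(store(facts)["gamma"])     # 3
```

Once the summary is full, the third fact has nowhere to go, while the dictionary retains it regardless of how many entries arrive.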