RePo: Language Models with Context Re-Positioning

📝 Paper Summary

Context Management, Position Encoding

RePo replaces rigid linear position indices in LLMs with a learnable re-positioning mechanism that dynamically assigns token positions based on relevance, improving performance on noisy and long-context tasks.

Core Problem

Standard LLMs assign fixed linear or constant position indices to tokens, imposing a rigid structure that fails to reflect actual information relevance.

Why it matters:

Rigid structures increase extraneous cognitive load, wasting finite working memory (attention capacity) on organizing disordered information rather than reasoning
Tasks requiring long-range dependencies (e.g., needle-in-a-haystack) suffer because linear positioning forces locality bias, making distant but relevant tokens harder to attend to
Linear assignment treats all context as equally spaced, limiting the model's ability to group related information or ignore noise

Concrete Example: In a needle-in-a-haystack task, a critical answer (the needle) is buried far from the question (the query) amid irrelevant text, and standard RoPE attention favors nearby tokens because of its locality bias. RePo can assign the needle a position value close to the query's, which shrinks their relative distance in the position encoding and lets the model attend to the needle despite the long linear distance.

Key Novelty

Context Re-Positioning (RePo)

Introduces a lightweight, differentiable module that predicts a continuous position value for each token based on its content, rather than its sequence index
Optimizes these predicted positions end-to-end using differentiable position encodings (like RoPE), allowing the model to 'move' relevant tokens closer together in attention space
Inspired by Cognitive Load Theory, it treats position assignment as a way to reduce extraneous load by organizing context more efficiently for the attention mechanism

Evaluation Highlights

+11.04 points average improvement over RoPE on the RULER benchmark (noisy context) within training context length
Outperforms baselines by at least 13.25 EM points on QA and Needle-in-a-Haystack tasks when extending context to 16K tokens (4x training length)
+5.48 points average improvement on LongBench compared to baselines, demonstrating superior long-context generalization

Breakthrough Assessment

8/10

Offers a fundamental rethinking of position embeddings—from fixed indices to dynamic, content-aware values. Significant gains in noise robustness and long-context generalization without heavy architectural changes.

⚙️ Technical Details

Problem Definition

Setting: Language modeling with dynamic position assignment

Inputs: Sequence of tokens x = (x_1, ..., x_L)

Outputs: Next token prediction probabilities

Pipeline Flow

Input Embedding
Standard Transformer Layers (lower layers 1 to l-1)
RePo-augmented Layers (layer l through the final layer; the experiments use l = 5)

System Modules

Standard Layers

Process local surface-level features (syntax, POS tagging) using standard fixed position encodings

Model or implementation: OLMo-2 Transformer blocks

Position Representation (RePo Mechanism)

Extract a lower-dimensional position representation from the token's hidden state

Model or implementation: SwiGLU sub-layer

Position Assignment (RePo Mechanism)

Project the position representation to a scalar continuous position value

Model or implementation: Linear projection W_z
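The two steps above (position representation, then position assignment) can be sketched in plain Python. This is a minimal illustration under assumptions, not the authors' implementation: the parameter names W_gate, W_up, and w_z, the toy dimensions, and the random weights are all hypothetical, and the gating is written as SiLU(W_gate h) * (W_up h) followed directly by the scalar projection.

```python
import math
import random

def silu(x):
    # Swish / SiLU activation: x * sigmoid(x)
    return x / (1.0 + math.exp(-x))

def matvec(W, v):
    # W is a list of rows; returns the matrix-vector product W @ v
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def assign_position(h, W_gate, W_up, w_z):
    """Map one hidden state h to a scalar position z (hypothetical parameter names)."""
    gate = [silu(g) for g in matvec(W_gate, h)]
    up = matvec(W_up, h)
    r = [g * u for g, u in zip(gate, up)]      # SwiGLU-style position representation
    return sum(w * x for w, x in zip(w_z, r))  # linear projection to a scalar

def rand_matrix(rows, cols):
    return [[random.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
d_model, d_pos = 8, 4  # toy sizes; the summary lists position_representation_dim = 256
W_gate, W_up = rand_matrix(d_pos, d_model), rand_matrix(d_pos, d_model)
w_z = [random.gauss(0.0, 0.5) for _ in range(d_pos)]

hidden_states = rand_matrix(5, d_model)  # stand-in hidden states for 5 tokens
positions = [assign_position(h, W_gate, W_up, w_z) for h in hidden_states]
print(positions)  # real-valued positions, not constrained to 0, 1, 2, ...
```

The point of the sketch is that each position is a function of the token's hidden state rather than of its index, so two tokens far apart in the sequence can receive nearby position values.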

RePo Attention (RePo Mechanism)

Compute attention scores using the dynamically assigned positions in the position encoding function (e.g., RoPE)

Model or implementation: Modified Attention with g_theta(z_j - z_i)
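To see why real-valued positions can be dropped into RoPE, the sketch below rotates query and key vectors by angles proportional to their assigned positions and checks that the resulting logit depends only on the difference z_j - z_i. The function names rope_rotate and repo_score, the base of 10000, and the toy vectors are illustrative assumptions; vectors are assumed to have even length.

```python
import math

def rope_rotate(vec, pos, base=10000.0):
    """Rotate consecutive pairs of vec by angle pos * theta_k; pos may be any real number."""
    d = len(vec)
    out = []
    for k in range(d // 2):
        theta = base ** (-2.0 * k / d)
        c, s = math.cos(pos * theta), math.sin(pos * theta)
        x, y = vec[2 * k], vec[2 * k + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def repo_score(q, k, z_i, z_j):
    """Unnormalised attention logit for query i and key j at assigned positions z_i, z_j."""
    q_rot = rope_rotate(q, z_i)
    k_rot = rope_rotate(k, z_j)
    return sum(a * b for a, b in zip(q_rot, k_rot))

q = [0.3, -1.2, 0.8, 0.5]
k = [1.0, 0.4, -0.7, 0.2]

a = repo_score(q, k, z_i=2.5, z_j=7.25)      # fractional positions are fine
b = repo_score(q, k, z_i=102.5, z_j=107.25)  # same difference, both shifted by 100
print(abs(a - b) < 1e-9)  # the logit depends only on z_j - z_i
```

Because the rotation is a smooth function of the position value, gradients can flow from the attention logits back into whatever module produced z_i and z_j, which is what makes end-to-end training of the positions possible.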

Novel Architectural Elements

Learnable Position Assignment Module: A differentiable network (SwiGLU + Linear) inserted before attention to predict continuous position values from hidden states
Hybrid Layer Strategy: Applying RePo only to upper layers (e.g., from layer 5) while keeping standard fixed positions for lower layers
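The hybrid layer strategy amounts to a per-layer switch between fixed indices and learned positions. The sketch below is a simplification under assumptions: layer_positions is a hypothetical helper, layers are 1-indexed as in the pipeline description, and a single list of learned positions is passed in for brevity even though each RePo layer would predict its own positions from its hidden states.

```python
def layer_positions(layer_idx, seq_len, repo_positions, repo_start_layer=5):
    """Return the positions a given (1-indexed) layer would feed to its position encoding."""
    if layer_idx < repo_start_layer:
        return [float(t) for t in range(seq_len)]  # fixed linear indices 0, 1, 2, ...
    return list(repo_positions)                    # content-dependent positions

learned = [0.0, 0.4, 3.1, 0.9]  # made-up values for a 4-token sequence
for layer in (1, 4, 5, 16):
    print(layer, layer_positions(layer, 4, learned))
```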

Modeling

Base Model: OLMo-2 1B (comparable to Qwen-2.5)

Training Method: Continual Pre-training

Objective Functions:

Purpose: Standard language modeling.

Formally: Next-token prediction loss
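For reference, the next-token prediction loss is the mean cross-entropy between the model's predicted distribution at each position and the token that actually follows. A minimal sketch with made-up logits over a three-word vocabulary (the function name and values are illustrative, not taken from the paper):

```python
import math

def next_token_loss(logits, targets):
    """Mean cross-entropy; logits[t] scores the vocabulary for the token following position t."""
    total = 0.0
    for row, target in zip(logits, targets):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_z - row[target]  # negative log-probability of the true next token
    return total / len(targets)

logits = [[2.0, 0.5, -1.0], [0.1, 0.1, 0.1]]
targets = [0, 2]
loss = next_token_loss(logits, targets)
print(loss)
```

RePo adds no auxiliary loss in this description: the position-assignment parameters receive gradients only through this objective.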

Training Data:

50B tokens from OLMo-2 stage-2 data

Key Hyperparameters:

training_context_length: 4096 tokens
repo_start_layer: 5
position_representation_dim: 256
training_tokens: 50B
hardware: 4 H100 GPUs

Compute: Negligible overhead during inference (lightweight MLP)

Comparison to Prior Work

vs. RoPE: RePo learns positions dynamically based on content relevance rather than using fixed monotonic integers
vs. NoPE: RePo allows for distinct positions per token while NoPE effectively collapses them; RePo can learn NoPE-like behavior (constant positions) if optimal
vs. CoPE [not cited in paper]: CoPE increments positions based on a gate (soft count), while RePo predicts absolute continuous positions directly from hidden states
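The RoPE and NoPE comparisons can be made concrete with a single-frequency toy model, written with complex numbers so that a rotation is one multiplication. This is an illustrative sketch: the score function, the value of theta, and the position lists are assumptions, and the same query and key content is reused for every token so that only the positional effect varies.

```python
import cmath

def score(q, k, z_q, z_k, theta=1.0):
    """Single-frequency RoPE logit: Re[(q e^{i z_q theta}) * conj(k e^{i z_k theta})]."""
    q_rot = q * cmath.exp(1j * z_q * theta)
    k_rot = k * cmath.exp(1j * z_k * theta)
    return (q_rot * k_rot.conjugate()).real

q, k = complex(0.6, -0.2), complex(0.1, 0.9)
plain = (q * k.conjugate()).real  # content-only logit, no position signal

schemes = {
    "linear (RoPE-style)": [0.0, 1.0, 2.0, 3.0],
    "constant (NoPE-like)": [0.0, 0.0, 0.0, 0.0],
    "content-dependent (made-up RePo-style values)": [0.0, 0.3, 2.7, 0.4],
}
for name, z in schemes.items():
    # Logits from the last token (as query) to every token (as key)
    logits = [score(q, k, z[-1], z_j) for z_j in z]
    print(name, [round(v, 3) for v in logits])
```

With constant positions every relative distance is zero, so each logit collapses to the content-only value, which is the sense in which RePo can fall back to NoPE-like behavior. Linear positions reproduce the usual index-based rotation.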

Limitations

Requires continual pre-training (50B tokens used in experiments), not a plug-and-play inference-only modification
RePo uses the new positions only inside the position encoding and leaves the KV cache in its original order, which avoids KV cache re-computation; fully re-sorting the KV cache by the new positions would be computationally prohibitive
Evaluated primarily on 1B scale models; scaling laws to larger models not explicitly demonstrated in this paper

Reproducibility

Code: https://github.com/SakanaAI/repo

Based on the open-source OLMo-2 1B checkpoint. Training data and configuration are identical to the OLMo-2 release.

📊 Experiments & Results

Evaluation Setup

Continual pre-training on general data followed by zero-shot evaluation on specific tasks

Benchmarks:

RULER: noisy-context and long-context retrieval
LongBench: long-context understanding (QA, summarization, etc.)
NLGraph: graph reasoning (structured data)
HybridQA: table reasoning (structured data)
MMLU/HellaSwag/ARC: general short-context language understanding

Metrics:

Accuracy
Exact Match (EM)
Statistical methodology: Not explicitly reported in the paper
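As a reminder of what Exact Match measures, here is a minimal sketch. The normalisation steps (lowercasing, stripping punctuation and English articles) follow a common SQuAD-style convention and are an assumption; the summary does not say which normalisation the benchmarks apply.

```python
import re
import string

def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, references):
    """1.0 if the normalised prediction equals any normalised reference, else 0.0."""
    return float(any(normalize(prediction) == normalize(r) for r in references))

def em_score(predictions, references_per_item):
    """Exact Match as a percentage over a dataset."""
    hits = [exact_match(p, refs) for p, refs in zip(predictions, references_per_item)]
    return 100.0 * sum(hits) / len(hits)

print(em_score(["The Eiffel Tower.", "42"], [["eiffel tower"], ["43"]]))
```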

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
RePo significantly outperforms baselines on noisy context tasks within the training length (4K).
RULER (Noisy Context 4K)	Average Score	76.43	87.47	+11.04
RULER (Noisy Context 4K)	Average Score	59.81	87.47	+27.66
RePo demonstrates superior generalization to longer contexts (up to 16K) despite only being trained on 4K.
RULER (QA + NIAH 16K)	Exact Match (EM)	58.00	71.25	+13.25
LongBench	Average Score	24.93	30.41	+5.48
RePo helps with structured data tasks where linear order is less meaningful.
NLGraph + HybridQA	Average EM	47.78	49.72	+1.94
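The Δ column can be recomputed from the Baseline and This Paper columns. The short script below does so for the rows above and also prints each gain relative to its baseline, which is a derived quantity not reported in the table.

```python
rows = [
    # (benchmark, baseline, this paper, reported delta), copied from the table
    ("RULER (Noisy Context 4K)", 76.43, 87.47, 11.04),
    ("RULER (Noisy Context 4K)", 59.81, 87.47, 27.66),
    ("RULER (QA + NIAH 16K)", 58.00, 71.25, 13.25),
    ("LongBench", 24.93, 30.41, 5.48),
    ("NLGraph + HybridQA", 47.78, 49.72, 1.94),
]

for name, base, ours, reported in rows:
    delta = round(ours - base, 2)
    relative = 100.0 * (ours - base) / base  # gain as a percentage of the baseline
    consistent = abs(delta - reported) < 1e-9
    print(f"{name}: delta={delta:+.2f} consistent={consistent} relative={relative:.1f}%")
```

Absolute and relative gains rank the rows differently, so the structured-data improvement is best read as small on both scales, while the LongBench gain is modest in points but sizeable relative to its low baseline.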

Experiment Figures

Distribution of assigned position distances and patterns.

Main Takeaways

RePo's clearest gains are on noisy-context tasks (+11.04 points over RoPE on RULER at 4K), which is consistent with the claim that it reduces extraneous cognitive load.
The learned positions break locality bias: attention analysis shows RePo assigns 'needle' tokens positions closer to those of 'queries', enabling better retrieval.
The method generalizes to 4x training length (16K tokens) better than strong baselines like YaRN-extended RoPE.
Learned patterns are non-trivial: they are neither strictly linear nor constant, but a hybrid that adapts to the content structure (e.g., segmenting few-shot examples).

📚 Prerequisite Knowledge

Prerequisites

Transformer attention mechanisms
Position encodings (RoPE, ALiBi)
Cognitive Load Theory (basic familiarity helpful)

Key Terms

RePo: Context Re-Positioning—the proposed method that learns to assign continuous position values to tokens based on their content

RoPE: Rotary Position Embedding—a method encoding position information by rotating query and key vectors in embedding space

Extraneous Load: In Cognitive Load Theory, the mental effort imposed by the way information is presented or organized, which distracts from the actual learning or reasoning task

Germane Load: Mental effort dedicated to processing information and constructing schemas (useful reasoning), which RePo aims to maximize by reducing extraneous load

NoPE: No Position Encoding—a baseline where explicit position information is removed

NIAH: Needle-In-A-Haystack—a benchmark task testing a model's ability to retrieve a specific piece of information ('needle') buried in a long context ('haystack')

SwiGLU: A gated activation unit combining Swish activation and Gated Linear Units, used here to extract position representations

YaRN: Yet another RoPE extensioN, a method that extends the context window of RoPE-based models by rescaling RoPE frequency components