Neuro-Symbolic Language Modeling with Automaton-augmented Retrieval

📝 Paper Summary

Modularized RAG pipeline
Retrieval efficiency

RETOMATON approximates expensive nearest-neighbor searches in retrieval-based language models by traversing a weighted finite automaton built from datastore clusters and pointers.

Core Problem

Retrieval-based language models (like kNN-LM) perform a computationally expensive nearest-neighbor search over a massive datastore for every single generated token, causing severe inference latency.

Why it matters:

High inference costs hinder the practical deployment of retrieval-based models despite their accuracy gains
Standard kNN search is repetitive; retrieving a specific neighbor at step t strongly suggests which neighbors will be relevant at t+1, but current models ignore this structure
Current approximate methods often sacrifice perplexity for speed or require training auxiliary networks

Concrete Example: In the phrase 'The U.S. president is Joe Biden', if the model retrieves the context for 'The U.S. president is', the next token 'Joe' and the context that follows it sit in the adjacent position of the training corpus. Standard kNN-LM ignores this and runs a fresh search over the whole datastore after generating 'Joe', whereas RETOMATON simply follows a pointer to the next entry.

Key Novelty

RETOMATON (Retrieval Automaton)

Constructs a Weighted Finite Automaton (WFA) over the datastore by saving pointers between consecutive text entries and clustering similar contexts into states
Replaces frequent kNN searches with graph traversal: if the model follows a known path (pointer), it skips the search; if it deviates, it restarts with a fresh kNN search
Completely unsupervised construction that requires no additional training or auxiliary networks, unlike concurrent adaptive retrieval methods

Architecture

Concept of RETOMATON traversal. Shows how context 'The U.S.' triggers a kNN search, finding 'president' nodes. Instead of searching again for 'Joe', the model follows pointers (transitions) within the automaton.

Evaluation Highlights

Saves 81% of nearest neighbor searches on WIKITEXT-103 while matching standard kNN-LM perplexity
Improves perplexity from 16.65 to 16.08 (-0.57) on WIKITEXT-103 when searching at every step (FoSS=0), i.e. when used purely for accuracy enhancement; the largest reported gain, 1.85 points (12.34 to 10.49), is on Law-MT
Improves a fine-tuned model on Law-MT domain adaptation: 17.5% relative perplexity reduction (8.61 to 7.10) over the fine-tuned Transformer baseline

Breakthrough Assessment

8/10

Offers a significant efficiency breakthrough for kNN-LMs without requiring extra training. The graph-based approximation is an elegant neuro-symbolic solution to a brute-force bottleneck.

⚙️ Technical Details

Problem Definition

Setting: Autoregressive language modeling with access to an external datastore (key-value pairs of context embeddings and target tokens)

Inputs: Context sequence c(t) at time step t

Outputs: Probability distribution over the vocabulary for the next token w

Pipeline Flow

Datastore Construction: Save (Key, Value, Pointer) triplets
Automaton Construction: Cluster keys into States
Inference: Retrieve initial neighbors → Traverse Automaton (follow pointers) → Restart search if confidence/coverage drops

System Modules

Datastore Builder (Offline Construction)

Creates the index of context-token pairs, augmenting each with a pointer to the next consecutive entry in the corpus
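
The triplet layout can be illustrated with a minimal sketch. Everything named here (`Entry`, `build_datastore`, `toy_encode`) is hypothetical and not taken from the paper's code; in particular, `toy_encode` is a placeholder for the base LM's context representation.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence


@dataclass
class Entry:
    key: List[float]        # context representation (a toy vector here)
    value: str              # the token that followed this context
    pointer: Optional[int]  # index of the next consecutive entry, None at the end


def build_datastore(tokens: Sequence[str],
                    encode: Callable[[Sequence[str]], List[float]]) -> List[Entry]:
    """One entry per corpus position i >= 1: context tokens[:i] -> token tokens[i]."""
    entries = [
        Entry(key=encode(tokens[:i]), value=tokens[i], pointer=None)
        for i in range(1, len(tokens))
    ]
    # Entries are stored in corpus order, so the successor of entry j is entry j + 1.
    for j in range(len(entries) - 1):
        entries[j].pointer = j + 1
    return entries


def toy_encode(context: Sequence[str]) -> List[float]:
    # Assumption: stand-in for the LM encoder (context length, last-token length).
    return [float(len(context)), float(len(context[-1]))]


corpus = "The U.S. president is Joe Biden".split()
datastore = build_datastore(corpus, toy_encode)
```

In this toy corpus, the entry whose value is 'Joe' points at the entry whose value is 'Biden', which is the link the traversal later follows instead of searching.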

Model or implementation: Same architecture as Base LM

Clustering Engine (Offline Construction)

Groups datastore keys into discrete states to form the nodes of the automaton

Model or implementation: k-means or Greedy Clustering
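
As an illustration of the k-means option, the sketch below assigns toy keys to states with plain Lloyd iterations in pure Python. It is a small-scale stand-in under stated assumptions: the function names are hypothetical, the initialisation (first keys become centroids) is chosen only for determinism, and the summary indicates that FAISS is used for clustering in practice.

```python
from typing import Dict, List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_states(keys: List[List[float]], num_states: int,
                  iterations: int = 10) -> List[int]:
    """Assign each key to one of num_states clusters with plain Lloyd iterations."""
    # Assumption: deterministic toy initialisation from the first num_states keys.
    centroids = [list(k) for k in keys[:num_states]]
    assignment = [0] * len(keys)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        assignment = [
            min(range(num_states), key=lambda s: squared_distance(k, centroids[s]))
            for k in keys
        ]
        # Update step: move each centroid to the mean of its members.
        for s in range(num_states):
            members = [k for k, a in zip(keys, assignment) if a == s]
            if members:  # keep the old centroid if a cluster went empty
                centroids[s] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


def group_by_state(assignment: List[int]) -> Dict[int, List[int]]:
    """Map each state id to the datastore entry indices it contains."""
    states: Dict[int, List[int]] = {}
    for entry_index, state in enumerate(assignment):
        states.setdefault(state, []).append(entry_index)
    return states


keys = [[0.0, 0.1], [5.0, 5.1], [0.2, 0.0], [5.2, 4.9]]
assignment = kmeans_states(keys, num_states=2)
states = group_by_state(assignment)
```

Each resulting state groups entries with similar contexts, so a pointer out of any member can serve as a transition for the whole state.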

Traversal Engine

Decides whether to perform full kNN search or transition to next states based on pointers

Model or implementation: Algorithm utilizing threshold τ
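
The decision rule can be sketched as follows. This is a simplified reading, not the paper's algorithm verbatim: it assumes τ acts as a minimum number of entries that must survive pointer-following before a fresh search is skipped, it ignores cluster-level transitions, and `next_active_entries` and the `knn_search` callback are hypothetical names.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def next_active_entries(active: Sequence[int],
                        emitted: str,
                        values: Sequence[str],
                        pointers: Sequence[Optional[int]],
                        tau: int,
                        knn_search: Callable[[], List[int]]) -> Tuple[List[int], bool]:
    """Return (new active entries, whether a full kNN search was run)."""
    # Follow pointers only from entries whose stored token matches what was emitted.
    followed = [
        pointers[i] for i in active
        if values[i] == emitted and pointers[i] is not None
    ]
    if len(followed) >= tau:
        return followed, False  # enough coverage: skip the search
    # Too few survivors: restart with a fresh search, keeping what was followed.
    merged = list(dict.fromkeys(followed + knn_search()))
    return merged, True


# Toy datastore: two corpus fragments sharing the prefix "... president is".
values = ["president", "is", "Joe", "Biden",
          "president", "is", "elected", "nationally"]
pointers = [1, 2, 3, None, 5, 6, 7, None]

active = [0, 4]  # entries returned by an initial search
search_log = []
for token in ["president", "is", "Joe"]:
    active, searched = next_active_entries(
        active, token, values, pointers, tau=2, knn_search=lambda: [3])
    search_log.append(searched)
```

In this toy run the first two steps follow pointers from both fragments; only after 'Joe' does the number of surviving entries fall below τ and trigger a new search.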

Novel Architectural Elements

Augmentation of standard kNN-LM datastore with 'next-entry' pointers
Structuring the flat datastore as a Weighted Finite Automaton via unsupervised clustering
Hybrid inference mechanism switching between expensive kNN retrieval and cheap graph traversal

Modeling

Base Model: Transformer (247M parameters for WikiText-103, 656M for Law-MT)

Training Method: Unsupervised clustering (k-means) on frozen representations

Training Data:

WIKITEXT-103 (103M tokens)
Law-MT (19M tokens)

Key Hyperparameters:

k_clusters_wikitext: 1,000,000
k_clusters_law: 200,000
knn_k: 1024
max_knns_traversal: 1024
lambda_interpolation: Depends on dataset (standard kNN-LM parameter)

Compute: Run on 32 CPU cores and RTX 3090 or V100 GPUs. Inference saves up to 83% of kNN searches.

Comparison to Prior Work

vs. kNN-LM: RETOMATON adds structure (pointers/clusters) to avoid search at every step; kNN-LM searches every token
vs. ADAPT-RET: RETOMATON uses unsupervised pointers to approximate the retrieval distribution when skipping search; ADAPT-RET falls back to the base LM entirely when skipping search
vs. GNN-LM [not cited in paper]: RETOMATON builds the graph offline and unsupervised; GNN-LM typically trains graph networks on top of retrieval results

Limitations

Depends on the quality of the underlying datastore; if the datastore is poor, the automaton will propagate poor predictions
Clustering introduces a hyperparameter (number of clusters) that affects granularity and performance
Memory overhead of storing pointers and cluster assignments (though minimal compared to the dense vectors)
Wall-clock speedup depends heavily on hardware and specific kNN library optimization (FAISS)

Reproducibility

Code: https://github.com/neulab/retomaton

Code and trained models are publicly available in the repository above. FAISS is used for retrieval and clustering. Results are reported with FoSS, a metric that can be reproduced across hardware, in preference to wall-clock time.

📊 Experiments & Results

Evaluation Setup

Autoregressive language modeling and domain adaptation

Benchmarks:

WIKITEXT-103 (Language Modeling)
Law-MT (Domain Adaptation, English part)

Metrics:

Perplexity (lower is better)
FoSS (Fraction of Saved Searches)
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
WIKITEXT-103 results showing RETOMATON maintains perplexity while saving searches, or improves it when running fully.
WIKITEXT-103	Perplexity	16.65	16.65	0.00
WIKITEXT-103	Perplexity	16.65	16.08	-0.57
Domain Adaptation on Law-MT dataset compared to kNN-LM and ADAPT-RET.
Law-MT	Perplexity	12.34	10.49	-1.85
Law-MT	Perplexity	12.01	10.49	-1.52
Improving Fine-Tuned Models on Law-MT.
Law-MT	Perplexity	8.61	7.10	-1.51
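
The Δ column and the relative reductions quoted elsewhere in this summary can be rechecked from the rows above. The helper names below are illustrative; the numbers are copied from the table.

```python
from typing import List, Tuple

# (benchmark, baseline perplexity, this-paper perplexity), copied from the rows above.
ROWS: List[Tuple[str, float, float]] = [
    ("WIKITEXT-103", 16.65, 16.65),
    ("WIKITEXT-103", 16.65, 16.08),
    ("Law-MT", 12.34, 10.49),
    ("Law-MT", 12.01, 10.49),
    ("Law-MT", 8.61, 7.10),
]


def absolute_delta(baseline: float, ours: float) -> float:
    """Perplexity change; negative means the paper's number is lower (better)."""
    return round(ours - baseline, 2)


def relative_reduction_pct(baseline: float, ours: float) -> float:
    """Reduction as a percentage of the baseline perplexity."""
    return round(100.0 * (baseline - ours) / baseline, 1)


deltas = [absolute_delta(b, o) for _, b, o in ROWS]
relative = [relative_reduction_pct(b, o) for _, b, o in ROWS]
```

The last row gives the 17.5% relative reduction cited for the fine-tuned model, and the third row gives the 1.85-point absolute gain.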

Experiment Figures

Perplexity vs. Fraction of Saved Searches (FoSS) on WIKITEXT-103. Comparison between kNN-LM, ADAPT-RET, and RETOMATON.

Perplexity vs. FoSS on Law-MT (Domain Adaptation).

Main Takeaways

RETOMATON effectively decouples retrieval accuracy from retrieval frequency; it can maintain high accuracy even when skipping >80% of searches
The automaton structure provides value beyond speed: explicitly modeling transitions (pointers) improves perplexity even when searching every step (FoSS=0)
Crucial for domain adaptation: Unlike ADAPT-RET which reverts to the (poor) base LM when skipping search, RETOMATON continues to approximate the domain-specific kNN distribution via the automaton
Robustness: Baseline perplexity degrades sharply as search frequency drops, whereas RETOMATON's degrades much more gently

📚 Prerequisite Knowledge

Prerequisites

kNN-LM (k-Nearest Neighbors Language Model)
Finite Automata theory (states, transitions)
Vector clustering (k-means)
Language Model perplexity

Key Terms

kNN-LM: A language model that interpolates predictions from a base neural LM with a distribution formed by retrieving nearest neighbors from a datastore
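
A minimal sketch of that interpolation, following the standard kNN-LM formulation: the retrieved distribution is a softmax over negative neighbor distances, summed per target token, then mixed with the base LM using a weight λ. The function names, toy neighbors, and the value `lam=0.25` are illustrative assumptions.

```python
import math
from typing import Dict, List, Tuple


def knn_distribution(neighbors: List[Tuple[str, float]]) -> Dict[str, float]:
    """Softmax over negative distances, with mass summed per target token."""
    weights = [math.exp(-dist) for _, dist in neighbors]
    total = sum(weights)
    probs: Dict[str, float] = {}
    for (token, _), w in zip(neighbors, weights):
        probs[token] = probs.get(token, 0.0) + w / total
    return probs


def interpolate(p_lm: Dict[str, float], p_knn: Dict[str, float],
                lam: float) -> Dict[str, float]:
    """p(w) = lam * p_knn(w) + (1 - lam) * p_lm(w) over the union vocabulary."""
    vocab = set(p_lm) | set(p_knn)
    return {w: lam * p_knn.get(w, 0.0) + (1.0 - lam) * p_lm.get(w, 0.0)
            for w in vocab}


# Toy retrieved neighbors as (target token, distance); equal distances for clarity.
neighbors = [("Joe", 1.0), ("Joe", 1.0), ("Barack", 1.0), ("the", 1.0)]
p_knn = knn_distribution(neighbors)
p_lm = {"Joe": 0.2, "the": 0.8}
p_final = interpolate(p_lm, p_knn, lam=0.25)
```

With equal distances, 'Joe' receives half of the retrieved mass because two of the four neighbors carry it, which raises its final probability above the base LM's estimate.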

Datastore: A key-value store where keys are vector representations of text contexts (from the LM) and values are the subsequent tokens

FoSS: Fraction of Saved Searches—the percentage of time steps where an expensive kNN search is skipped in favor of automaton traversal
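
As a worked example of the metric, FoSS can be computed from a per-step log of whether a search was run. The function name and the toy log are illustrative.

```python
from typing import Sequence


def fraction_of_saved_searches(searched: Sequence[bool]) -> float:
    """Share of time steps on which the kNN search was skipped."""
    if not searched:
        raise ValueError("need at least one time step")
    skipped = sum(1 for did_search in searched if not did_search)
    return skipped / len(searched)


# Toy log for ten generated tokens: True means a full kNN search was run.
log = [True, False, False, False, True, False, False, False, False, False]
foss = fraction_of_saved_searches(log)
```

Here two of ten steps run a search, so FoSS is 0.8; a model that searches at every step has FoSS 0.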

WFA: Weighted Finite Automaton—a graph where edges have weights; here, weights are dynamic based on vector distances

RETOMATON: The proposed system that structures the datastore as an automaton to enable efficient traversal

Pointer: A direct link saved during datastore creation connecting a datastore entry (context, token) to the entry immediately following it in the corpus

Clustering: Grouping similar context vectors into 'states' to allow the automaton to generalize beyond exact verbatim memorization