Entities as Experts: Sparse Memory Access with Entity Supervision

📝 Paper Summary

Knowledge internalization, Memory recall, Sparse memory, QA

Entities as Experts (EAE) enhances Transformers by learning distinct memory representations for entities from text, accessing them sparsely only when those entities are mentioned.

Core Problem

Standard language models struggle to capture and access declarative knowledge about entities because they must construct representations from sub-word tokens (like "Charles" + "Darwin") rather than accessing a dedicated entity memory.

Why it matters:

Language models are increasingly used as knowledge bases, but standard architectures are inefficient at storing and retrieving specific factual knowledge
Previous methods rely on external, fixed entity embeddings (like Knowledge Graph embeddings), limiting the model to pre-existing knowledge bases rather than learning from text context
Dense access to all parameters for every token is computationally inefficient; sparse access allows scaling model capacity without proportional compute cost

Concrete Example: A standard Transformer sees "Charles Darwin" as separate tokens and might confuse it with "Charles River." EAE instead identifies the span "Charles Darwin" and retrieves a learned memory vector dedicated to that entity. The same entity memory lets it answer entity-centric questions like "Which Dr. Who villain has been played by...?"

Key Novelty

Entities as Experts (EAE)

Replaces dense parameter access with a sparse 'Entity Memory' layer that contains learned embeddings for specific entities (e.g., one vector for 'Paris', one for 'London')
Uses an internal mention detector to identify entity spans in text and route them to the correct 'expert' (entity memory) during the forward pass
Learns entity embeddings jointly with the rest of the network from raw text, rather than relying on pre-trained external knowledge base embeddings

Architecture

The EAE architecture interleaves Transformer layers with an Entity Memory layer.

Evaluation Highlights

Outperforms a T5 encoder-generator Transformer with about 10x as many parameters on TriviaQA
Achieves 43.2% Exact Match on open-domain TriviaQA with only ~367M parameters, surpassing T5-3B (34.4%)
Surpasses BERT-Large on the LAMA knowledge probe (T-REx subset) by 5.1 points (37.4 vs 32.3) despite a similar total parameter count

Breakthrough Assessment

8/10

Significant architectural innovation in sparse memory access. Demonstrates that learned entity memories outperform massive dense models (T5-3B) on knowledge-intensive tasks with far fewer parameters.

⚙️ Technical Details

Problem Definition

Setting: Context-aware masked language modeling and entity prediction

Inputs: A sequence of tokens x containing entity mentions m

Outputs: Predicted tokens (for MLM) and predicted entity IDs (for entity linking)

Pipeline Flow

Input Processing (Token Embeddings)
Initial Transformer Layers (Context Encoding)
Entity Memory Layer (Sparse Retrieval & Integration)
Final Transformer Layers (Reasoning)
Prediction Heads (Token & Entity Prediction)

System Modules

Initial Transformer

Encodes the local context of tokens before entity lookup

Model or implementation: Transformer (4 layers)

Entity Memory Layer

Identifies entity spans, retrieves their learned representations, and integrates them back into the token stream

Model or implementation: Differentiable Memory Lookup
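
The lookup performed by the Entity Memory Layer can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the projection matrices `W_f` and `W_b`, the choice of boundary tokens to build the query, and adding the retrieved vector to the first token of the mention are all assumptions made here so that the example is concrete.

```python
import numpy as np

def entity_memory_lookup(x, span, E, W_f, W_b, top_k=100):
    """Sparse entity-memory lookup for a single mention (illustrative sketch).

    x    : (seq_len, d_model) token states from the initial Transformer layers
    span : (start, end) inclusive token indices of the detected mention
    E    : (num_entities, d_ent) learned entity memory, one row per entity
    W_f  : (2 * d_model, d_ent) hypothetical projection into entity space
    W_b  : (d_ent, d_model) hypothetical projection back into token space
    """
    start, end = span
    # Pseudo-entity embedding: a query built from the span's boundary tokens.
    query = np.concatenate([x[start], x[end]]) @ W_f
    # Naive scoring against every entity; a MIPS index could replace this step.
    scores = E @ query
    k = min(top_k, E.shape[0])
    top = np.argpartition(scores, -k)[-k:]        # indices of the k best entities
    # Softmax restricted to the top-k entities (numerically stabilised).
    shifted = scores[top] - scores[top].max()
    weights = np.exp(shifted)
    weights /= weights.sum()
    retrieved = weights @ E[top]                   # weighted sum of k memory rows
    # Integrate the retrieved entity vector back into the token stream.
    out = x.copy()
    out[start] = out[start] + retrieved @ W_b
    best_entity = int(top[np.argmax(scores[top])])
    return out, best_entity
```

Only `top_k` rows of `E` take part in the weighted sum, which is what makes the access sparse; the scoring step above is still a full matrix-vector product and is the part an approximate nearest-neighbour index would speed up.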

Final Transformer

Processes the integrated token-entity representations

Model or implementation: Transformer (8 layers)

Task Heads

Performs masked token prediction and entity linking

Model or implementation: Linear Classifiers

Novel Architectural Elements

Intermediate 'Entity Memory' layer that interrupts the Transformer stack to inject sparse, learned entity representations
Differentiable routing mechanism that learns to construct 'pseudo-entity embeddings' from span representations to query the memory

Modeling

Base Model: Transformer (BERT-base architecture, split into 4 layers before and 8 layers after the Entity Memory layer)

Training Method: Multi-task learning: Masked Language Modeling + Entity Linking + Mention Boundary Detection

Objective Functions:

Purpose: Predict correct tokens for masked spans.

Formally: Cross-entropy loss over vocabulary V

Purpose: Ensure retrieved entity memory matches the ground-truth entity.

Formally: Softmax cross-entropy over the entity vocabulary, maximizing the dot product between span representation h_mi and correct entity embedding E_mi

Purpose: Identify start/end of mentions.

Formally: Cross-entropy loss over BIO labels
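
The three objectives above share one building block, softmax cross-entropy. The sketch below shows it for a single example, with the entity-linking term written as a cross-entropy over dot products between the span representation and every entity embedding. Summing the three terms with equal weight is an assumption of this sketch; the summary does not state how the losses are weighted.

```python
import numpy as np

def cross_entropy(logits, target):
    """Softmax cross-entropy for one example: -log softmax(logits)[target]."""
    logits = np.asarray(logits, dtype=float)
    shifted = logits - logits.max()               # stabilise the log-sum-exp
    return float(np.log(np.exp(shifted).sum()) - shifted[target])

def entity_linking_loss(h_m, E, gold_entity):
    """Cross-entropy over dot products between span vector h_m and all rows of E."""
    return cross_entropy(E @ h_m, gold_entity)

def total_loss(mlm_logits, mlm_target, h_m, E, gold_entity, bio_logits, bio_target):
    """Unweighted sum of the three objectives (the equal weighting is an assumption)."""
    return (cross_entropy(mlm_logits, mlm_target)            # masked token prediction
            + entity_linking_loss(h_m, E, gold_entity)       # entity linking
            + cross_entropy(bio_logits, bio_target))         # mention boundaries
```

Minimizing `entity_linking_loss` pushes the dot product with the gold entity's row above the dot products with all other rows, which is the sense in which the objective "maximizes" that dot product.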

Training Data:

English Wikipedia (32M contexts, 128 tokens)
Vocabulary: 1M most frequent entities
Mentions identified via hyperlinks and Google Cloud NLP API

Key Hyperparameters:

learning_rate: 1e-4
optimizer: Adam
entity_embedding_dim: 256
top_k_inference: 100
transformer_layers_pre_memory: 4
transformer_layers_post_memory: 8
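
For readers who want these settings in one place, they can be collected into a small configuration object. The class name `EaEConfig` and its derived properties are hypothetical, introduced only for illustration; `num_entities` is taken from the Training Data list, and any hyperparameter not listed in this summary is deliberately left out rather than guessed.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EaEConfig:
    """Hypothetical container for the hyperparameters listed in this summary."""
    learning_rate: float = 1e-4
    optimizer: str = "adam"
    entity_embedding_dim: int = 256
    top_k_inference: int = 100
    transformer_layers_pre_memory: int = 4
    transformer_layers_post_memory: int = 8
    num_entities: int = 1_000_000     # "1M most frequent entities" (Training Data)

    @property
    def total_transformer_layers(self) -> int:
        return self.transformer_layers_pre_memory + self.transformer_layers_post_memory

    @property
    def entity_memory_parameters(self) -> int:
        # One embedding row of size entity_embedding_dim per entity.
        return self.num_entities * self.entity_embedding_dim
```

With these values the entity memory alone holds 1,000,000 x 256 = 256M parameters, so most of the ~367M total reported above sits in the sparsely accessed memory rather than in the 12 Transformer layers.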

Compute: Not reported in the paper

Comparison to Prior Work

vs. KnowBERT/ERNIE: EAE learns entity embeddings from scratch during training rather than using fixed external embeddings
vs. T5: EAE uses sparse access to explicit entity memories, allowing better performance with far fewer parameters than T5's dense storage
vs. RELIC: EAE models all entities in a text simultaneously (contextualized), whereas RELIC encodes mentions independently

Limitations

Fixed entity vocabulary (cannot handle unseen entities or new entities added after training)
Cannot answer questions where the answer is not a named entity (e.g., dates or common nouns)
Performance drops significantly if the internal mention detector fails to link entities correctly
Computationally expensive if a naive top-k search over the full entity vocabulary is used (though optimized maximum inner product search, MIPS, is possible)

Reproducibility

No code release is reported. Pre-processing relies on the Google Cloud Natural Language API (commercial). Uses the English Wikipedia dump from 2019-04-14.

📊 Experiments & Results

Evaluation Setup

Evaluated on Knowledge Probing (LAMA), Open-Domain QA, and Relation Extraction

Benchmarks:

LAMA (Cloze-style knowledge probing)
TriviaQA (Open-domain Question Answering)
WebQuestions (Open-domain Question Answering)
TACRED (Relation Extraction)

Metrics:

Accuracy (Exact Match)
Perplexity (PPL)
F1 score

Statistical methodology: Not explicitly reported in the paper
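
The Exact Match metric listed above can be sketched as below. The normalization steps (lowercasing, stripping punctuation and English articles) follow a common open-domain QA convention and are an assumption here; the summary does not describe the paper's evaluation script.

```python
import re
import string

def normalize_answer(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, gold_answers):
    """True if the prediction equals any acceptable gold answer after normalization."""
    pred = normalize_answer(prediction)
    return any(pred == normalize_answer(gold) for gold in gold_answers)

def exact_match_score(predictions, gold_answer_lists):
    """Percentage of questions whose prediction exactly matches a gold answer."""
    hits = sum(exact_match(p, g) for p, g in zip(predictions, gold_answer_lists))
    return 100.0 * hits / len(predictions)
```

Because each question contributes either a full hit or nothing, the metric gives no partial credit for near misses.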

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Open-domain QA results showing EAE outperforms massive generative models and previous entity-augmented baselines.
TriviaQA (Open-Book)	Exact Match	42.3	43.2	+0.9
TriviaQA (Closed-Book)	Exact Match	34.4	43.2	+8.8
WebQuestions	Exact Match	37.4	39.0	+1.6
LAMA Knowledge Probing results demonstrate superior factual knowledge storage compared to BERT.
LAMA (T-REx)	Accuracy	32.3	37.4	+5.1
Relation Extraction results showing learned representations compete with explicit entity-aware architectures.
TACRED (Revised)	F1	79.3	80.6	+1.3
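
The Δ column in the table above is simply "This Paper" minus "Baseline". A short check, using only the numbers from the table, confirms that every reported difference is consistent:

```python
# (benchmark, baseline, this_paper, reported_delta), copied from the table above
rows = [
    ("TriviaQA (Open-Book)",   42.3, 43.2, 0.9),
    ("TriviaQA (Closed-Book)", 34.4, 43.2, 8.8),
    ("WebQuestions",           37.4, 39.0, 1.6),
    ("LAMA (T-REx)",           32.3, 37.4, 5.1),
    ("TACRED (Revised)",       79.3, 80.6, 1.3),
]

def delta(baseline, this_paper):
    """Improvement in points, rounded to one decimal place like the table."""
    return round(this_paper - baseline, 1)

for name, baseline, this_paper, reported in rows:
    assert delta(baseline, this_paper) == reported, name
```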

Experiment Figures

Performance analysis on TriviaQA based on answer frequency, number of named entities, and question length.

Main Takeaways

Correct identification and linking of entities is critical; performance drops significantly when the mention detector makes errors
Learned entity representations (trained from text) outperform fixed pre-trained embeddings (TransE, Deep-Ed) on knowledge probing tasks
Sparse memory access allows the model to scale capacity for entity knowledge without proportional increases in inference compute (using top-k retrieval)
EAE is highly effective for entity-centric questions but struggles with non-entity answers (dates, common nouns) where standard language models or T5 might still be preferable

📚 Prerequisite Knowledge

Prerequisites

Transformer architecture (attention mechanisms)
Masked Language Modeling (BERT)
Entity Linking / Named Entity Recognition
Memory Networks

Key Terms

Entity Memory: A learnable matrix where each row corresponds to a specific entity's representation, accessed only when that entity is mentioned in the text

Sparse Activation: A mechanism where the model only accesses a small subset of its parameters (specific entity memories) for a given input, rather than using all weights

Mention Masking: A training objective where entity names are masked out, forcing the model to use context to predict the correct entity

LAMA: LAnguage Model Analysis, a benchmark for probing the factual/declarative knowledge stored in language models

TriviaQA: A large-scale question answering dataset containing complex, compositional questions

BIO encoding: A tagging scheme (Beginning, Inside, Outside) used to mark the boundaries of entity mentions in text

Entity Linking: The task of assigning a unique identity (from a knowledge base) to an entity mention in text
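
The BIO encoding defined above can be illustrated with a pair of helper functions that convert between mention spans and per-token labels. The function names and the inclusive `(start, end)` span convention are choices made for this sketch, not something specified by the paper.

```python
def spans_to_bio(num_tokens, spans):
    """Convert inclusive (start, end) mention spans into per-token B/I/O labels."""
    labels = ["O"] * num_tokens
    for start, end in spans:
        labels[start] = "B"                       # first token of the mention
        for i in range(start + 1, end + 1):
            labels[i] = "I"                       # remaining tokens of the mention
    return labels

def bio_to_spans(labels):
    """Recover inclusive (start, end) spans from a B/I/O label sequence."""
    spans, start = [], None
    for i, label in enumerate(labels):
        if label == "B":
            if start is not None:                 # a new mention closes the previous one
                spans.append((start, i - 1))
            start = i
        elif label == "O" and start is not None:
            spans.append((start, i - 1))
            start = None
    if start is not None:                         # mention running to the end of the text
        spans.append((start, len(labels) - 1))
    return spans

tokens = ["Charles", "Darwin", "visited", "the", "Galapagos", "Islands"]
labels = spans_to_bio(len(tokens), [(0, 1), (4, 5)])
```

Here `labels` marks "Charles Darwin" and "Galapagos Islands" as two mentions, and `bio_to_spans(labels)` recovers the original spans.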