End-To-End Memory Networks

📝 Paper Summary

Memory-augmented neural networks, Recurrent attention mechanisms

The authors propose a continuous, end-to-end trainable neural network that performs multiple computational hops over an external memory to answer questions or predict language tokens.

Core Problem

Previous memory-augmented models required strong supervision (labels for which specific sentences support an answer) at each layer, making them difficult to apply in realistic settings where only input-output pairs are available.

Why it matters:

Realistic datasets rarely provide fine-grained supervision about which specific memory facts are relevant to a query
Standard RNNs/LSTMs struggle to capture very long-term dependencies compared to explicit memory storage
Models need to perform multiple reasoning steps (hops) to deduce answers from disparate pieces of information

Concrete Example: In a story where 'Sam walks into the kitchen' then 'Sam drops the apple', a model asked 'Where is the apple?' must deduce the location. Previous Memory Networks needed training labels explicitly marking 'Sam walks into the kitchen' as the supporting fact. This model learns to find it using only the final answer 'kitchen'.

Key Novelty

End-to-End Memory Network (MemN2N)

Replaces the hard max/ranking operations of previous Memory Networks with a continuous softmax attention mechanism, allowing gradients to backpropagate through memory accesses
Introduces a 'multi-hop' architecture where the model reads from memory multiple times, updating its internal query state after each hop so that it can chain several reasoning steps

Architecture

A single layer (a) and a stacked multi-layer (b) version of the End-To-End Memory Network.

Evaluation Highlights

Achieves 4.2% mean error on bAbI QA tasks (10k training set), close to the 3.2% of the strongly supervised baseline
Outperforms RNN and LSTM baselines on language modeling (Penn Treebank test perplexity 111 vs. 115 for LSTM/SCRN and 129 for a standard RNN)
Demonstrates that increasing memory hops (from 1 to 3+) consistently improves performance on both QA and language modeling tasks

Breakthrough Assessment

9/10

A foundational paper in memory-augmented networks. It helped establish the soft-attention memory read that later attention-based architectures (e.g., Transformers) build on, and showed that explicit memory could be trained end-to-end without strong supervision.

⚙️ Technical Details

Problem Definition

Setting: Synthetic Question Answering and Language Modeling

Inputs: A set of discrete inputs x1...xn (memory entries) and a query q

Outputs: An answer a (or next word prediction)

Pipeline Flow

Input Embedding (convert memory items x and query q to vectors)
Attention/Memory Lookup (compute match between query and memory slots)
Weighted Sum (retrieve output vectors based on attention weights)
State Update (combine query and retrieved content to form new query)
Repeat for K hops
Final Prediction (softmax over vocabulary)
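The pipeline above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: sentences are bag-of-words lists of word ids, the same A and C matrices are reused at every hop, and the names `memn2n_forward`, `embed_bow`, and the output matrix `W` (V x d) are hypothetical.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def embed_bow(matrix, word_ids):
    """Bag-of-words embedding: sum the columns of a d x V matrix selected by word_ids."""
    return [sum(row[w] for w in word_ids) for row in matrix]

def memn2n_forward(sentences, query, A, B, C, W, hops=3):
    """Return (answer distribution, attention weights for each hop)."""
    m = [embed_bow(A, s) for s in sentences]   # input memory vectors
    c = [embed_bow(C, s) for s in sentences]   # output memory vectors
    u = embed_bow(B, query)                    # initial query state
    attention_per_hop = []
    for _ in range(hops):
        p = softmax([dot(u, m_i) for m_i in m])             # memory lookup
        o = [sum(p_i * c_i[k] for p_i, c_i in zip(p, c))    # weighted sum
             for k in range(len(u))]
        u = [u_k + o_k for u_k, o_k in zip(u, o)]           # state update
        attention_per_hop.append(p)
    return softmax([dot(row, u) for row in W]), attention_per_hop

# Toy usage with random weights (assumed sizes: vocabulary V=6, dimension d=4).
rng = random.Random(0)
V, d = 6, 4

def rand_matrix(rows, cols):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

A, B, C, W = rand_matrix(d, V), rand_matrix(d, V), rand_matrix(d, V), rand_matrix(V, d)
sentences = [[0, 1, 2], [0, 3, 4]]   # two memory entries as word-id lists
query = [5, 4]
answer_probs, attention = memn2n_forward(sentences, query, A, B, C, W, hops=3)
```

Because every step is a sum, dot product, or softmax, the whole computation is differentiable, which is what allows training from answers alone.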

System Modules

Input Memory Representation (Memory Encoding)

Embed input sentences into memory vectors

Model or implementation: Embedding Matrix A (d x V)

Query Embedding (Memory Encoding)

Embed the question/query into an internal state

Model or implementation: Embedding Matrix B (d x V)

Output Memory Representation (Memory Retrieval)

Embed input sentences into output vectors (separate from input vectors)

Model or implementation: Embedding Matrix C (d x V)

Attention Mechanism (Memory Retrieval)

Calculate probability/relevance of each memory slot given query u

Model or implementation: p_i = Softmax(u^T m_i)

Controller/Recurrence

Update the query state using retrieved information for the next hop

Model or implementation: Additive update u^{k+1} = u^k + o^k; with layer-wise (RNN-like) weight tying, a linear map H is applied to the query first: u^{k+1} = H u^k + o^k
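Written out together, the modules above amount to the following equations. This is a sketch that assumes a bag-of-words sentence representation with one-hot word vectors x_{ij}, shares A and C across hops for notational simplicity, and uses W for the final prediction matrix (a symbol the module list does not name).

```latex
\begin{aligned}
m_i &= \sum_j A\,x_{ij}, \qquad c_i = \sum_j C\,x_{ij}, \qquad u^1 = \sum_j B\,q_j \\
p_i^k &= \mathrm{Softmax}\big((u^k)^\top m_i\big) \\
o^k &= \sum_i p_i^k\, c_i \\
u^{k+1} &= u^k + o^k \\
\hat{a} &= \mathrm{Softmax}\big(W\,(o^K + u^K)\big)
\end{aligned}
```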

Novel Architectural Elements

Continuous memory addressing via Softmax rather than Hard Max/Argmax
Multi-hop recurrent updating of the query vector u, where the output of one hop becomes the input to the next
Weight tying schemes across hops (Adjacent, or Layer-wise/RNN-like) to constrain the number of parameters
Temporal Encoding (T_A/T_C matrices) to explicitly model the relative order of memory entries
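As an illustration of the Adjacent tying scheme listed above, the sketch below assigns K+1 embedding matrices to K hops so that each hop's output embedding is the next hop's input embedding (A^{k+1} = C^k), the query embedding equals the first input embedding (B = A^1), and the answer matrix is the transpose of the last output embedding (W^T = C^K). The function name and the placeholder 1x1 matrices are hypothetical.

```python
def adjacent_tying(embeddings):
    """Map K+1 embedding matrices onto K hops under adjacent weight tying.

    Returns (per_hop, B, W_T) where per_hop[k] = (A, C) for hop k+1.
    """
    if len(embeddings) < 2:
        raise ValueError("need at least two embedding matrices for one hop")
    hops = len(embeddings) - 1
    per_hop = [(embeddings[k], embeddings[k + 1]) for k in range(hops)]
    B = embeddings[0]       # B = A^1
    W_T = embeddings[-1]    # W^T = C^K
    return per_hop, B, W_T

# Placeholder matrices stand in for real d x V embeddings (3 hops -> 4 matrices).
embeddings = [[[0.0]], [[1.0]], [[2.0]], [[3.0]]]
per_hop, B, W_T = adjacent_tying(embeddings)
```

The tied matrices are the same Python objects, so a gradient step applied to one would be seen by every role that shares it.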

Modeling

Base Model: Custom Memory Network Architecture (MemN2N)

Training Method: Supervised learning (End-to-End)

Objective Functions:

Purpose: Minimize prediction error.

Formally: Standard cross-entropy loss between predicted label â and true label a
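For a single example, the cross-entropy objective reduces to the negative log-probability the model assigns to the true answer. A minimal sketch, with a hypothetical function name and a small clamp added here only to avoid log(0):

```python
import math

def cross_entropy(predicted_probs, true_index):
    """Negative log-likelihood of the true label under the predicted distribution."""
    prob = max(predicted_probs[true_index], 1e-12)  # clamp is an added safeguard
    return -math.log(prob)

loss = cross_entropy([0.25, 0.5, 0.25], true_index=1)  # equals ln(2)
```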

Training Data:

bAbI QA dataset v1.1 (1k and 10k versions per task)
Penn Treebank (929k train words)
Text8 (93.3M train characters)

Key Hyperparameters:

learning_rate: 0.01
batch_size: 32
epochs: 100 (per-task training); 60 (1k set) / 20 (10k set) for joint training
embedding_dimension: 20 (independent training) or 50 (joint training)
hops: 3 (QA defaults), up to 7 (Language Modeling)
initialization: Gaussian (mean 0, sigma 0.1)
gradient_clip_norm: 40 (QA) or 50 (LM)
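The `gradient_clip_norm` setting can be read as rescaling any gradient whose L2 norm exceeds the threshold so that its norm equals the threshold. A minimal sketch on a flat list of gradient values (the function name is hypothetical):

```python
import math

def clip_by_norm(grad, max_norm=40.0):
    """Rescale grad so its L2 norm is at most max_norm; leave smaller gradients as-is."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)
    scale = max_norm / norm
    return [g * scale for g in grad]

clipped = clip_by_norm([30.0, 40.0])      # norm 50 is scaled down to norm 40
untouched = clip_by_norm([3.0, 4.0])      # norm 5 is below the threshold
```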

Compute: Not reported in the paper

Comparison to Prior Work

vs. MemNN: MemN2N is end-to-end differentiable (using softmax), requiring far less supervision (no need for supporting fact labels)
vs. RNNsearch: MemN2N performs multiple hops over the source sentence/memory per output symbol, whereas RNNsearch typically does one
vs. Neural Turing Machine: MemN2N uses a simpler addressing mechanism (content-only + temporal encoding) and is applied to textual reasoning rather than sorting/copying algorithms

Limitations

Still underperforms strongly supervised MemNN on some complex reasoning tasks (e.g., path finding)
High variance in performance depending on initialization (requires multiple runs to select best)
Soft attention lookup (scanning entire memory) may not scale well to very large memories compared to hashing or hard attention
Position Encoding (PE) is a fixed heuristic representation, not a learned recurrent processing of the sentence text itself
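To make the last limitation concrete, the fixed Position Encoding weights can be sketched as follows. The formula l_kj = (1 - j/J) - (k/d)(1 - 2j/J), with 1-based word position j out of J words and embedding dimension k out of d, is the one given in the paper; the function names and the all-ones embedding matrix are illustrative assumptions.

```python
def position_encoding(num_words, dim):
    """Return l[j][k]: the weight for word position j+1 and embedding dimension k+1."""
    J, d = num_words, dim
    return [[(1 - j / J) - (k / d) * (1 - 2 * j / J) for k in range(1, d + 1)]
            for j in range(1, J + 1)]

def encode_sentence(A, word_ids):
    """Position-weighted sum of word embeddings; A is a d x V matrix."""
    d = len(A)
    weights = position_encoding(len(word_ids), d)
    return [sum(weights[j][k] * A[k][w] for j, w in enumerate(word_ids))
            for k in range(d)]

pe = position_encoding(num_words=2, dim=2)
A_ones = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]   # toy d=2, V=3 embedding matrix
memory_vector = encode_sentence(A_ones, [0, 2])
```

Unlike a bag-of-words sum, the weights depend on word position, so reordering the words changes the memory vector. They are computed rather than learned, which is the limitation noted above.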

Reproducibility

Code: https://github.com/facebook/MemNN

The code is publicly available and the datasets are standard (bAbI, PTB, Text8). 'Linear Start' (LS) training, which removes the memory-layer softmax initially and re-inserts it once validation loss stops decreasing, is crucial for avoiding local minima in some tasks. Random noise injection (adding dummy empty memories during training) helps regularize temporal encoding.
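One way to picture Linear Start is as a switch on the memory attention: during the initial phase the raw match scores are used directly, and the softmax is re-inserted once validation loss stops improving. The sketch below is only an illustration; both function names and the simple "no improvement over the best earlier loss" trigger are assumptions, not the authors' exact procedure.

```python
import math

def attention_weights(scores, use_softmax=True):
    """Memory weights: raw scores during linear start, softmax afterwards."""
    if not use_softmax:
        return list(scores)            # linear phase: no normalization
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def should_reinsert_softmax(validation_losses):
    """True once the latest validation loss fails to beat the best earlier one."""
    if len(validation_losses) < 2:
        return False
    return validation_losses[-1] >= min(validation_losses[:-1])

linear_phase = attention_weights([1.0, 2.0], use_softmax=False)
softmax_phase = attention_weights([1.0, 2.0], use_softmax=True)
```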

📊 Experiments & Results

Evaluation Setup

Synthetic Question Answering on bAbI dataset and Language Modeling on PTB/Text8

Benchmarks:

bAbI Dataset (synthetic QA, 20 tasks)
Penn Treebank (PTB) (Language Modeling)
Text8 (Language Modeling)

Metrics:

Error rate (%)
Perplexity
Statistical methodology: Not explicitly reported in the paper
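Both metrics are simple to compute. The sketch below uses hypothetical function names: error rate is the percentage of wrong answers, and perplexity is the exponential of the average negative log-probability assigned to the true tokens.

```python
import math

def error_rate(predictions, targets):
    """Percentage of predictions that differ from the target answers."""
    wrong = sum(1 for p, t in zip(predictions, targets) if p != t)
    return 100.0 * wrong / len(targets)

def perplexity(true_token_probs):
    """exp of the mean negative log-probability of the true tokens."""
    avg_nll = -sum(math.log(p) for p in true_token_probs) / len(true_token_probs)
    return math.exp(avg_nll)

err = error_rate(["kitchen", "garden", "hall", "office"],
                 ["kitchen", "garden", "hall", "bedroom"])
ppl = perplexity([0.25, 0.25, 0.25, 0.25])
```

A model that assigns probability 0.25 to every true token has a perplexity of about 4, matching the intuition of choosing uniformly among four options.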

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
QA performance on bAbI tasks (1k training examples). MemN2N approaches strongly supervised performance and beats weakly supervised baselines.
bAbI (1k train)	Mean Error (%)	6.7 (strongly supervised MemNN)	12.6	+5.9
bAbI (1k train)	Mean Error (%)	51.3 (LSTM)	12.6	-38.7
QA performance on bAbI tasks (10k training examples). With more data, the gap to strong supervision narrows significantly.
bAbI (10k train)	Mean Error (%)	3.2 (strongly supervised MemNN)	4.2	+1.0
Language Modeling results showing MemN2N outperforms tuned RNN/LSTM baselines.
Penn Treebank	Test Perplexity	129 (RNN)	111	-18
Penn Treebank	Test Perplexity	115 (LSTM)	111	-4
Text8	Test Perplexity	154 (LSTM)	147	-7
Impact of multiple hops on Language Modeling.
Penn Treebank	Valid. Perplexity	128	120	-8

Experiment Figures

Visualization of attention weights (p) for 3 memory hops on QA tasks.

Average activation weight of memory positions during 6 memory hops on Language Modeling tasks.

Main Takeaways

Multiple computational hops are crucial for performance; results consistently improve as hops increase from 1 to 3+.
Position Encoding (PE) and Linear Start (LS) are critical architectural/training choices; LS prevents local minima in difficult tasks.
The model successfully learns to attend to relevant sentences without explicit supervision labels, as visualized in attention weight heatmaps.
For language modeling, the attention mechanism operates like a smoothed n-gram model + cache, with hops alternating between recent words and broader context.

📚 Prerequisite Knowledge

Prerequisites

Basic Recurrent Neural Networks (RNNs) and LSTMs
Embedding layers (converting discrete tokens to vectors)
Softmax function and backpropagation
Bag-of-words vs. Position-aware representations

Key Terms

Memory Network: A class of neural networks with an explicit external memory component that can be read from and written to

Softmax: A function that converts a vector of numbers into a probability distribution

bAbI: A set of 20 synthetic question-answering tasks designed to test different types of reasoning (deduction, induction, counting, etc.)

Perplexity: A measurement of how well a probability model predicts a sample; lower values indicate better performance

Strong supervision: Training where the model is told exactly which sentences in a story are relevant to the answer

Weak supervision: Training where the model is only given the final answer and must figure out which inputs were relevant

Hop: A single computational step of reading from memory and updating the internal state

BoW: Bag-of-Words—a representation of text that disregards grammar and word order but keeps multiplicity

PE: Position Encoding—a method to inject word ordering information into the embedding by weighting words based on their position in the sentence

RNNsearch: An earlier neural machine translation architecture using attention, similar to the mechanism used here

LS: Linear Start—a training trick where softmax layers are initially removed (making the model linear) to avoid local minima