Generation Constraint Scaling Can Mitigate Hallucination

📝 Paper Summary

Memory recall, Hallucination suppression

By simply scaling the length of memory readout vectors in the Larimar architecture, hallucination can be significantly reduced without retraining the model.

Core Problem

Large Language Models often hallucinate facts during generation, and existing mitigation techniques like model editing or context-grounding are computationally expensive or require retraining.

Why it matters:

Hallucinations undermine trust in LLMs for factual tasks like biography generation.
Current solutions like GRACE require expensive iterative backpropagation to update model weights or adapters.
There is a need for lightweight, training-free methods to enforce factual consistency using internal model representations.

Concrete Example: When generating a biography for 'Sir John Russell Reynolds', a standard model hallucinates he was 'born in London' (incorrect). The proposed method scales the memory readout vector, forcing the decoder to align with the correct fact ('born in Romsey') stored in memory, producing a factual output.

Key Novelty

Geometry-Aware Vector Scaling in Memory Readouts

Observes that the Larimar decoder geometrically distorts memory readout vectors, shrinking them and altering their angles relative to the original 'write' vectors.
Proposes manually scaling up the magnitude (length) of the readout vector by a fixed factor before feeding it to the decoder.
This simple geometric operation aligns the readout closer to the original memory encoding, effectively constraining the decoder to stick to the stored facts without any training.
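The geometric effect of scaling can be illustrated with a minimal sketch. The vectors below are toy values chosen for illustration (an assumption, not data from the paper), and the helper names are hypothetical. Multiplying by a positive scalar changes the norm and the Euclidean distance to the write vector, while the angle between the two vectors stays the same.

```python
import math

def norm(v):
    """Euclidean (L2) length of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))

def cosine(u, v):
    """Cosine of the angle between two vectors."""
    return sum(a * b for a, b in zip(u, v)) / (norm(u) * norm(v))

def distance(u, v):
    """Euclidean distance between two vectors."""
    return norm([a - b for a, b in zip(u, v)])

def scale(v, s):
    """Multiply every component of v by the scalar s."""
    return [s * x for x in v]

# Toy vectors (assumed values, not taken from the paper).
write = [4.0, 3.0, 0.0]      # stands in for the vector written to memory
readout = [1.0, 1.0, 0.0]    # stands in for a shorter readout vector
scaled = scale(readout, 3.0)

print("norm before/after:    ", norm(readout), norm(scaled))
print("distance before/after:", distance(readout, write), distance(scaled, write))
print("cosine before/after:  ", cosine(readout, write), cosine(scaled, write))
```

In this toy case the scaled readout is three times longer and closer to the write vector in Euclidean distance, and the cosine is identical before and after scaling.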

Architecture

The Larimar pipeline for hallucination mitigation.

Evaluation Highlights

Achieves a RougeL score of 0.72 on the WikiBio hallucination benchmark, a 46.9% relative improvement over the GRACE baseline (0.49).
Jaccard similarity improves from 0.33 (base Larimar) to ~0.65-0.70 with scaling factor s=4, significantly outperforming GRACE (0.44).
Synthesis is roughly 50x faster than GRACE (3.1s vs 162.5s per entry), i.e. between one and two orders of magnitude, because it avoids iterative backpropagation.
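The headline percentages follow directly from the reported numbers. The short calculation below reproduces them; the function names are illustrative, and the inputs are the figures quoted in this summary.

```python
import math

def relative_improvement(new, old):
    """Relative gain of `new` over `old`, in percent."""
    return (new - old) / old * 100.0

def speedup(slow_seconds, fast_seconds):
    """How many times faster the fast method is."""
    return slow_seconds / fast_seconds

rouge_gain = relative_improvement(0.72, 0.49)   # RougeL: scaled Larimar vs GRACE
time_ratio = speedup(162.5, 3.1)                # seconds per entry: GRACE vs Larimar

print(f"RougeL relative improvement: {rouge_gain:.1f}%")
print(f"Speed-up: {time_ratio:.1f}x ({math.log10(time_ratio):.2f} orders of magnitude)")
```

The relative improvement comes out at about 46.9%, and the speed-up at about 52x, which sits between one and two orders of magnitude.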

Breakthrough Assessment

7/10

Simple, highly effective training-free intervention with massive speed gains over SOTA model editing. However, heavily reliant on the specific Larimar architecture.

⚙️ Technical Details

Problem Definition

Setting: Constrained text generation where a model must produce output factually consistent with a specific prompt-input pair stored in memory.

Inputs: A prompt (hallucinated sentence) and a target input (correct factual sentence) written to memory.

Outputs: A generated biography entry corrected to match the factual input.

Pipeline Flow

Memory Write: Encoder → Latent Representation → Memory Matrix
Memory Read: Prompt → Encoder → Query Memory → Readout Vector
Geometric Intervention: Readout Vector * Scaling Factor s
Generation: Scaled Readout → Decoder → Output Text
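The read, scale, and generate steps above can be sketched as a composition of functions. This is only a structural sketch: `encode`, `read_memory`, and `decode` are hypothetical placeholders for the Larimar encoder, memory read, and decoder, not its actual API, and the memory write is assumed to have happened beforehand. Only the scaling step is concrete.

```python
from typing import Callable, List

Vector = List[float]

def scale_readout(readout: Vector, s: float) -> Vector:
    """Geometric intervention: multiply the readout vector by the scalar s."""
    return [s * x for x in readout]

def generate_with_scaling(
    prompt: str,
    encode: Callable[[str], Vector],          # hypothetical: text -> latent query
    read_memory: Callable[[Vector], Vector],  # hypothetical: query -> readout vector
    decode: Callable[[Vector], str],          # hypothetical: readout -> text
    s: float = 4.0,                           # within the 3 to 4 range reported as optimal
) -> str:
    query = encode(prompt)
    readout = read_memory(query)
    scaled = scale_readout(readout, s)
    return decode(scaled)

# Toy stand-ins so the sketch runs end to end (assumptions, not real models).
def toy_encode(text: str) -> Vector:
    return [float(len(text)), 1.0]

def toy_read_memory(query: Vector) -> Vector:
    return [0.5 * x for x in query]   # pretend the readout comes back shorter

def toy_decode(z: Vector) -> str:
    return "decoded from " + repr(z)

result = generate_with_scaling("abcd", toy_encode, toy_read_memory, toy_decode, s=4.0)
print(result)
```

Passing the components in as callables keeps the intervention separate from the model, which mirrors the point that no weights are changed.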

System Modules

Encoder (Memory Interface)

Encodes input text into latent representations for writing to or querying memory.

Model or implementation: BERT-large

Associative Memory (Memory Interface)

Stores compressed representations of text episodes.

Model or implementation: 512x768 Memory Matrix

Geometric Scaler

Scales the magnitude of the readout vector to align it with the write vector geometry.

Model or implementation: Scalar multiplication

Decoder

Generates text conditioned on the memory readout.

Model or implementation: GPT2-large

Novel Architectural Elements

Geometry-aware scaling block inserted between memory readout and decoder injection.

Modeling

Base Model: Larimar-1.3B (BERT-large encoder + GPT2-large decoder)

Training Method: Training-free geometric intervention (vector scaling)

Adaptation: None (inference-time intervention only)

Trainable Parameters: 0 (during intervention)

Key Hyperparameters:

scaling_factor_s: 3 to 4 (empirically determined optimal range)
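One way to picture the empirical tuning is a small grid sweep over candidate values of s. The sketch below is illustrative only: the paper reports tuning against RougeL and Jaccard scores, whereas this example uses negative Euclidean distance to a toy write vector as a stand-in score. The function name, vectors, and candidate grid are all assumptions.

```python
import math

def sweep_scaling_factor(readout, candidates, score):
    """Return (best_s, best_score); `score` maps a scaled vector to a float, higher is better."""
    best = None
    for s in candidates:
        scaled = [s * x for x in readout]
        value = score(scaled)
        if best is None or value > best[1]:
            best = (s, value)
    return best

# Toy vectors (assumed values, not taken from the paper).
write = [3.5, 3.0]
readout = [1.0, 1.0]

def neg_distance_to_write(v):
    """Proxy score: closer to the write vector is better."""
    return -math.dist(v, write)

best_s, best_value = sweep_scaling_factor(
    readout, [1.0, 2.0, 3.0, 4.0, 5.0], neg_distance_to_write
)
print("best s:", best_s, "score:", best_value)
```

In a real evaluation the `score` callable would wrap generation plus a text metric, which is far more expensive than this proxy.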

Compute: Larimar inference takes 3.1 seconds per WikiBio entry, versus 162.5 seconds for the GRACE baseline. The hardware is not specified in the paper.

Comparison to Prior Work

vs. GRACE: Training-free (vector scaling) vs. optimization-based (backpropagation); roughly 50x faster (3.1s vs 162.5s per entry).
vs. Context-grounding: Manipulates latent memory representations directly rather than token-level context [not cited in paper]
vs. ROME/MEMIT: Does not permanently alter model weights, only conditions generation via memory state [not cited in paper]

Limitations

Strictly dependent on the Larimar architecture (encoder-memory-decoder); not directly applicable to standard decoder-only LLMs without external memory.
Requires empirical tuning of the scaling factor s.
Evaluation limited to one hallucination benchmark (WikiBio).

Reproducibility

Code availability is not explicitly provided in the paper text. The method relies on the Larimar model (which has its own citation/codebase) and the WikiBio hallucination benchmark (publicly available on HuggingFace). Implementation of the scaling is a simple vector operation.

📊 Experiments & Results

Evaluation Setup

Correction of hallucinated Wikipedia biographies using correct facts stored in memory.

Benchmarks:

WikiBio Hallucination Benchmark (Factuality correction / Hallucination mitigation)

Metrics:

RougeL score
Jaccard similarity
Statistical methodology: Not explicitly reported in the paper
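For readers unfamiliar with the two metrics, here is a simplified sketch of both. It assumes lowercase whitespace tokenization and computes ROUGE-L as an F1 over the longest common subsequence. The paper's exact tokenization and ROUGE variant are not stated in this summary, so scores from this sketch may differ from the reported ones.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_f1(reference, candidate):
    """ROUGE-L F1 on whitespace tokens (simplified: no stemming)."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

def jaccard(reference, candidate):
    """Jaccard similarity of the two token sets: |A & B| / |A | B|."""
    a, b = set(reference.lower().split()), set(candidate.lower().split())
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

reference = "he was born in romsey"
candidate = "he was born in london"
print("ROUGE-L F1:", rouge_l_f1(reference, candidate))
print("Jaccard:   ", jaccard(reference, candidate))
```

The example pair differs in a single token, so both metrics stay fairly high even though the one changed token is the factual error. This is worth keeping in mind when reading overlap scores as a proxy for factuality.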

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
WikiBio	RougeL	0.49	0.72	+0.23
WikiBio	Jaccard similarity	0.44	0.65	+0.21
WikiBio Synthesis Time	Seconds per entry	162.5	3.1	-159.4
WikiBio	RougeL	0.39	0.79	+0.40

Experiment Figures

Histograms of geometric properties (distance, angle, norm) between readout, write, and generate vectors.

Impact of scaling factor 's' on RougeL/Jaccard scores and geometric alignment.

Main Takeaways

Geometric alignment (scaling) of memory readouts significantly improves generation factuality without retraining.
Larimar's default decoding process shrinks readout vectors, causing misalignment with stored facts; correcting this magnitude restores fidelity.
The method is vastly more efficient (seconds vs minutes) than optimization-based editing methods like GRACE.
Optimal scaling factor is consistent across samples (around s=4).

📚 Prerequisite Knowledge

Prerequisites

Understanding of Key-Value (KV) caches in Transformers
Basic vector geometry (L2 norm, dot products/angles)
Memory-augmented Neural Networks (MANNs)

Key Terms

Larimar: A specific LLM architecture augmented with an external episodic memory controller, allowing read/write access to latent representations.

readout vector: The vector retrieved from the external memory in Larimar, acting as a compressed KV cache to condition the decoder.

GRACE: General Retrieval Adaptors for Continual Editing, a model editing method that adds a codebook adapter to specific layers to fix errors without changing base weights.

RougeL: A metric measuring the longest common subsequence between reference and generated text, used here to assess factual overlap.

Jaccard similarity: The size of the intersection of two sets divided by the size of their union; here the sets are the tokens of the reference and generated text.

hallucination: Generated content that is nonsensical or unfaithful to the provided source content/facts.

KV cache: Key-Value cache; stored intermediate states in Transformer models used to speed up generation.

SOTA: State-of-the-Art; the current best performance for a specific task.