RECITE: Recitation-Augmented Language Models

📝 Paper Summary

Modularized RAG pipeline, Knowledge internalization

RECITE improves Large Language Models' ability to answer factual questions by prompting them to first self-generate (recite) relevant knowledge passages from their own parameters before producing an answer.

Core Problem

LLMs struggle to directly retrieve exact factual knowledge from their parameters for QA tasks because the task format differs from the causal language modeling pre-training objective.

Why it matters:

Direct generation often leads to hallucinations or incorrect answers even if the model 'knows' the fact in its weights
Retrieval-augmented models require external corpora and indexes, which may not always be available or up-to-date
Standard few-shot prompting does not leverage the intermediate 'study' or 'recitation' step humans use to recall complex facts

Concrete Example: When asked 'What is the tenth decimal of pi?', a model might fail to answer '5' directly. However, it can successfully complete the sequence 'The first 10 decimals of pi are 3.1415926535...' and then read the answer off the recited digits. RECITE mimics this intermediate step.
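The recite-then-deduce idea in this example can be mimicked with a toy lookup. In the sketch below, a hard-coded string stands in for the model's recitation, and the helper name `nth_decimal` is an illustrative assumption, not something from the paper.

```python
# Toy illustration: the "recitation" is a memorized string and the answer
# is read off it, rather than being produced directly.
RECITED = "3.1415926535"  # stands in for a model-generated recitation


def nth_decimal(recited: str, n: int) -> str:
    """Return the n-th digit after the decimal point of a recited number."""
    decimals = recited.split(".", 1)[1]
    return decimals[n - 1]


tenth = nth_decimal(RECITED, 10)
```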

Key Novelty

Recitation-Augmented Generation (RECITE)

Decomposes QA into two steps: (1) Recitation (generating relevant passages from model memory) and (2) Answering (using the recited passage to answer the question)
Leverages 'fuzzy memorization' where the model generates approximate but factually correct context, rather than retrieving exact strings from a database
Utilizes passage hints (section titles) to diversify the generated recitations, ensuring coverage of different potential knowledge sources within the model
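The hint-based diversification described above might look roughly like the following. This is a minimal sketch, assuming two hypothetical callables (`sample_hint` and `recite_from_hint`) that wrap the fine-tuned model; neither name comes from the paper, and deduplicating hints by exact string match is an assumption of this sketch.

```python
from typing import Callable, Dict


def diversified_recitations(
    question: str,
    sample_hint: Callable[[str], str],       # hypothetical: question -> sampled passage hint
    recite_from_hint: Callable[[str], str],  # hypothetical: hint -> recited passage
    num_samples: int = 20,
) -> Dict[str, str]:
    """Sample hints, keep each distinct hint once, and recite one passage per hint."""
    passages: Dict[str, str] = {}
    for _ in range(num_samples):
        hint = sample_hint(question).strip()
        # Skip empty or repeated hints so every recitation covers a different source.
        if hint and hint not in passages:
            passages[hint] = recite_from_hint(hint).strip()
    return passages
```

Because repeated hints are skipped, the number of recitations returned can be smaller than `num_samples`.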

Architecture

Conceptual flow of the Recitation-Augmented Generation process compared to direct generation

Evaluation Highlights

Achieves state-of-the-art performance on Natural Questions (64-shot) with 31.34 EM using PaLM-62B, surpassing direct generation (28.98 EM)
Outperforms standard prompting on TriviaQA (5-shot) with Codex, achieving 83.50 EM compared to 81.84 EM
RECITE with PaLM-62B (4-shot) scores 26.46 EM on HotpotQA, outperforming standard prompting (20.51 EM) and Chain-of-Thought (23.73 EM)

Breakthrough Assessment

7/10

Strong conceptual contribution showing LLMs can self-retrieve knowledge without external indices. Significant gains on closed-book QA, though reliance on model scale and hallucination risks in recitation remain limitations.

⚙️ Technical Details

Problem Definition

Setting: Few-shot Closed-Book Question Answering (CBQA)

Inputs: A natural language question Q and a few exemplar pairs

Outputs: A generated answer A

Pipeline Flow

Recitation Generation (Recite relevant passages)
Answer Generation (Condition on Recitation)
Ensemble (Self-consistency voting)
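The first two steps of this flow can be sketched as a loop over sampled paths. The `SampleFn` interface, the `Path` tuple, and the prompt templates below are assumptions made for illustration; the paper's actual prompts are in its appendix and are not reproduced here. The third step then reduces to a majority vote over the `answer` fields of the returned paths.

```python
from typing import Callable, List, NamedTuple

# Hypothetical interface: takes a prompt, returns one sampled completion.
SampleFn = Callable[[str], str]


class Path(NamedTuple):
    recitation: str
    answer: str


def sample_paths(
    question: str,
    sample: SampleFn,
    recite_exemplars: str = "",
    answer_exemplars: str = "",
    num_paths: int = 20,
) -> List[Path]:
    """Sample `num_paths` independent recitation-then-answer paths."""
    paths: List[Path] = []
    for _ in range(num_paths):
        # Step 1: ask the model to recite a relevant passage from memory.
        recitation = sample(f"{recite_exemplars}Question: {question}\nRecitation:").strip()
        # Step 2: answer the question conditioned on that recitation.
        answer = sample(
            f"{answer_exemplars}Recitation: {recitation}\nQuestion: {question}\nAnswer:"
        ).strip()
        paths.append(Path(recitation, answer))
    return paths
```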

System Modules

Evidence-Recitation Module

Generate relevant passages from the model's own memory via sampling

Model or implementation: LLM (PaLM, UL2, OPT, or Codex)

Question-Answering Module

Generate the final answer based on the recited evidence

Model or implementation: Same LLM as Recitation Module

Self-Consistency Ensemble

Select final answer via majority vote over multiple sampled recitation-answer paths

Model or implementation: Deterministic algorithm
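A minimal sketch of the voting step, assuming the sampled answers arrive as plain strings. The light normalization (lowercasing and whitespace collapsing) is an assumption of this sketch, added so that trivially different surface forms count as the same vote; the summary does not say how answers are compared.

```python
from collections import Counter
from typing import Iterable, Tuple


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace (an assumed, deliberately simple scheme)."""
    return " ".join(answer.lower().split())


def majority_vote(answers: Iterable[str]) -> Tuple[str, float]:
    """Return the most frequent normalized answer and its share of the votes."""
    normalized = [normalize(a) for a in answers if a.strip()]
    if not normalized:
        raise ValueError("no non-empty answers to vote over")
    winner, votes = Counter(normalized).most_common(1)[0]
    return winner, votes / len(normalized)
```

The vote share returned alongside the winner is not part of the described method; it is included only as a convenient signal of how strongly the sampled paths agree.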

Novel Architectural Elements

Two-step 'Recite-and-Answer' inference topology where the model queries its own weights for intermediate evidence before answering
Multiple-recite-and-answer topology for multi-hop QA, generating sequential recitations covering different topics in one pass
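For the multi-hop variant, a single completion carries several recitations followed by the answer. The parser below is a sketch under an assumed output layout (`Recitation N: ...` lines followed by an `Answer: ...` line); that layout and the function name are hypothetical and may differ from the paper's prompts.

```python
import re
from typing import List, Tuple


def parse_multi_recitation(output: str) -> Tuple[List[str], str]:
    """Split one completion into its sequential recitations and the final answer."""
    recitations = [
        text.strip()
        for text in re.findall(r"^Recitation \d+:\s*(.+)$", output, flags=re.MULTILINE)
    ]
    match = re.search(r"^Answer:\s*(.+)$", output, flags=re.MULTILINE)
    # An empty answer signals that the completion did not follow the assumed layout.
    answer = match.group(1).strip() if match else ""
    return recitations, answer
```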

Modeling

Base Model: PaLM-62B, UL2-20B, OPT-30B, Codex (code-davinci-002)

Training Method: Fine-tuning on synthetic data (for Passage Hint variant only)

Objective Functions:

Purpose: Train the model to map questions to specific passage hints (section titles).

Formally: Standard language modeling loss on (Question, Passage Hint) pairs.

Purpose: Train the model to generate full passages from hints.

Formally: Standard language modeling loss on (Passage Hint, Passage Content) pairs.
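Written out, the two objectives take the usual autoregressive form. The notation below is an assumption made for illustration (q for a question, h for a passage hint, c for passage content, θ for the model parameters, D for the synthetic dataset), as is the choice to sum the loss over target tokens only; the summary does not specify either.

```latex
\mathcal{L}_{\text{hint}}(\theta)
  = -\sum_{(q,h)\in\mathcal{D}} \sum_{t=1}^{|h|} \log p_\theta\left(h_t \mid q, h_{<t}\right)
\qquad
\mathcal{L}_{\text{passage}}(\theta)
  = -\sum_{(h,c)\in\mathcal{D}} \sum_{t=1}^{|c|} \log p_\theta\left(c_t \mid h, c_{<t}\right)
```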

Training Data:

Synthetic data generated using few-shot prompting on top-retrieved Wikipedia pages for Natural Questions queries
Input: Ground-truth evidence/question pairs
Output: Synthetic Question-Hint-Passage triplets
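One way to turn such triplets into the two kinds of fine-tuning pairs is sketched below. The `Triplet` class and `to_finetuning_pairs` helper are hypothetical names, and removing duplicate hint-passage pairs is an assumption of this sketch rather than a documented step.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Triplet:
    question: str
    hint: str      # e.g. a section title plus paragraph number
    passage: str


def to_finetuning_pairs(
    triplets: Iterable[Triplet],
) -> Tuple[List[Tuple[str, str]], List[Tuple[str, str]]]:
    """Build (question, hint) pairs and de-duplicated (hint, passage) pairs."""
    triplets = list(triplets)
    question_to_hint = [(t.question, t.hint) for t in triplets]
    # Several questions can point at the same passage; keep each (hint, passage) once,
    # preserving first-seen order.
    hint_to_passage = list(dict.fromkeys((t.hint, t.passage) for t in triplets))
    return question_to_hint, hint_to_passage
```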

Key Hyperparameters:

sampling_k_paths: 20
top_k_sampling: 40 (implied, typical for top-k)
temperature: 0.7 (implied for diverse sampling)
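To make the sampling settings concrete, here is a small pure-Python sketch of top-k sampling with temperature over a token-to-logit mapping. The function name and the dictionary representation are illustrative assumptions; the defaults mirror the values listed above, which are themselves marked as implied rather than reported.

```python
import math
import random
from typing import Dict, Optional


def top_k_sample(
    logits: Dict[str, float],
    k: int = 40,
    temperature: float = 0.7,
    rng: Optional[random.Random] = None,
) -> str:
    """Sample one token from the k highest-logit candidates after temperature scaling."""
    if not logits:
        raise ValueError("logits must be non-empty")
    if k <= 0 or temperature <= 0:
        raise ValueError("k and temperature must be positive")
    rng = rng or random.Random()
    # Keep only the k most probable candidates.
    top = sorted(logits.items(), key=lambda item: item[1], reverse=True)[:k]
    max_logit = top[0][1]
    # Subtract the max before exponentiating for numerical stability.
    weights = [math.exp((logit - max_logit) / temperature) for _, logit in top]
    tokens = [token for token, _ in top]
    return rng.choices(tokens, weights=weights, k=1)[0]
```

Lower temperatures concentrate the weight on the highest-logit token, while `k` bounds how far into the tail a sample can reach.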

Compute: Not explicitly reported in the paper

Comparison to Prior Work

vs. Atlas: RECITE retrieves from the model's internal weights, whereas Atlas retrieves from an external index
vs. Chain-of-Thought: RECITE focuses on reciting factual knowledge (evidence generation), whereas CoT focuses on logical reasoning steps
vs. Standard Prompting: RECITE adds an intermediate evidence generation step

Limitations

Updating time-sensitive knowledge requires costly re-training or fine-tuning, whereas a retrieval-augmented system only needs its external index updated
Risk of hallucination in the recitation step (generating incorrect 'facts'), which then propagates to the answer
Computationally expensive due to multiple sampling paths and two-step generation process

Reproducibility

Code: https://github.com/Edward-Sun/RECITE

Model weights for UL2 and OPT are public; Codex is accessible only through an API; PaLM is proprietary. Prompts are provided in the paper's appendix.

📊 Experiments & Results

Evaluation Setup

Few-shot Closed-Book Question Answering

Benchmarks:

Natural Questions (NQ): single-hop, Wikipedia-based QA
TriviaQA: trivia QA
HotpotQA: multi-hop QA

Metrics:

Exact Match (EM)
F1 score
Statistical methodology: Reported mean and standard deviation over 5 random seeds for robustness analysis
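Both metrics are straightforward to compute. The sketch below follows the widely used SQuAD-style convention (lowercasing, stripping punctuation and English articles, token-level F1); whether the paper's evaluation script uses exactly this normalization is an assumption.

```python
import re
import string
from collections import Counter

_PUNCTUATION = set(string.punctuation)


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = "".join(ch for ch in text.lower() if ch not in _PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(gold))


def f1_score(prediction: str, gold: str) -> float:
    """Token-level F1 between the normalized prediction and gold answer."""
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(gold).split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

When a question has several acceptable gold answers, the usual practice is to take the maximum score over them; that step is omitted here for brevity.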

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Main comparison on Natural Questions (NQ) showing RECITE outperforms direct prompting across multiple models.
Natural Questions (NQ)	EM	28.98	31.34	+2.36
Natural Questions (NQ)	EM	31.45	35.84	+4.39
Results on TriviaQA showing consistent improvements, particularly for Codex.
TriviaQA	EM	81.84	83.50	+1.66
Multi-hop reasoning results on HotpotQA, comparing against Chain-of-Thought (CoT).
HotpotQA	EM	20.51	26.46	+5.95
HotpotQA	EM	23.73	26.46	+2.73
Natural Questions (NQ)	EM	31.34	33.23	+1.89

Experiment Figures

Impact of the number of self-consistency paths on TriviaQA performance for OPT-30B and UL2-20B

Main Takeaways

Recite-and-answer consistently outperforms standard direct prompting across PaLM, UL2, OPT, and Codex on closed-book QA tasks
Self-consistency (majority voting over 20 paths) is crucial; performance improves as the number of sampled recitation paths increases
Diversified recitation (using passage hints/section titles) further boosts performance by encouraging the model to recall knowledge from different 'locations' in its memory
Stronger models (like Codex) benefit more from their own recitation than from traditional BM25 retrieval (on NQ), suggesting high-quality internal memory

📚 Prerequisite Knowledge

Prerequisites

In-context learning (few-shot prompting)
Language Model pre-training objectives
Self-consistency / Majority voting strategies

Key Terms

CBQA: Closed-Book Question Answering, i.e., answering questions using only the model's internal parameters, without access to external documents

Self-consistency: A decoding strategy that samples multiple diverse outputs (reasoning paths) and selects the most consistent final answer via majority vote

Chain-of-Thought: A prompting method that encourages the model to generate intermediate reasoning steps before the final answer

Greedy decoding: A generation strategy where the model always selects the highest probability token at each step

Top-k sampling: A generation strategy that samples the next token from the top k most probable candidates, introducing diversity

Passage Hint: A structured identifier (e.g., section title + paragraph number) used to prompt the model to recite specific, diverse knowledge chunks