FLARE: Active Retrieval Augmented Generation

📝 Paper Summary

Agentic RAG pipeline, Modularized RAG pipeline

FLARE improves long-form generation by iteratively anticipating future content, using low-confidence tokens in that anticipation to trigger retrieval, and regenerating the sentence with retrieved context.

Core Problem

Standard RAG methods retrieve only once before generation (insufficient for long texts) or passively retrieve at fixed intervals (often irrelevant or unnecessary), failing to capture evolving information needs during long-form generation.

Why it matters:

Single-time retrieval fails for long-form tasks (e.g., summaries, essays) where information needs shift as the text progresses
Passive multi-time retrieval (e.g., every 16 tokens) is inefficient and often retrieves irrelevant information based on past context rather than future intent
Large Language Models (LLMs) still hallucinate facts when generating long content without access to updated or specific external knowledge

Concrete Example: When generating a summary for 'Joe Biden', a model might correctly state his birth date but hallucinate the date of his 2020 campaign announcement later in the text because the initial retrieval only covered general biography facts, not specific campaign details needed mid-generation.

Key Novelty

Forward-Looking Active REtrieval (FLARE)

Anticipates future information needs by generating a temporary 'hypothetical' next sentence without retrieval
Triggers retrieval ONLY if this hypothetical sentence contains low-confidence tokens (indicating a lack of knowledge)
Uses the hypothetical sentence itself (or questions derived from it) as the search query to fetch relevant documents before regenerating the actual sentence

Architecture

The FLARE workflow showing the iterative process of generating a temporary sentence, checking for low-confidence tokens, retrieving documents if needed, and regenerating.

Evaluation Highlights

+11.6 Exact Match (EM) points on 2WikiMultihopQA over the single-time retrieval baseline (39.4 to 51.0)
Outperforms passive multi-time retrieval (retrieving at every sentence) by 2.0 EM points on 2WikiMultihopQA while retrieving less often
Outperforms baselines across the 4 evaluated long-form tasks: multihop QA, commonsense reasoning, long-form QA, and open-domain summarization

Breakthrough Assessment

8/10

Introduces a highly intuitive 'active' paradigm that links model confidence to retrieval actions. Significantly improves long-form generation reliability without retraining the base LM.

⚙️ Technical Details

Problem Definition

Setting: Long-form text generation augmented by external knowledge retrieval

Inputs: User input x (e.g., question or topic)

Outputs: Answer y consisting of m sentences [s_1, s_2, ..., s_m]

Pipeline Flow

Initial Retrieval (based on user input x)
Iterative Generation Loop:
1. Generate temporary next sentence (hypothetical)
2. Check confidence of tokens in temporary sentence
3. If confident -> Accept sentence
4. If low confidence -> Use temporary sentence to query Retriever -> Regenerate sentence with new context
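
The loop above can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: `flare_generate`, the `generate` and `retrieve` callables, and `max_sentences` are hypothetical names introduced here, and the sketch assumes the LM returns one probability per generated token. Whether the temporary sentence should see previously retrieved documents is a design choice the summary does not settle; this sketch passes it the documents from the initial retrieval.

```python
from typing import Callable, List, Tuple

# Hypothetical interfaces (assumptions, not part of the FLARE code base):
#   generate(context, docs) -> (sentence, per-token probabilities);
#                              returns ("", []) when generation is finished
#   retrieve(query)         -> list of document strings
Generate = Callable[[str, List[str]], Tuple[str, List[float]]]
Retrieve = Callable[[str], List[str]]


def flare_generate(x: str, generate: Generate, retrieve: Retrieve,
                   theta: float, max_sentences: int = 10) -> List[str]:
    """Generate sentence by sentence, retrieving only on low confidence."""
    initial_docs = retrieve(x)  # initial retrieval based on user input x
    answer: List[str] = []
    for _ in range(max_sentences):
        context = " ".join([x] + answer)
        # Step 1: temporary (hypothetical) next sentence.
        sentence, probs = generate(context, initial_docs)
        if not sentence:
            break
        # Steps 2-4: accept if confident, otherwise retrieve and regenerate.
        if probs and min(probs) < theta:
            new_docs = retrieve(sentence)  # temporary sentence is the query
            sentence, _ = generate(context, new_docs)
        answer.append(sentence)
    return answer
```

With `theta = 0.0` the condition never fires and the loop reduces to single-time retrieval; with `theta` above 1.0 every sentence triggers retrieval, which resembles the passive every-sentence baseline.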

System Modules

Base LM

Generates temporary sentences to anticipate needs and final sentences conditioned on retrieved docs

Model or implementation: text-davinci-003

Confidence Monitor

Decides when to trigger retrieval by checking if any token probability falls below threshold theta

Model or implementation: Rule-based
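
A minimal sketch of this rule, assuming the LM API exposes per-token natural-log probabilities for the temporary sentence. The function name `should_retrieve` is hypothetical and used only for illustration.

```python
import math
from typing import Sequence


def should_retrieve(token_logprobs: Sequence[float], theta: float) -> bool:
    """Return True if any token probability falls below the threshold theta."""
    # Convert each log-probability back to a probability before comparing.
    return any(math.exp(lp) < theta for lp in token_logprobs)
```

An empty sentence never triggers retrieval under this definition, because `any` over an empty sequence is `False`.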

Query Formulator (Retrieval & Selection)

Converts the temporary sentence into a search query (e.g., by masking low-confidence tokens or generating a question)

Model or implementation: Rule-based or LLM-based (gpt-3.5-turbo)
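
The rule-based masking variant might look like the sketch below. It assumes the temporary sentence is already split into tokens with one probability each, and it simply drops every token whose probability is below the masking threshold beta. The function name `masked_query` and the whitespace join are assumptions made for illustration.

```python
from typing import Sequence


def masked_query(tokens: Sequence[str], probs: Sequence[float],
                 beta: float) -> str:
    """Build a search query from the tokens whose probability is at least beta."""
    if len(tokens) != len(probs):
        raise ValueError("tokens and probs must have the same length")
    # Low-confidence tokens are likely to be wrong, so keep them out of the query.
    kept = [tok for tok, p in zip(tokens, probs) if p >= beta]
    return " ".join(kept)
```

Setting `beta = 0.0` keeps every token, so the raw temporary sentence becomes the query.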

Retriever (Retrieval & Selection)

Fetches relevant documents based on the formulated query

Model or implementation: BM25 (Wikipedia) or Bing Search API (Open Web)
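
For readers unfamiliar with BM25, the sketch below scores pre-tokenized documents against a query using the common Okapi BM25 formula with a Lucene-style smoothed IDF. The defaults `k1 = 1.5` and `b = 0.75` are conventional values, not settings reported in this summary, and the function name `bm25_scores` is hypothetical. A production system would rely on a search engine's index rather than scoring every document in a loop.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: List[str], docs: List[List[str]],
                k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Return one BM25 score per document for a tokenized query."""
    if not docs:
        return []
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores
```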

Novel Architectural Elements

Look-ahead active retrieval logic: Generating a 'hypothetical' future sentence solely to determine retrieval intent and query content
Confidence-driven query formulation: Using the low-confidence spans within the hypothetical sentence to target specific information needs

Modeling

Base Model: text-davinci-003

Key Hyperparameters:

inference_decoding_temperature: 0.0
retrieval_threshold_theta: Different per task (e.g., 20% to 80% of sentences trigger retrieval)
masking_threshold_beta: See ablation (0.0 to 0.6 tested)
max_generation_length: Task dependent (e.g., 256 or 512 tokens)

Compute: Requires iterative calls to OpenAI API; overhead increases due to generating temporary sentences and multiple retrieval steps

Comparison to Prior Work

vs. RETRO/IC-RALM: FLARE retrieves based on *future* intent (next sentence) and *confidence*, rather than past context and fixed intervals
vs. IRCoT: FLARE actively decides *when* to retrieve based on confidence, whereas IRCoT typically retrieves at every reasoning step
vs. Self-Ask: FLARE is generic and does not require task-specific manual annotation of sub-questions in prompts

vs. ITER-RETGEN [not cited in paper]: ITER-RETGEN refines retrieval and generation iteratively for the *entire* output, while FLARE does it locally sentence-by-sentence during the generation process

Limitations

Increased inference cost and latency due to generating temporary sentences and potentially multiple retrieval calls
Relies on the calibration of the base LM (requires confidence scores to be meaningful)
Did not show significant gains on short-form tasks (Wizard of Wikipedia) or tasks where grounding is difficult (ELI5)

Reproducibility

Code: https://github.com/jzbjyb/FLARE

Code and datasets are available at https://github.com/jzbjyb/FLARE. The experiments use closed-source API models (text-davinci-003, gpt-3.5-turbo) and a commercial search engine (Bing), which may affect exact reproducibility over time.

📊 Experiments & Results

Evaluation Setup

Few-shot in-context learning evaluation on 4 knowledge-intensive datasets

Benchmarks:

2WikiMultihopQA (Multihop QA)
StrategyQA (Commonsense reasoning)
ASQA (Long-form QA with ambiguous questions)
WikiAsp (Open-domain summarization)

Metrics:

Exact Match (EM)
F1
Disambig-F1
UniEval (Factual Consistency)
ROUGE-L
Statistical methodology: Not explicitly reported in the paper
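
As a rough illustration of the first two metrics, the sketch below computes Exact Match and token-level F1 for a single prediction. The normalization (lowercasing, stripping punctuation and English articles) follows a convention common in QA evaluation and is an assumption here; this summary does not state which normalization the paper's evaluation scripts apply. All function names are hypothetical.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(pred: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(pred) == normalize(gold))


def token_f1(pred: str, gold: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    p, g = normalize(pred).split(), normalize(gold).split()
    if not p or not g:
        return float(p == g)  # both empty counts as a match
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)
```

Dataset-level scores are the mean of these per-example values, usually reported as percentages.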

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
2WikiMultihopQA	Exact Match (EM)	39.4	51.0	+11.6
2WikiMultihopQA	Exact Match (EM)	39.0	51.0	+12.0
StrategyQA	Exact Match (EM)	72.9	77.3	+4.4
WikiAsp	UniEval	52.4	53.4	+1.0
2WikiMultihopQA	Exact Match (EM)	39.0	48.8	+9.8

Experiment Figures

Performance (EM) vs. Percentage of sentences triggering retrieval on 2WikiMultihopQA and StrategyQA.

Main Takeaways

Forward-looking retrieval (using the next hypothetical sentence) significantly outperforms looking back at past context for query formulation.
Active retrieval based on confidence is more effective than retrieving at every step or every fixed window, avoiding unnecessary noise.
FLARE generalizes well across different types of long-form tasks (QA, Reasoning, Summarization) without task-specific tuning.
Masking low-confidence tokens in queries helps prevent retrieving distracting or incorrect information (better than using the raw erroneous sentence).

📚 Prerequisite Knowledge

Prerequisites

Understanding of Large Language Models (LLMs) and probability/confidence scores
Familiarity with Retrieval-Augmented Generation (RAG)
Basic knowledge of search engines (BM25, Bing API)

Key Terms

RAG: Retrieval-Augmented Generation—AI systems that answer questions by searching for documents before generating responses

FLARE: Forward-Looking Active REtrieval augmented generation—the proposed method that uses hypothetical future sentences to guide retrieval

hallucination: When a language model generates factually incorrect or nonsensical information confidently

single-time retrieval: Standard RAG setup where documents are retrieved once based on the user input before generation starts

passive retrieval: Retrieving information at fixed intervals (e.g., every k tokens) regardless of whether the model needs it

active retrieval: The system dynamically decides when to retrieve based on specific criteria (e.g., low model confidence)

chain-of-thought: A prompting technique where the model generates intermediate reasoning steps before the final answer

BM25: Best Matching 25—a standard ranking function used by search engines to estimate the relevance of documents to a query

EM: Exact Match—a metric checking if the generated answer text matches the ground truth exactly

ROUGE: Recall-Oriented Understudy for Gisting Evaluation—a set of metrics used to evaluate automatic summarization by comparing to human summaries

UniEval: A metric for evaluating text generation quality, focusing here on factual consistency

zero-shot: The model performs a task without seeing any specific training examples for that task

few-shot: The model is given a small number of examples (shots) in the prompt to understand the task