Retrieve-Plan-Generation: An Iterative Planning and Answering Framework for Knowledge-Intensive LLM Generation

📝 Paper Summary

Agentic RAG pipeline

RPG is an iterative framework that alternates between generating a high-level plan and retrieving fine-grained evidence to guide long-form text generation, preventing topic drift caused by irrelevant retrieved content.

Core Problem

Standard RAG systems often retrieve entire documents containing off-topic paragraphs, which can mislead LLMs during long generation tasks, causing the model to drift from the main topic.

Why it matters:

Lengthy retrieved documents often contain irrelevant details that distract the model, leading to factual errors and hallucinations
Single-step retrieval fails to adapt to the evolving information needs of long-form answers
Existing dynamic retrieval methods struggle to filter out irrelevant specific details within generally relevant documents

Concrete Example: When asked 'How do jellyfish function without brains?', a standard RAG model retrieves a document mentioning jellyfish lifespans and bioluminescence. The model gets distracted by these details, generating text about 'illuminating the dark' instead of focusing on the nervous system mechanisms requested by the user.

Key Novelty

Iterative Retrieve-Plan-Generate (RPG) Framework

Decomposes generation into alternating 'Plan' and 'Answer' stages: the model first predicts a specific sub-topic (Plan), then selects relevant sentences (fine-grained evidence) to generate that section
Uses a multi-task prompt tuning strategy where a single frozen LLM learns distinct 'Plan' and 'Answer' prompts simultaneously, sharing a soft prompt base but using different low-rank projections

Architecture

The RPG training and inference pipeline. Left: Training two task-specific prompts (Plan, Answer) using masked losses on the same data. Right: Inference loop alternating between Plan generation, Fine-grained Paragraph selection, and Answer generation.

Evaluation Highlights

+1.8 to +2.7 ROUGE-L improvement over Self-RAG baseline on ASQA (long-form QA)
+8.5 F1 score improvement over Self-RAG on 2WikiMultiHopQA (multi-hop reasoning)
Generalizes to short-form tasks, with a +1.0 point accuracy gain on PubHealth (72.4 to 73.4) over Self-RAG

Breakthrough Assessment

7/10

Explicit planning yields strong improvements in the coherence of long-form generation. Multi-task prompt tuning is an efficient way to add planning capabilities to a frozen LLM.

⚙️ Technical Details

Problem Definition

Setting: Knowledge-intensive text generation where user input x requires retrieving documents D to generate answer y

Inputs: User query x, Retrieved documents Dx

Outputs: Generated response y (iteratively constructed via plan-answer cycles)

Pipeline Flow

Plan Stage: Model generates a short plan (the sub-topic of the next segment) using the Plan Prompt
Retrieval/Selection: Select fine-grained sentences relevant to the Plan from retrieved documents
Answer Stage: Model generates content for the current Plan using Answer Prompt
Loop: Repeat Plan-Answer cycle until termination
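
The four steps above can be sketched as a small control loop. This is a minimal illustration, not the authors' implementation: the function names (`rpg_generate`, `plan_fn`, `select_fn`, `answer_fn`) are hypothetical, the toy stand-ins replace the prompted Llama-2-7B and the reranker, and the stopping rule (the planner returning `None`) is an assumption, since the summary only says the loop runs "until termination". The iteration cap of 3 is taken from the hyperparameter list.

```python
from typing import Callable, List, Optional

MAX_OPERATIONS = 3  # iteration limit quoted in the summary's hyperparameters


def rpg_generate(
    query: str,
    documents: List[str],
    plan_fn: Callable[[str, str], Optional[str]],
    select_fn: Callable[[str, List[str]], List[str]],
    answer_fn: Callable[[str, str, List[str], str], str],
    max_operations: int = MAX_OPERATIONS,
) -> str:
    """Alternate Plan -> Select -> Answer until the planner stops or the cap is hit."""
    segments: List[str] = []
    for _ in range(max_operations):
        context = " ".join(segments)
        plan = plan_fn(query, context)            # Plan stage
        if plan is None:                          # assumed termination signal
            break
        evidence = select_fn(plan, documents)     # fine-grained selection
        segments.append(answer_fn(query, plan, evidence, context))  # Answer stage
    return " ".join(segments)


# Toy stand-ins so the loop can be run end to end (all hypothetical).
_plans = iter(["nerve net", "sensory structures"])


def toy_plan(query: str, context: str) -> Optional[str]:
    return next(_plans, None)


def toy_select(plan: str, docs: List[str]) -> List[str]:
    plan_words = set(plan.lower().split())
    sentences = [s for d in docs for s in d.split(". ")]
    return [s for s in sentences if plan_words & set(s.lower().split())]


def toy_answer(query: str, plan: str, evidence: List[str], context: str) -> str:
    return f"[{plan}] " + " ".join(evidence)


docs = [
    "Jellyfish have a nerve net instead of a brain. "
    "Some jellyfish glow in the dark. "
    "Rhopalia are sensory structures that detect light."
]
result = rpg_generate("How do jellyfish function without brains?", docs,
                      toy_plan, toy_select, toy_answer)
```

In this toy run the bioluminescence sentence never reaches the answer stage, because it matches neither plan. That is the behaviour the jellyfish example in the Core Problem section motivates.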

System Modules

Planner

Generate a short plan (topic) for the next segment of text

Model or implementation: Llama-2-7B (Frozen) + Plan Prompt

Fine-grained Selector

Select specific sentences from coarse retrieved documents that match the generated Plan

Model or implementation: bge-reranker (Inference) / ChatGPT (Data Construction)
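
As a rough illustration of plan-conditioned sentence selection, the sketch below scores each sentence against the plan and keeps the best matches. The token-overlap scorer is a deliberately simple stand-in for bge-reranker, and the `top_k` and `min_score` parameters are hypothetical; the summary does not state how many sentences are kept or what threshold is used.

```python
import re
from typing import Callable, List


def split_sentences(text: str) -> List[str]:
    # Naive splitter: break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def overlap_score(plan: str, sentence: str) -> float:
    """Fraction of plan words that also appear in the sentence (stand-in scorer)."""
    plan_words = set(re.findall(r"\w+", plan.lower()))
    sent_words = set(re.findall(r"\w+", sentence.lower()))
    return len(plan_words & sent_words) / len(plan_words) if plan_words else 0.0


def select_evidence(
    plan: str,
    documents: List[str],
    score_fn: Callable[[str, str], float] = overlap_score,
    top_k: int = 2,          # hypothetical cutoff
    min_score: float = 0.5,  # hypothetical threshold
) -> List[str]:
    scored = [(score_fn(plan, s), s) for d in documents for s in split_sentences(d)]
    kept = [pair for pair in scored if pair[0] >= min_score]
    kept.sort(key=lambda pair: pair[0], reverse=True)  # stable: ties keep document order
    return [s for _, s in kept[:top_k]]


documents = [
    "Jellyfish coordinate movement through a diffuse nerve net. "
    "Many species are bioluminescent. "
    "The nerve net connects to sensory organs."
]
evidence = select_evidence("nerve net", documents)
```

Swapping `score_fn` for a call into a real reranker would leave the rest of the selection logic unchanged.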

Answer Generator

Generate the actual text response based on the plan and selected evidence

Model or implementation: Llama-2-7B (Frozen) + Answer Prompt

Novel Architectural Elements

Iterative Plan-Answer loop embedded directly into the generation process
Dual-prompt architecture (P_plan, P_ans) switching dynamically within a single generation session based on the current state

Modeling

Base Model: Llama-2-7B

Training Method: Multi-task Prompt Tuning (MPT)

Objective Functions:

Plan loss

Purpose: Train the plan-specific prompt to generate accurate plan tokens.

Formally: L_plan = - sum_i log P(y_i | x; Theta, P_plan), where the sum runs over plan tokens y_i and Theta denotes the frozen backbone weights

Answer loss

Purpose: Train the answer-specific prompt to generate text using evidence.

Formally: L_ans = - sum_i log P(y_i | x; Theta, P_ans), where the sum runs over answer tokens y_i
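
The two objectives differ only in which token positions contribute to the sum, which matches the "masked losses on the same data" described in the Architecture section. Below is a minimal sketch of that masking, assuming per-token log-probabilities have already been obtained from two forward passes (one per prompt). The role labels and the numbers are invented for illustration.

```python
from typing import List


def masked_nll(token_logprobs: List[float], mask: List[bool]) -> float:
    """Negative log-likelihood summed over the positions where mask is True."""
    if len(token_logprobs) != len(mask):
        raise ValueError("token_logprobs and mask must have the same length")
    return -sum(lp for lp, keep in zip(token_logprobs, mask) if keep)


# One training sequence: "x" = input/evidence, "p" = plan tokens, "a" = answer tokens.
roles = ["x", "x", "p", "p", "a", "a", "a"]
plan_mask = [r == "p" for r in roles]
ans_mask = [r == "a" for r in roles]

# Hypothetical log-probabilities from a forward pass with each prompt attached.
logprobs_with_plan_prompt = [-9.0, -9.0, -0.5, -0.25, -3.0, -3.0, -3.0]
logprobs_with_ans_prompt = [-9.0, -9.0, -4.0, -4.0, -1.0, -0.5, -0.25]

loss_plan = masked_nll(logprobs_with_plan_prompt, plan_mask)  # only plan positions count
loss_ans = masked_nll(logprobs_with_ans_prompt, ans_mask)     # only answer positions count
```

Because each loss ignores the other task's positions, the same annotated sequence can serve as a training example for both prompts.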

Adaptation: Prompt Tuning (Learnable tokens only, backbone frozen)

Trainable Parameters: Soft prompt P* and task-specific low-rank matrices W_plan, W_ans
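
One way to read "a shared soft prompt with task-specific low-rank matrices" is the usual multi-task prompt tuning decomposition, where each task prompt is the Hadamard product of the shared prompt and a low-rank matrix. The sketch below assumes a rank-1 factorization W = u v^T, which this summary does not confirm for RPG. The helper names and the tiny 2x3 shapes are illustrative only.

```python
from typing import List

Matrix = List[List[float]]


def outer(u: List[float], v: List[float]) -> Matrix:
    """Rank-1 matrix u v^T."""
    return [[ui * vj for vj in v] for ui in u]


def hadamard(a: Matrix, b: Matrix) -> Matrix:
    """Element-wise product of two equally shaped matrices."""
    return [[x * y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


# Shared soft prompt P*: 2 prompt tokens x 3 embedding dimensions (toy sizes).
p_star = [[1.0, 2.0, 3.0],
          [4.0, 5.0, 6.0]]

# Task-specific low-rank matrices (assumed rank 1).
w_plan = outer([1.0, 0.5], [1.0, 0.0, 2.0])
w_ans = outer([2.0, 1.0], [0.5, 1.0, 1.0])

# Task prompts: same base, different element-wise modulation.
p_plan = hadamard(p_star, w_plan)
p_ans = hadamard(p_star, w_ans)
```

Under this reading, each additional task costs only the two small vectors `u` and `v` rather than a full prompt matrix, which is why the shared base dominates the trainable parameter count.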

Training Data:

Reconstructed 50k dataset from Self-RAG and HotpotQA
Used ChatGPT to synthesize 'Plan' labels by summarizing answer segments
Used ChatGPT to filter 'Fine-grained evidence' relevant to plans

Key Hyperparameters:

plan_token_limit: 30 (inference)
answer_token_limit: 100 (inference)
max_operations: 3 (iterations limit)

Compute: Training on 4 Nvidia A6000 GPUs

Comparison to Prior Work

vs. Self-RAG: RPG uses explicit forward-looking planning tokens rather than backward-looking reflection/critique tokens
vs. FLARE: RPG plans *before* generation rather than reacting to low confidence *during* generation
vs. SuRe: RPG iterates plan-answer cycles for long text, whereas SuRe summarizes retrieval once [not cited in paper]
vs. RECOMP: RPG filters fine-grained evidence iteratively, while RECOMP compresses context in a single step

Limitations

Experiments limited to Llama-2-7B; larger models (13B, 70B) not tested due to resource constraints
Data construction relies on ChatGPT (distillation), which incurs API costs
Inference latency may be higher due to iterative retrieval and planning steps (though specific latency numbers are not reported)

Reproducibility

Code: https://github.com/haruhi-sudo/RPG

Code and models are publicly available at https://github.com/haruhi-sudo/RPG. The base model is Llama-2-7B. Retrieval uses Contriever-MS MARCO and BM25. Data construction relied on ChatGPT (GPT-3.5-turbo).

📊 Experiments & Results

Evaluation Setup

Zero-shot evaluation on knowledge-intensive tasks using trained prompts

Benchmarks:

ASQA (Long-form QA, ambiguous questions)
ELI5 (Long-form QA, "Explain Like I'm 5")
2WikiMultiHopQA (Multi-hop QA)
PopQA (Short-form QA)
PubHealth (Fact verification/QA)

Metrics:

ROUGE-L
MAUVE
F1 score
Accuracy
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Long-form generation results show RPG outperforming baselines in relevance and coherence.
ASQA	ROUGE-L	35.7	37.6	+1.9
ASQA	MAUVE	74.3	84.4	+10.1
ELI5	ROUGE-L	17.9	19.1	+1.2
Multi-hop and Short-form results demonstrate versatility beyond long-form tasks.
2WikiMultiHopQA	F1	25.1	33.6	+8.5
PubHealth	Accuracy	72.4	73.4	+1.0
Ablation studies confirm the necessity of planning and multi-task learning.
ASQA	ROUGE-L	32.0	37.6	+5.6
ASQA	ROUGE-L	34.1	37.6	+3.5
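
As a quick arithmetic check, the Δ column can be recomputed from the Baseline and This Paper columns. The snippet copies the numbers from the table as given; the relative-gain figures are derived values, not numbers reported in the summary.

```python
# (benchmark, metric, baseline, this_paper, reported_delta), copied from the table above.
rows = [
    ("ASQA", "ROUGE-L", 35.7, 37.6, 1.9),
    ("ASQA", "MAUVE", 74.3, 84.4, 10.1),
    ("ELI5", "ROUGE-L", 17.9, 19.1, 1.2),
    ("2WikiMultiHopQA", "F1", 25.1, 33.6, 8.5),
    ("PubHealth", "Accuracy", 72.4, 73.4, 1.0),
    ("ASQA", "ROUGE-L", 32.0, 37.6, 5.6),  # ablation row
    ("ASQA", "ROUGE-L", 34.1, 37.6, 3.5),  # ablation row
]

# Absolute differences, rounded to one decimal to absorb floating-point noise.
recomputed = [round(ours - base, 1) for _, _, base, ours, _ in rows]
mismatches = [row for row, delta in zip(rows, recomputed) if delta != row[4]]

# Relative gain over the baseline, in percent (derived, not reported in the summary).
relative_gain = [round(100 * (ours - base) / base, 1) for _, _, base, ours, _ in rows]
```

An empty `mismatches` list means every reported Δ agrees with the two score columns.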

Experiment Figures

Performance trends (PubHealth accuracy, 2WikiMultiHopQA F1, ASQA ROUGE) versus training data size (10k, 30k, 50k).

Main Takeaways

Iterative planning significantly reduces 'focus shift' in long-form generation by grounding each segment in a specific sub-topic.
Filtering retrieved documents at the fine-grained (sentence/paragraph) level based on plans is more effective than feeding full documents, as it removes irrelevant noise.
Multi-task prompt tuning is an effective strategy to imbue frozen LLMs with dual capabilities (planning and answering) without full fine-tuning.
RPG is versatile, showing gains not just in long-form generation but also in multi-hop reasoning and short-form QA.

📚 Prerequisite Knowledge

Prerequisites

Retrieval-Augmented Generation (RAG) architectures
Parameter-Efficient Fine-Tuning (PEFT) methods like Prompt Tuning
Understanding of long-form and multi-hop QA tasks

Key Terms

RAG: Retrieval-Augmented Generation. Systems that answer a question by first retrieving relevant documents and then conditioning the generated answer on them

Prompt Tuning: A parameter-efficient fine-tuning method that prepends learnable continuous vectors (soft prompts) to the input embeddings while keeping the model frozen

Multi-task Prompt Tuning: Training separate prompt vectors for different tasks (e.g., planning vs. answering) that share a common underlying representation

ROUGE-L: Evaluation metric measuring the longest common subsequence between generated and reference text
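
A minimal sketch of the longest-common-subsequence idea behind ROUGE-L follows. It uses whitespace tokenization and a plain F1 of LCS precision and recall, which is a simplification: standard ROUGE implementations add their own tokenization, optional stemming, and a weighted F-measure, so scores from this sketch will not match reported numbers exactly.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence, via row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


score = rouge_l_f1(
    "jellyfish use a nerve net",
    "jellyfish rely on a diffuse nerve net",
)
```

Here the common subsequence is "jellyfish a nerve net" (4 tokens), giving precision 4/5 and recall 4/7.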

MAUVE: A metric measuring the gap between neural text and human text distributions

Self-RAG: A baseline RAG method that uses reflection tokens to critique retrieved documents and generated content

Hadamard product: Element-wise multiplication of two matrices

Soft prompt: Learnable continuous vectors prepended to the input embeddings of a language model to condition its behavior

Contriever: A dense information retrieval model used to find relevant documents

greedy decoding: A generation strategy where the model selects the highest probability token at each step