Check Your Facts and Try Again: Improving LLMs with External Knowledge and Automated Feedback

📝 Paper Summary

Modularized RAG pipeline, Agentic RAG pipeline

LLM-Augmenter iteratively improves a fixed black-box LLM's responses by verifying them against external knowledge and generating automated feedback to revise the prompts until validation passes.

Core Problem

LLMs like ChatGPT suffer from hallucinations due to lossy knowledge encoding and cannot access up-to-date or proprietary data stored in external databases.

Why it matters:

Hallucinations in mission-critical applications cause damage and erode trust
Frequent real-world changes make fixed LLM weights quickly stale (e.g., news)
Fine-tuning massive LLMs for every domain is prohibitively expensive and privacy-invasive

Concrete Example: When asked about a 2013 LA Galaxy player transfer, a standard LLM might confidently invent a player. LLM-Augmenter retrieves the transfer table, sees the LLM's guess is unsupported, generates feedback ('no info about titles'), and forces the LLM to try again with corrected context.

Key Novelty

Plug-and-Play (PnP) Augmentation with Feedback Loop

Augments a frozen black-box LLM (ChatGPT) with external modules (Policy, Knowledge Consolidator, Utility) without fine-tuning the LLM itself
Introduces an iterative feedback loop where a Utility module critiques the LLM's candidate response against evidence, prompting the LLM to revise its answer if it hallucinates

Architecture

The LLM-Augmenter system architecture illustrating the interaction between the User/Environment and the components: Working Memory, Policy, Action Executor (Knowledge Consolidator + Prompt Engine), Utility, and the fixed LLM.

Evaluation Highlights

+10.0% F1 improvement on open-domain Wiki QA (OTT-QA) compared to closed-book ChatGPT
Reduces hallucination significantly: +32.3% improvement in human-rated 'Usefulness' on Customer Service dialogs
Policy learning via RL surpasses random baselines, reaching ~37.5 Knowledge F1 on customer service tasks

Breakthrough Assessment

8/10

Significant for being one of the first systems to combine external knowledge retrieval with an iterative verification-feedback loop for black-box LLMs like ChatGPT, explicitly addressing hallucination without fine-tuning.

⚙️ Technical Details

Problem Definition

Setting: Human-system conversation modeled as a Markov Decision Process (MDP) tuple (S, A, P, R, γ)

Inputs: User query q and dialog history h_q

Outputs: Final system response generated by the LLM after potential iterative revisions

Pipeline Flow

Policy selects action (retrieve or answer)
Knowledge Consolidator retrieves and refines evidence
Prompt Engine queries LLM with evidence
Utility Module verifies response and generates feedback
Iterative Loop: Feedback revises prompt -> LLM regenerates -> Utility verifies
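
The flow above can be sketched as a small control loop. This is a minimal illustration under stated assumptions, not the paper's implementation: `augmented_answer`, its `threshold` and `max_rounds` parameters, and the three toy stubs are all hypothetical names, and the "LLM" is a canned function so the example runs offline.

```python
from typing import Callable, List, Tuple


def augmented_answer(
    query: str,
    retrieve: Callable[[str], List[str]],
    llm: Callable[[str], str],
    utility: Callable[[str, List[str]], Tuple[float, str]],
    threshold: float = 0.5,   # hypothetical acceptance threshold
    max_rounds: int = 3,      # hypothetical cap on LLM calls
) -> Tuple[str, int]:
    """Return (response, number of LLM calls made)."""
    evidence = retrieve(query)            # Knowledge Consolidator step
    feedback = ""
    response = ""
    for round_idx in range(1, max_rounds + 1):
        # Prompt Engine step: combine evidence, feedback, and query.
        prompt = (
            f"Evidence: {' '.join(evidence)}\n"
            f"Feedback: {feedback}\n"
            f"Question: {query}"
        )
        response = llm(prompt)            # frozen black-box LLM
        score, feedback = utility(response, evidence)  # Utility step
        if score >= threshold:            # validation passed
            return response, round_idx
    return response, max_rounds           # stop after max_rounds attempts


# Toy stand-ins (assumptions) so the loop can be exercised without an API.
def toy_retrieve(query: str) -> List[str]:
    return ["The capital of France is Paris."]


def toy_llm(prompt: str) -> str:
    # Answers wrongly until the prompt carries feedback from the utility.
    return "Paris" if "not supported" in prompt else "Lyon"


def toy_utility(response: str, evidence: List[str]) -> Tuple[float, str]:
    supported = any(response in e for e in evidence)
    if supported:
        return 1.0, ""
    return 0.0, "The response is not supported by the evidence."
```

With these stubs, `augmented_answer("What is the capital of France?", toy_retrieve, toy_llm, toy_utility)` returns `("Paris", 2)`: the first candidate is rejected, and the second passes once the feedback is in the prompt.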

System Modules

Working Memory

Tracks dialog state including query, evidence, candidate responses, utility scores, and feedback

Model or implementation: Structured state tuple (q, e, o, u, f, h)
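
One way to picture the state tuple is as a small record type. The sketch below is an assumption about how the six fields might be stored (the field types and the `record` and `best` helpers are hypothetical, not taken from the paper).

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class WorkingMemory:
    q: str                                        # current user query
    e: List[str] = field(default_factory=list)    # consolidated evidence
    o: List[str] = field(default_factory=list)    # candidate responses
    u: List[float] = field(default_factory=list)  # utility score per candidate
    f: List[str] = field(default_factory=list)    # verbal feedback per candidate
    h: List[str] = field(default_factory=list)    # dialog history

    def record(self, response: str, score: float, feedback: str) -> None:
        """Store one candidate together with its score and feedback."""
        self.o.append(response)
        self.u.append(score)
        self.f.append(feedback)

    def best(self) -> Optional[str]:
        """Return the highest-scoring candidate, or None if there is none."""
        if not self.o:
            return None
        idx = max(range(len(self.o)), key=lambda i: self.u[i])
        return self.o[idx]
```

Keeping candidates, scores, and feedback as parallel lists makes it easy to hand the full revision history to a policy or prompt builder.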

Policy

Selects next action: call Knowledge Consolidator, call Prompt Engine, or send response to user

Model or implementation: T5-Base (fine-tuned via RL)
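
To make the three-way action choice concrete, here is a hand-written stand-in for the policy. It does not reproduce the trained T5-Base policy; the rules, the `threshold`, and the `max_prompts` cap are illustrative assumptions.

```python
from enum import Enum
from typing import List, Optional


class Action(Enum):
    CALL_KNOWLEDGE_CONSOLIDATOR = "retrieve"
    CALL_PROMPT_ENGINE = "prompt"
    SEND_RESPONSE = "respond"


def rule_based_policy(
    evidence: List[str],
    last_score: Optional[float],
    n_prompts: int,
    threshold: float = 0.5,  # hypothetical utility threshold
    max_prompts: int = 3,    # hypothetical cap on LLM calls
) -> Action:
    """Pick the next action from a simplified view of working memory."""
    if not evidence:
        # Nothing retrieved yet: gather evidence first.
        return Action.CALL_KNOWLEDGE_CONSOLIDATOR
    if last_score is None:
        # Evidence exists but no candidate has been generated yet.
        return Action.CALL_PROMPT_ENGINE
    if last_score >= threshold or n_prompts >= max_prompts:
        # Good enough, or the prompting budget is exhausted.
        return Action.SEND_RESPONSE
    # Candidate scored too low: prompt again with feedback.
    return Action.CALL_PROMPT_ENGINE
```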

Knowledge Consolidator

Retrieves raw evidence (Web/DB) and links/prunes it into consolidated evidence chains

Model or implementation: BM25 or DPR (retriever) + CORE (linker/chainer)
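
As a rough illustration of the retrieval step only, the sketch below scores documents with the standard Okapi BM25 formula in plain Python. The whitespace tokenizer, the `k1` and `b` defaults, and the function names are assumptions; the CORE linking and chaining stage is not sketched here.

```python
import math
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    # Simplistic whitespace tokenizer (assumption, no punctuation handling).
    return text.lower().split()


def bm25_scores(query: str, docs: List[str], k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Return one BM25 score per document; assumes docs is non-empty."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs

    # Document frequency: number of documents containing each term.
    df: Counter = Counter()
    for d in tokenized:
        df.update(set(d))

    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            # Smoothed IDF variant that stays non-negative.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

Sorting document indices by these scores gives a ranked list of raw evidence that a consolidation step could then link and prune.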

Prompt Engine

Constructs prompts combining instruction, query, evidence, and feedback

Model or implementation: Rule-based templates
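
A rule-based template of this kind might look like the following. The template wording, the default instruction, and the `build_prompt` name are hypothetical; the paper's actual templates are not reproduced here.

```python
from typing import List

# Hypothetical template: instruction, evidence, optional feedback, then query.
PROMPT_TEMPLATE = (
    "{instruction}\n\n"
    "Evidence:\n{evidence}\n\n"
    "{feedback_block}"
    "User: {query}\n"
    "Assistant:"
)


def build_prompt(
    query: str,
    evidence: List[str],
    feedback: str = "",
    instruction: str = "Answer using only the evidence below.",
) -> str:
    """Fill the template; the feedback section appears only when feedback exists."""
    evidence_text = "\n".join(f"- {e}" for e in evidence) or "- (none)"
    feedback_block = (
        f"Feedback on the previous answer:\n{feedback}\n\n" if feedback else ""
    )
    return PROMPT_TEMPLATE.format(
        instruction=instruction,
        evidence=evidence_text,
        feedback_block=feedback_block,
        query=query,
    )
```

On the first attempt the feedback section is omitted; on a revision round the same function is called again with the Utility module's feedback string.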

Utility

Scores response quality (factuality) and generates verbal feedback if score is low

Model or implementation: KF1 (score) + Template or ChatGPT (feedback generation)
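
A minimal sketch of a KF1-style scorer with template feedback follows. It assumes whitespace tokenization without punctuation stripping, a hypothetical `threshold`, and a made-up feedback sentence; the paper's exact normalization and templates may differ.

```python
from collections import Counter
from typing import Tuple


def knowledge_f1(response: str, evidence: str) -> float:
    """Token-overlap F1 between a response and the evidence text."""
    resp = response.lower().split()
    evid = evidence.lower().split()
    if not resp or not evid:
        return 0.0
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(resp) & Counter(evid)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(resp)
    recall = overlap / len(evid)
    return 2 * precision * recall / (precision + recall)


def utility(response: str, evidence: str, threshold: float = 0.3) -> Tuple[float, str]:
    """Return (score, feedback); feedback is empty when the score passes."""
    score = knowledge_f1(response, evidence)
    if score >= threshold:
        return score, ""
    return score, (
        "The response is not well supported by the evidence. "
        "Revise it using only the evidence."
    )
```

For example, the response "a b" against the evidence "a b c d" has precision 1.0 and recall 0.5, giving an F1 of 2/3.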

Novel Architectural Elements

Iterative feedback loop where a Utility module critiques the black-box LLM's output and forces regeneration via prompt revision
Separation of the 'Policy' (which decides *when* to retrieve/answer) from the LLM (which just generates text)

Modeling

Base Model: ChatGPT (frozen)

Training Method: Reinforcement Learning (REINFORCE algorithm)

Objective Functions:

Purpose: Maximize expected reward (utility of final response).

Formally: J(θ) = E[R(s, a)] optimized via gradient ascent.
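
For reference, the textbook REINFORCE estimator for an objective of this form is written out below. The notation (policy \pi_\theta, learning rate \alpha) follows the standard policy-gradient presentation and is an assumption here, not copied from the paper.

```latex
J(\theta) = \mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\big[ R(s, a) \big],
\qquad
\nabla_\theta J(\theta)
  = \mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}
    \big[ R(s, a)\, \nabla_\theta \log \pi_\theta(a \mid s) \big],
\qquad
\theta \leftarrow \theta + \alpha\, R(s, a)\, \nabla_\theta \log \pi_\theta(a \mid s)
```

The last expression is the single-sample gradient-ascent update: actions that led to a high reward have their log-probability pushed up in proportion to that reward.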

Trainable Parameters: Policy module (T5-Base) is trainable; Knowledge Consolidator and Utility can also be optimized

Training Data:

Trained on simulated user interactions derived from DSTC11 and Wiki QA datasets

Key Hyperparameters:

discount_factor_gamma: Not used (interactions are approximated as single-turn)
policy_model: T5-Base

Compute: Not reported in the paper

Comparison to Prior Work

vs. Standard RAG: Adds an iterative feedback loop where the system explicitly validates and rejects hallucinated responses
vs. WebGPT: Uses automated utility functions (KF1) rather than human preference models for immediate feedback
vs. Toolformer: Treats the policy as a separate module rather than fine-tuning the LLM to call APIs directly

Limitations

Iterative feedback requires querying the LLM multiple times, which increases latency
Relies on the availability and quality of external utility functions (e.g., KF1)
The main ChatGPT experiments use a rule-based policy because of API costs and limits; RL-based policy learning is shown on T5

Reproducibility

Code: https://aka.ms/llm-augmenter

Source code and models are publicly available at https://aka.ms/llm-augmenter. The paper relies on the ChatGPT API, which is closed-source. The Customer Service test set was unavailable, so the validation set was used for evaluation.

📊 Experiments & Results

Evaluation Setup

Task-oriented Dialog (News Chat, Customer Service) and Open-domain QA (Wiki QA)

Benchmarks:

DSTC7 (News Chat) (Information Seeking Dialog)
DSTC11 (Customer Service) (Task-oriented Dialog)
OTT-QA (Wiki QA) (Multi-hop Open-domain QA)

Metrics:

Knowledge F1 (KF1)
BLEU-4
ROUGE-1
Human Evaluation (Usefulness, Humanness)
Statistical methodology: Human evaluation differences tested for significance (p < 0.05)

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Performance on News Chat (DSTC7) shows improvements using consolidated knowledge and feedback.
DSTC7	Knowledge F1 (KF1)	26.71	36.41	+9.70
Performance on Customer Service (DSTC11) demonstrates reduced hallucination.
DSTC11	Knowledge F1 (KF1)	31.33	37.41	+6.08
DSTC11	Knowledge F1 (KF1)	34.07	37.41	+3.34
Open-domain QA (Wiki QA) results highlight the necessity of knowledge consolidation for multi-hop tasks.
Wiki QA	F1	0.59	11.80	+11.21
Wiki QA	F1	2.38	8.08	+5.70

Experiment Figures

Learning curve of the Policy (T5-Base) on the Customer Service task.

Ablation on feedback mechanisms showing KF1 vs number of promptings.

Main Takeaways

Augmenting ChatGPT with external knowledge significantly reduces hallucinations (measured by KF1 and Usefulness).
Automated feedback loops allow the model to self-correct, providing additive gains over simple retrieval-augmentation.
Consolidated evidence (linking entities/reasoning chains) is far more effective than raw retrieved passages for multi-hop QA tasks.
Human evaluation confirms that LLM-Augmenter improves groundedness (Usefulness) without degrading conversational fluency (Humanness).

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (MDPs, Policy Gradient)
Retrieval-Augmented Generation (RAG)
Prompt Engineering principles

Key Terms

LLM-Augmenter: The proposed system architecture that wraps a fixed LLM with modules for knowledge retrieval, policy decision-making, and automated feedback loops.

Knowledge F1 (KF1): A utility metric measuring the token overlap between the generated response and the ground-truth knowledge evidence.

PnP: Plug-and-Play—modules that can be added to a system without retraining the core model.

DPR: Dense Passage Retrieval—a method using dense vector embeddings to retrieve relevant documents.

CORE: Chain of Reasoning—a method (cited from Ma et al.) used here to consolidate raw evidence into structured evidence chains.

REINFORCE: A specific policy gradient algorithm in Reinforcement Learning used to optimize the policy network.
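
To make the update rule tangible, here is a single REINFORCE step for a toy softmax policy with one logit per action. This is a deliberately tiny, stateless stand-in (the function names and the learning rate are assumptions), not the T5-Base policy described in this summary.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [x / total for x in exps]


def reinforce_update(
    logits: List[float], action: int, reward: float, lr: float = 0.5
) -> List[float]:
    """Return new logits after one REINFORCE step for the sampled action."""
    probs = softmax(logits)
    # For a softmax policy, d log pi(action) / d logit_k = 1[k == action] - pi(k).
    return [
        logit + lr * reward * ((1.0 if k == action else 0.0) - probs[k])
        for k, logit in enumerate(logits)
    ]
```

A positive reward raises the probability of the sampled action and a negative reward lowers it, which is the behavior the policy-gradient update is meant to produce.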