FreshLLMs: Refreshing LLMs with search engine augmentation

📝 Paper Summary

Modularized RAG pipeline
Factuality and hallucination

The paper introduces a dynamic QA benchmark for evaluating LLM factuality on changing world knowledge and proposes a few-shot prompting method that incorporates search engine results to significantly improve accuracy.

Core Problem

Most Large Language Models (LLMs) are trained once and lack the ability to adapt to fast-changing world knowledge, leading to hallucinations or outdated answers.

Why it matters:

Models like ChatGPT and GPT-4 often hallucinate plausible but incorrect information, reducing trustworthiness in settings requiring up-to-date accuracy
Retraining models to update knowledge is not easily scalable for real-time information (e.g., stock prices)
Existing benchmarks do not adequately test dynamic, fast-changing knowledge or the ability to debunk false premises

Concrete Example: When asked 'Which game won the Spiel des Jahres award most recently?', a model trained in 2021 might answer 'MicroMacro: Crime City' (the 2021 winner) instead of the current winner, or refuse to answer due to a knowledge cutoff.

Key Novelty

FreshQA Benchmark and FreshPrompt Method

Creates FreshQA, a dynamic QA benchmark whose questions are categorized by how frequently their answers change (never, slow, fast) and include false-premise questions; its ground-truth answers require regular updates
Develops FreshPrompt, a few-shot prompting strategy that integrates diverse search engine evidence (organic results, answer boxes, related questions) into the prompt to ground LLM reasoning
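
A minimal sketch of how a FreshQA-style item could be represented, and how entries due for an answer refresh might be found. The class, field names, and the `stale_items` helper are hypothetical illustrations, not the benchmark's actual schema; modeling false premise as a boolean flag is also an assumption.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class ChangeRate(Enum):
    NEVER = "never-changing"
    SLOW = "slow-changing"
    FAST = "fast-changing"


@dataclass
class FreshQAItem:
    # Hypothetical fields; the real dataset layout may differ.
    question: str
    change_rate: ChangeRate
    false_premise: bool
    answer: str
    answer_checked_on: str  # ISO date (YYYY-MM-DD) of the last manual review


def stale_items(items: List[FreshQAItem], checked_before: str) -> List[FreshQAItem]:
    """Return items whose answers can change and were last reviewed before the cutoff."""
    return [
        item
        for item in items
        if item.change_rate is not ChangeRate.NEVER
        # ISO dates compare correctly as plain strings.
        and item.answer_checked_on < checked_before
    ]
```

Never-changing items are skipped because their answers do not need periodic review; everything else is flagged once its last check is older than the chosen cutoff.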

Architecture

The format of FreshPrompt, illustrating how search results are structured and fed into the LLM.

Evaluation Highlights

GPT-4 with FreshPrompt improves accuracy by 49.0 percentage points (absolute) over vanilla GPT-4 under STRICT evaluation on FreshQA
FreshPrompt outperforms competing search-augmented methods like Self-Ask (+33.7% accuracy) and Perplexity.ai (+38.7% accuracy) under STRICT evaluation with GPT-4
Increasing the number of retrieved evidences from 1 to 15 improves FreshPrompt accuracy by 16.2 percentage points under STRICT evaluation

Breakthrough Assessment

8/10

Significant contribution in benchmarking dynamic knowledge (a major LLM weakness) and providing a strong baseline method that outperforms commercial systems like Perplexity.ai at the time of publication.

⚙️ Technical Details

Problem Definition

Setting: Open-domain Question Answering requiring up-to-date world knowledge

Inputs: Natural language question q

Outputs: Factually correct and up-to-date answer Ans_predict

Pipeline Flow

Search Query Generation (uses verbatim question)
Evidence Retrieval (Google Search)
Evidence Formatting & Sorting
Few-Shot Prompt Construction
LLM Inference
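
The five steps above can be sketched as a single function. This is an illustrative outline, not the authors' implementation: `search` and `llm` are caller-supplied callables, and the evidence record shape (`source`, `date`, `snippet`), the prompt layout, and the `max_evidences` default are assumptions.

```python
from typing import Callable, Dict, List

Evidence = Dict[str, str]  # assumed keys: "source", "date" (ISO), "snippet"


def build_prompt(question: str, evidences: List[Evidence], demos: List[str]) -> str:
    """Step 4: few-shot demonstrations, then evidence blocks, then the question."""
    evidence_blocks = [
        f"source: {e['source']}\ndate: {e['date']}\nsnippet: {e['snippet']}"
        for e in evidences
    ]
    parts = demos + evidence_blocks + [f"question: {question}", "answer:"]
    return "\n\n".join(parts)


def answer_question(
    question: str,
    search: Callable[[str], List[Evidence]],
    llm: Callable[[str], str],
    demos: List[str],
    max_evidences: int = 15,
) -> str:
    query = question                                         # 1. verbatim question as query
    evidences = search(query)[:max_evidences]                # 2. evidence retrieval
    evidences = sorted(evidences, key=lambda e: e["date"])   # 3. oldest to newest
    prompt = build_prompt(question, evidences, demos)        # 4. prompt construction
    return llm(prompt)                                       # 5. single inference call
```

Passing the search engine and the model in as callables keeps the sketch testable offline; any real client would be wrapped to match these signatures.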

System Modules

Search Engine

Retrieve up-to-date information for the input question

Model or implementation: Google Search API (SerpApi)

Prompt Builder

Format retrieved evidences and combine with few-shot demonstrations

Model or implementation: Rule-based formatting

Answer Generator

Reason over evidences to generate the final answer

Model or implementation: GPT-3.5 or GPT-4

Novel Architectural Elements

Integration of diverse search features (organic, answer box, related questions, knowledge graph) into a unified textual format for the prompt
Chronological sorting of retrieved evidences in the prompt (oldest to newest) to bias the model towards recent information
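
As a rough illustration of these two elements, the sketch below flattens several search features into uniform records and orders them oldest to newest. The input dictionary keys and per-item fields are assumed for the example and may not match what a given search API returns; placing undated records first is likewise a hypothetical choice.

```python
from typing import Any, Dict, List


def unify_evidences(results: Dict[str, Any]) -> List[Dict[str, str]]:
    """Flatten different search features into one chronologically sorted list."""
    unified: List[Dict[str, str]] = []
    for item in results.get("organic_results", []):
        unified.append(
            {"kind": "organic", "date": item.get("date", ""), "text": item.get("snippet", "")}
        )
    for item in results.get("related_questions", []):
        text = f"{item.get('question', '')} {item.get('snippet', '')}".strip()
        unified.append({"kind": "related_question", "date": item.get("date", ""), "text": text})
    for key in ("answer_box", "knowledge_graph"):
        box = results.get(key)
        if box:
            unified.append({"kind": key, "date": box.get("date", ""), "text": box.get("snippet", "")})
    # ISO dates sort correctly as strings. Oldest first, so the newest evidence
    # ends up closest to the question; undated records (empty date) come first.
    return sorted(unified, key=lambda e: e["date"])


def render(evidences: List[Dict[str, str]]) -> str:
    """Render every record in the same textual layout for the prompt."""
    return "\n\n".join(
        f"[{e['kind']}] date: {e['date'] or 'unknown'}\n{e['text']}" for e in evidences
    )
```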

Modeling

Base Model: GPT-4 (primary), GPT-3.5, PaLM, PaLM-2 (PaLM-Chilla), Flan-PaLM, T5, Codex

Comparison to Prior Work

vs. Self-Ask: FreshPrompt uses a single pass with rich evidence context rather than multi-step decomposition
vs. Perplexity.ai: FreshPrompt (with GPT-4) outperforms Perplexity's concise mode on both accuracy and hallucination reduction
vs. Lazaridou et al. (2022) [not cited in paper]: FreshPrompt uses a single inference call with aggregated context, whereas Lazaridou et al. perform 50 inference calls for reranking
vs. RealTime QA [not cited in paper]: FreshQA uses open-ended questions rather than multiple-choice, allowing for generative evaluation

Limitations

Requires regular manual updates to the benchmark ground-truth answers to remain valid
Relies on Google Search; performance may vary with other search engines lacking specific features (e.g., answer boxes)
Single search query limitation; does not perform query decomposition or multiple searches
Performance evaluation relies on human judgment or LLM-based evaluation (FreshEval), which can be costly or imperfect

Reproducibility

Code: https://github.com/freshllms/freshqa

📊 Experiments & Results

Evaluation Setup

Generative QA evaluated against current ground-truth answers

Benchmarks:

FreshQA (Dynamic Open-Domain QA) [New]

Metrics:

STRICT Accuracy (correct answer, no hallucination)
RELAXED Accuracy (correct main answer, allows minor hallucination)
Statistical methodology: Human evaluation (50K+ judgments) with inter-rater agreement checks; automatic evaluation via FreshEval (LLM-as-a-judge)
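
The difference between the two metrics can be shown with a small scoring sketch. It assumes each response has already been judged on two boolean criteria; the `Judgment` class and its field names are hypothetical, and treating every hallucination as tolerable under RELAXED is a simplification of "minor hallucination".

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Judgment:
    main_answer_correct: bool
    has_hallucination: bool  # also covers outdated details in the response


def relaxed_accuracy(judgments: List[Judgment]) -> float:
    """Percent of responses whose main answer is correct."""
    return 100.0 * sum(j.main_answer_correct for j in judgments) / len(judgments)


def strict_accuracy(judgments: List[Judgment]) -> float:
    """Percent of responses that are correct and free of hallucination."""
    credited = sum(j.main_answer_correct and not j.has_hallucination for j in judgments)
    return 100.0 * credited / len(judgments)
```

Because every response credited under STRICT is also credited under RELAXED, the STRICT score can never exceed the RELAXED score on the same set of judgments.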

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Main comparison of FreshPrompt against baselines and other search-augmented methods on the full FreshQA dataset.
FreshQA	STRICT Accuracy	28.6	77.6	+49.0
FreshQA	STRICT Accuracy	47.8	77.6	+29.8
FreshQA	STRICT Accuracy	52.2	77.6	+25.4
Analysis of performance on specific question types, highlighting the difficulty of fast-changing information.
FreshQA (Fast-changing)	STRICT Accuracy	26.9	77.1	+50.2
FreshQA (False-premise)	STRICT Accuracy	64.3	94.4	+30.1
Ablation studies on the number of evidences and their ordering in the prompt.
FreshQA	STRICT Accuracy	61.4	77.6	+16.2
FreshQA	STRICT Accuracy	72.4	74.8	+2.4

Experiment Figures

Accuracy of various LLMs (T5, PaLM, GPT families) on FreshQA under Relaxed vs. Strict evaluation.

Main Takeaways

All non-augmented LLMs struggle significantly on fast-changing and false-premise questions, showing flat scaling curves (size doesn't solve freshness).
Strict evaluation reveals high rates of hallucination in standard models; search augmentation with FreshPrompt drastically reduces the gap between Strict and Relaxed accuracy.
The number of retrieved evidences is a key performance driver; GPT-4 effectively aggregates information from up to 15 search results.
Explicitly instructing models to check for false premises ('Please check if the question contains a valid premise') boosts accuracy on false-premise questions but can degrade performance on valid ones for weaker models.
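
The premise-check instruction from the last takeaway could be toggled per run as in the sketch below. Only the instruction text comes from the summary; the function name and the position of the instruction within the prompt are assumptions.

```python
PREMISE_CHECK = "Please check if the question contains a valid premise"


def make_question_block(question: str, check_premise: bool) -> str:
    """Build the final part of the prompt, optionally with the premise check."""
    lines = [f"question: {question}"]
    if check_premise:
        # Assumed placement: directly after the question, before the answer slot.
        lines.append(PREMISE_CHECK + ".")
    lines.append("answer:")
    return "\n".join(lines)
```

Keeping the instruction behind a flag makes it easy to compare accuracy with and without it on valid-premise and false-premise questions separately.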

📚 Prerequisite Knowledge

Prerequisites

In-context learning / Few-shot prompting
Retrieval-Augmented Generation (RAG)
Basic understanding of search engine result structures (snippets, knowledge graph, etc.)

Key Terms

FreshQA: A novel dynamic QA benchmark with 600 questions divided into categories based on answer stability (never-changing, slow-changing, fast-changing, false-premise)

FreshPrompt: A few-shot prompting method that injects search engine results (snippets, answer boxes, related questions) into the LLM context to improve factuality

STRICT evaluation: An evaluation mode where a response is only credited if the main answer is correct AND it contains zero hallucinations or outdated information

RELAXED evaluation: An evaluation mode where a response is credited if the primary answer is correct, even if it contains minor hallucinations or outdated details

hallucination: Plausible but factually incorrect information generated by an LLM

CoT: Chain-of-Thought, a prompting technique in which the model is encouraged to generate intermediate reasoning steps before the final answer

Self-Ask: A prompting method that teaches an LLM to decompose questions into sub-questions and answer them using search results

knowledge cutoff: The date up to which an LLM's training data extends; the model generally lacks knowledge of events after this date

organic results: The standard list of web page links and snippets returned by a search engine, excluding special features like ads or answer boxes