SeeKeR: Language Models that Seek for Knowledge: Modular Search & Generation for Dialogue and Prompt Completion

📝 Paper Summary

Modularized RAG pipeline, Open-domain dialogue, Internet-augmented generation

SeeKeR decomposes open-domain generation into three sequential modular tasks (search, knowledge generation, response generation) handled by a single transformer to incorporate up-to-date internet information.

Core Problem

Large language models hallucinate facts, have outdated knowledge frozen at training time, and struggle to aggregate information from multiple retrieved documents into coherent responses.

Why it matters:

Standard LMs cannot learn new facts after training, so they cannot answer reliably about current events
Existing retrieval models (like FiD) often mix facts incorrectly or generate generic responses that ignore retrieved content
Black-box retrieval systems (like LaMDA) are not openly available for research comparison

Concrete Example: When asked about 'Beyonce 2022', a standard model might hallucinate based on old data. SeeKeR generates a search query, retrieves a news snippet about her album release, extracts that specific fact, and then generates a response incorporating it.

Key Novelty

SeeKeR (Search-engine -> Knowledge -> Response)

Decomposes the generation process into three explicit steps: generating a search query, generating/copying a knowledge span from search results, and generating the final response
Uses a single transformer model iteratively for all three modules, feeding the output of one as input to the next
Applies this modular approach to both dialogue (finding relevant facts for conversation) and prompt completion (finding up-to-date news)

Architecture

The modular SeeKeR architecture where a single transformer is invoked three times sequentially.

Evaluation Highlights

Outperforms BlenderBot 2 (3B) on consistency (78.5% vs 65.1%) and knowledge (46.5% vs 27.9%) in human evaluations
Reduces hallucinations on topical prompts compared to GPT2-XL (1.5B), improving 'True' ratings from 14% to 43%
Surpasses GPT3 (175B) on current-events prompts in topicality (15-19% vs 4%) and hallucination rate (58% vs 62%), despite being more than 100x smaller (1.5B vs 175B parameters)

Breakthrough Assessment

8/10

Strong empirical results showing that a modular search-and-generate approach allows much smaller models to outperform giant models (GPT3) on topical accuracy. The decomposition strategy effectively addresses the 'hallucination vs. outdated knowledge' trade-off.

⚙️ Technical Details

Problem Definition

Setting: Open-domain dialogue and language modeling prompt completion using internet search

Inputs: Dialogue history or text prompt context

Outputs: Search query, extracted knowledge span, and final response text

Pipeline Flow

Search Module: Generate search query from context
Search Engine: Execute query (Bing API or Mojeek) -> Get Documents
Knowledge Module: Generate/Copy relevant knowledge span from documents
Response Module: Generate final response using context + knowledge
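
The four steps above can be sketched as one function that calls a single shared generator three times. This is a minimal illustration, not the authors' implementation: the task tags, the `generate` and `search` callables, and the toy stubs are all hypothetical stand-ins for the transformer and the search engine.

```python
from typing import Callable, Dict, List

# Hypothetical task markers telling the shared model which module to act as.
SEARCH_TAG = "[search]"
KNOWLEDGE_TAG = "[knowledge]"
RESPONSE_TAG = "[response]"


def seeker_pipeline(
    context: str,
    generate: Callable[[str], str],
    search: Callable[[str], List[str]],
    top_k: int = 5,
) -> Dict[str, str]:
    """Run search -> knowledge -> response with one shared generator."""
    # 1) Search module: context -> search query
    query = generate(f"{context} {SEARCH_TAG}")
    # 2) Search engine: query -> retrieved documents (keep the first top_k)
    docs = search(query)[:top_k]
    # 3) Knowledge module: context + documents -> knowledge span
    knowledge = generate(f"{context} {' '.join(docs)} {KNOWLEDGE_TAG}")
    # 4) Response module: context + knowledge -> final response
    response = generate(f"{context} {knowledge} {RESPONSE_TAG}")
    return {"query": query, "knowledge": knowledge, "response": response}


def toy_generate(prompt: str) -> str:
    """Stub standing in for the transformer; returns fixed placeholder text."""
    if prompt.endswith(SEARCH_TAG):
        return "beyonce 2022"
    if prompt.endswith(KNOWLEDGE_TAG):
        return "placeholder knowledge span"
    return "placeholder response"


def toy_search(query: str) -> List[str]:
    """Stub standing in for a search engine call."""
    return [f"document about {query}"]
```

Calling `seeker_pipeline("Tell me about Beyonce in 2022.", toy_generate, toy_search)` returns all three intermediate outputs, which matches the Outputs listed under Problem Definition.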

System Modules

Search Module (Retrieval & Selection)

Generate a search query based on the input context

Model or implementation: Transformer (Encoder-Decoder for dialogue, Decoder-only for LM)

Knowledge Module (Retrieval & Selection)

Select/Generate the most relevant knowledge span from retrieved documents

Model or implementation: Transformer (FiD for dialogue, Concatenation for LM)

Response Module

Generate the final fluent response conditioned on context and selected knowledge

Model or implementation: Transformer (same shared weights as previous modules)
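
The Knowledge Module's two input styles (FiD for dialogue, concatenation for LM) differ mainly in how documents are presented to the model. The sketch below only builds the input strings; the separator, the ordering of documents and context, and the character-based truncation are assumptions made for illustration.

```python
from typing import List


def fid_inputs(context: str, docs: List[str]) -> List[str]:
    """FiD style: one encoder input per document, each paired with the context.

    Each string would be encoded independently; the decoder would then
    attend over all encodings jointly.
    """
    return [f"{context}\n{doc}" for doc in docs]


def concat_input(context: str, docs: List[str], max_chars: int = 2000) -> str:
    """Decoder-only style: documents and context joined into a single prompt.

    Character truncation is a stand-in for a real token budget.
    """
    joined = "\n".join(docs)[:max_chars]
    return f"{joined}\n{context}"
```

FiD keeps each document's input short at the cost of several encoder passes, while concatenation needs a single pass but must fit everything into one context window.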

Novel Architectural Elements

Iterative use of a single transformer for three distinct modular tasks (Search, Knowledge, Response)
Explicit intermediate 'Knowledge' generation step between retrieval and final response to filter irrelevant info

Modeling

Base Model: Transformer Encoder-Decoder (2.7B) for dialogue; GPT2 (Medium/Large/XL) for LM

Training Method: Multi-task supervised fine-tuning

Objective Functions:

Purpose: Minimize negative log-likelihood of target sequences.

Formally: Standard seq2seq cross-entropy loss.
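
Written out, the objective is the usual token-level negative log-likelihood summed over the multi-task training set. The notation is introduced here for illustration and is not taken from the paper: x is a module input (context, optionally with documents or a knowledge span), y is the target sequence (query, knowledge span, or response), and D is the union of the search, knowledge, and response examples.

```latex
\mathcal{L}(\theta) = - \sum_{(x, y) \in \mathcal{D}} \sum_{t=1}^{|y|} \log p_\theta\left(y_t \mid y_{<t}, x\right)
```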

Training Data:

Pre-training: pushshift.io Reddit + RoBERTa+CC100en (R2C2)
Search Task: WizInt queries, document titles from Common Crawl
Knowledge Task: WizInt/WoW gold knowledge, QA datasets (SQuAD, NQ, MS MARCO)
Response Task: WizInt/WoW responses, QA answers, dialogue datasets (PersonaChat, BST)
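
One way to picture the multi-task setup is that a single annotated dialogue turn yields one training example per module. The sketch below assumes a hypothetical record layout (`context`, `query`, `documents`, `gold_knowledge`, `response`); the datasets' real field names and input formatting may differ.

```python
from typing import Dict, List, Tuple


def make_task_examples(turn: Dict[str, str]) -> List[Tuple[str, str, str]]:
    """Turn one annotated dialogue turn into (task, input, target) triples."""
    ctx = turn["context"]
    return [
        # Search task: context -> search query
        ("search", ctx, turn["query"]),
        # Knowledge task: context + retrieved documents -> gold knowledge span
        ("knowledge", f"{ctx}\n{turn['documents']}", turn["gold_knowledge"]),
        # Response task: context + gold knowledge -> response
        ("response", f"{ctx}\n{turn['gold_knowledge']}", turn["response"]),
    ]
```

All three triples would be fed to the same model under the same sequence-level loss, which is what allows one set of weights to serve every module.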

Key Hyperparameters:

model_size_dialogue: 2.7B parameters (Encoder-Decoder)
model_size_lm: 345M, 762M, 1.5B parameters (GPT2 variants)

Compute: Not reported in the paper

Comparison to Prior Work

vs. BlenderBot 2: SeeKeR adds an explicit 'Knowledge Module' to extract relevant spans before generation, whereas BB2 generates directly from retrieved docs
vs. GPT3: SeeKeR uses modular internet search to access real-time info, whereas GPT3 relies on frozen weights
vs. K2R (Adolphs et al. 2021): SeeKeR extends K2R by adding the Search module (internet search) rather than just retrieving from a fixed corpus

Limitations

Relies on external search engines (Bing/Mojeek); quality depends on search results
Can still generate repetitive responses or ignore partner input
Inconsistent factual accuracy if retrieved documents contain errors
Evaluating 'engagingness' per-dialogue was statistically difficult compared to per-turn

Reproducibility

Code and models: http://parl.ai/projects/seeker (publicly available)

External dependencies: Bing Web Search API and Mojeek search engine

📊 Experiments & Results

Evaluation Setup

Open-domain dialogue generation and topical prompt completion

Benchmarks:

Wizard of Internet (WizInt) (Knowledge-grounded dialogue)
Topical Prompts (January 2022) (Prompt completion on current events; new benchmark introduced in this paper)

Metrics:

Perplexity (PPL)
F1
Knowledge F1 (KF1)
Human Evaluation: Consistency, Knowledgeable, Factually Correct, Engagingness
Human Evaluation (Prompts): Sensible, True, Hallucination, Topical
Statistical methodology: Independent two-sample t-test (p < 0.001) for human evaluation results
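
The automatic metrics above can be sketched in a few lines. This is a simplified version assuming lowercase whitespace tokenization; evaluation toolkits typically also strip punctuation and articles, so scores from this sketch would not match reported numbers exactly. Knowledge F1 is taken here to be the same overlap measure computed against the gold knowledge instead of the gold response.

```python
import math
from collections import Counter
from typing import Sequence


def unigram_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of unigram precision and recall between two strings."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        return 0.0
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def knowledge_f1(response: str, gold_knowledge: str) -> float:
    """KF1: unigram overlap between the response and the gold knowledge."""
    return unigram_f1(response, gold_knowledge)


def perplexity(token_nlls: Sequence[float]) -> float:
    """Perplexity as exp of the mean per-token negative log-likelihood."""
    return math.exp(sum(token_nlls) / len(token_nlls))
```

A response can score well on F1 while scoring poorly on KF1 if it matches the reference wording but ignores the retrieved knowledge, which is why the two are reported separately.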

Key Results

Human evaluation on open-domain dialogue shows SeeKeR outperforming the BlenderBot 2 (3B) baseline in consistency and knowledge while reducing factual errors. All values are percentages.
Benchmark	Metric	Baseline	This Paper	Δ
Wizard of Internet (Human Eval)	Consistent	65.06	78.47	+13.41
Wizard of Internet (Human Eval)	Factually Incorrect	4.21	3.94	-0.27
Wizard of Internet (Human Eval)	Knowledgeable	27.88	46.49	+18.61
Wizard of Internet (Human Eval)	Per-Turn Engaging	83.52	90.41	+6.89

Comparison on Topical Prompts (Jan 2022) showing ability to handle new information compared to frozen LMs. All values are percentages; the baseline model is named in each row.
Benchmark	Metric	Baseline	This Paper	Δ
Topical Prompts	True	14 (GPT2-XL)	43	+29
Topical Prompts	Hallucination	73 (GPT2-XL)	58	-15
Topical Prompts	Hallucination	62 (GPT3)	58	-4
Topical Prompts	Topical	4 (GPT3)	15	+11

Main Takeaways

Modular decomposition (Search -> Knowledge -> Response) significantly reduces hallucinations compared to standard LMs.
SeeKeR enables much smaller models (1.5B) to outperform massive models (GPT3 175B) on tasks requiring up-to-date knowledge (topicality and factuality).
Separate knowledge generation step forces the model to ground responses in retrieved evidence, improving consistency.
Adding 'January 2022' to search queries in ablation studies further improved topicality, highlighting the importance of search query quality.
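
The date-hint ablation in the last takeaway amounts to a small rewrite of the generated query before it reaches the search engine. The helper below is a hypothetical illustration of that idea, not the authors' code.

```python
def add_date_hint(query: str, hint: str = "January 2022") -> str:
    """Append a date hint to a search query unless it is already present."""
    query = query.strip()
    if hint.lower() in query.lower():
        return query
    return f"{query} {hint}"
```

The check for an existing hint avoids duplicating the date when the Search Module has already produced it.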

📚 Prerequisite Knowledge

Prerequisites

Transformer architectures (Encoder-Decoder and Decoder-only)
Retrieval-Augmented Generation (RAG)
Fusion-in-Decoder (FiD)

Key Terms

SeeKeR: Search-engine -> Knowledge -> Response; the proposed modular architecture

FiD: Fusion-in-Decoder; a method where retrieved documents are encoded independently and then attended to jointly by the decoder

BlenderBot 2: A state-of-the-art internet-augmented dialogue model used as a baseline

Common Crawl: A massive open repository of web crawl data used for training and retrieval

Hallucination: When a language model generates factually incorrect or fabricated information

WizInt: Wizard of Internet; a dataset containing dialogues grounded in internet search results