Bootstrapping Large Language Models with Outside Knowledge for Knowledge-based Visual Question Answering

📝 Paper Summary

Knowledge-Based Visual Question Answering (KB-VQA)
Retrieval-Augmented Generation (RAG)

BootLM treats external knowledge as a latent variable, iteratively refining a Large Language Model using its own generated rough answers to guide retrieval and subsequent answer generation.

Core Problem

Large Models (LMs) often fail on KB-VQA tasks requiring fine-grained domain knowledge because their internal knowledge is too coarse, yet standard RAG pipelines struggle to align retrieved knowledge with the LM's reasoning.

Why it matters:

LMs hallucinate when lacking sufficient evidence for fine-grained questions (e.g., identifying specific events from visual cues)
Existing pipelines treat retrieval as a fixed pre-processing step, failing to adapt the retriever based on the LM's evolving understanding
Directly prompting LMs often misses domain-specific details (like specific animal breeds or local events) required for correct reasoning

Concrete Example: Question: 'What kind of event would these animals be at?' (Image shows sheep with yellow tags). The base LM guesses 'zoo'. BootLM first generates a rough answer ('sheep show'), uses it to retrieve 'State Fair of Texas', and then generates the correct specific answer 'fair'.

Key Novelty

Bootstrapping LMs via Variational EM

Formulates KB-VQA as a probabilistic model where the 'rough answer' and 'retrieved knowledge' are latent variables
Uses the LM itself to approximate the posterior (generate rough answers) in the E-step, rather than a separate model
Updates the LM in the M-step using a likelihood objective derived from its own retrieved evidence, creating a self-reinforcing cycle
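
The latent-variable formulation above can be written as an evidence lower bound. This is a sketch under assumptions, not the paper's exact derivation: x = (v, q) is the image-question pair, z the rough answer, d the retrieved knowledge, y the answer, and q(z, d | x, y) the variational posterior. The factorization of the joint and the fixed knowledge prior p(d | z) are assumed here for illustration, and conditioning of the answer term on x is left implicit, following the notation used in this summary.

```latex
\begin{align}
\log p_\theta(y \mid x)
  &= \log \sum_{z,\,d} p_\theta(z \mid x)\, p(d \mid z)\, p_\theta(y \mid d, z) \\
  &\ge \mathbb{E}_{q(z,d \mid x,y)}\!\left[ \log p_\theta(z \mid x) + \log p_\theta(y \mid d, z) \right]
     + \mathbb{E}_{q(z,d \mid x,y)}\!\left[ \log p(d \mid z) - \log q(z,d \mid x,y) \right]
\end{align}
```

Under these assumptions, the second expectation does not depend on the LM parameters, so the M-step only needs to maximize the first one, while the E-step is responsible for making q a good approximation of the true posterior.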

Evaluation Highlights

Achieves 62.00 VQA score on OK-VQA, outperforming standard fine-tuned MiniGPT-v2 (57.82) by 4.18 points
Surpasses much larger models such as Flamingo (80B) and PromptCap with GPT-3 (175B) while using a base model with fewer than 10B parameters
Retrieval module reaches 89.12% PRRecall@5 when using rough answers and ROI features, outperforming standard DPR and VRR baselines

Breakthrough Assessment

7/10

Strong methodological contribution applying Variational EM to RAG-VQA. Achieves SOTA-competitive results with smaller models, effectively bridging implicit and explicit knowledge.

⚙️ Technical Details

Problem Definition

Setting: Knowledge-based Visual Question Answering (KB-VQA) where external knowledge is required

Inputs: Image v and Question text q

Outputs: Answer y

Pipeline Flow

Rough Answer Generation (LM generates caption/guess)
Knowledge Retrieval (Neural retriever fetches documents based on rough answer)
Answer Refinement (LM generates final answer using retrieved docs)
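
The three stages above can be sketched as a single function. This is a minimal illustration, not the authors' code: `generate` and `retrieve` are hypothetical callables standing in for the LM and the neural retriever, and the toy stand-ins below only mimic the sheep example with keyword matching.

```python
import re
from typing import Callable, List


def answer_with_retrieval(
    image: str,
    question: str,
    generate: Callable[[str, str, List[str]], str],
    retrieve: Callable[[str, int], List[str]],
    top_k: int = 5,
) -> str:
    """Rough answer -> knowledge retrieval -> refined answer."""
    # Stage 1: the LM guesses without any external knowledge.
    rough = generate(image, question, [])
    # Stage 2: the question plus the rough answer forms the retrieval query.
    docs = retrieve(f"{question} {rough}", top_k)
    # Stage 3: the same LM answers again, now conditioned on the documents.
    return generate(image, question, docs)


# Toy stand-ins (hypothetical), only to make the sketch runnable.
def toy_generate(image: str, question: str, docs: List[str]) -> str:
    if any("State Fair" in d for d in docs):
        return "fair"
    return "sheep show"


def toy_retrieve(query: str, k: int) -> List[str]:
    corpus = [
        "Zoos keep exotic animals.",
        "State Fair of Texas hosts livestock and sheep show events.",
    ]
    query_tokens = set(re.findall(r"\w+", query.lower()))
    # Score each passage by the number of tokens it shares with the query.
    scored = [
        (len(query_tokens & set(re.findall(r"\w+", d.lower()))), d) for d in corpus
    ]
    ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: -s[0])
    return [d for _, d in ranked[:k]]


final_answer = answer_with_retrieval(
    "sheep.jpg",
    "What kind of event would these animals be at?",
    toy_generate,
    toy_retrieve,
)
```

Passing the same `generate` callable to stages 1 and 3 mirrors the shared-weights design described in this summary.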

System Modules

Rough Answer Generator (posterior approximation, E-step)

Generate intermediate representations (summary, description, direct answer) to serve as retrieval queries

Model or implementation: MiniGPT-v2 (shared weights)

Neural Retriever (posterior approximation, E-step)

Retrieve relevant knowledge passages from external KB

Model or implementation: ColBERTv2 (Late Interaction)
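
Late interaction scores a passage by matching every query token embedding against its best-matching passage token embedding and summing the results (often called MaxSim). The sketch below shows only that scoring rule on toy vectors. It assumes embeddings are already computed and unit-normalized, and it omits the embedding compression and indexing that ColBERTv2 adds on top.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vecs: List[Vector], doc_vecs: List[Vector]) -> float:
    """Sum, over query tokens, of the best similarity to any document token."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


def rank_documents(query_vecs: List[Vector], docs: List[List[Vector]]) -> List[int]:
    """Return document indices sorted from highest to lowest MaxSim score."""
    scores = [maxsim_score(query_vecs, d) for d in docs]
    return sorted(range(len(docs)), key=lambda i: -scores[i])


# Toy 2-d token embeddings (hypothetical values).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0]]  # covers both query tokens
doc_b = [[1.0, 0.0], [1.0, 0.0]]  # covers only the first query token
order = rank_documents(query, [doc_b, doc_a])
```

Because each query token is matched separately, a passage is rewarded for covering all parts of the query rather than for being close to one averaged vector.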

Final Answer Generator

Generate the final answer conditioning on the image, question, and retrieved knowledge

Model or implementation: MiniGPT-v2 (shared weights)

Novel Architectural Elements

Unified Generation Module: The same LM parameters generate the latent 'rough answer' and the final 'refined answer'
Latent Variable Formulation: Explicitly models external knowledge and rough guesses as latent variables within a Variational EM framework

Modeling

Base Model: MiniGPT-v2 (based on Llama-2-7B-Chat)

Training Method: Variational Expectation-Maximization (EM)

Objective Functions:

Purpose: Optimize the neural retriever in the E-step.

Formally: Contrastive loss using pseudo-labels (documents containing ground truth answers) to maximize p_phi(d|z,y).

Purpose: Optimize the LM generation in the M-step.

Formally: Cross-entropy loss maximizing the expected log-likelihood E_q[log p(z|x) + log p(y|d,z)], approximated via Monte Carlo sampling.
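
To make the retriever objective concrete, here is a minimal sketch of pseudo-labeling and a softmax-based contrastive loss. It assumes one positive passage per query and treats all other scored passages as negatives. The exact loss form, negative sampling, and string normalization used in the paper are not given in this summary, so the function names and details below are illustrative only.

```python
import math
from typing import List


def pseudo_label(passage: str, answers: List[str]) -> bool:
    """Mark a passage as positive if it contains any ground-truth answer string."""
    text = passage.lower()
    return any(a.lower() in text for a in answers)


def contrastive_loss(scores: List[float], positive_index: int) -> float:
    """Negative log-softmax of the positive passage's score (log-sum-exp for stability)."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - scores[positive_index]


passages = [
    "The State Fair of Texas is an annual fair held in Dallas.",
    "Zoos keep exotic animals.",
]
labels = [pseudo_label(p, ["fair"]) for p in passages]

# The loss shrinks as the retriever scores the positive passage higher.
loss_untrained = contrastive_loss([0.0, 0.0], positive_index=0)
loss_trained = contrastive_loss([5.0, 0.0], positive_index=0)
```

The substring test in `pseudo_label` also illustrates the limitation noted under Limitations: a passage can contain the answer string without being contextually relevant, and a relevant passage can lack it.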

Adaptation: LoRA (rank=64, alpha=16)

Trainable Parameters: Linear projection layer and LoRA parameters of Llama-2
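
As a reminder of what the LoRA settings mean, the sketch below applies the standard LoRA update W + (alpha / rank) * B A to a toy weight matrix in pure Python. With rank=64 and alpha=16 as listed above, the scaling factor is 0.25. The tiny matrices and helper names are hypothetical; real training would use a library implementation on the Llama-2 layers.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (n x k) and b (k x m)."""
    return [
        [sum(a[i][t] * b[t][j] for t in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def lora_scale(rank: int, alpha: float) -> float:
    """Standard LoRA scaling factor alpha / rank."""
    return alpha / rank


def lora_effective_weight(w: Matrix, b: Matrix, a: Matrix, alpha: float) -> Matrix:
    """Frozen weight w plus the scaled low-rank update b @ a."""
    rank = len(a)  # a is (rank x in_features), b is (out_features x rank)
    scale = lora_scale(rank, alpha)
    delta = matmul(b, a)
    return [
        [w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
        for i in range(len(w))
    ]


# Toy example with rank 1 (hypothetical values); only a and b would be trained.
w = [[1.0, 0.0], [0.0, 1.0]]
b = [[1.0], [2.0]]
a = [[3.0, 4.0]]
w_eff = lora_effective_weight(w, b, a, alpha=0.25)
```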

Key Hyperparameters:

learning_rate: 1e-5
batch_size: 10
image_resolution: 448x448
weight_decay: 0.05
warmup_learning_rate: 1e-6
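
The two learning-rate values suggest a warmup from 1e-6 up to 1e-5. A minimal sketch of a linear warmup follows. The number of warmup steps is not given in this summary, so `warmup_steps` is a hypothetical parameter, and the rate is simply held constant after warmup because the decay schedule is also not stated here.

```python
def warmup_lr(
    step: int,
    warmup_steps: int,
    warmup_start_lr: float = 1e-6,
    peak_lr: float = 1e-5,
) -> float:
    """Linearly increase the learning rate from warmup_start_lr to peak_lr."""
    if step >= warmup_steps:
        # Assumption: hold the peak rate after warmup (decay not modeled).
        return peak_lr
    fraction = step / warmup_steps
    return warmup_start_lr + fraction * (peak_lr - warmup_start_lr)


# Learning rate at a few steps of a hypothetical 100-step warmup.
schedule = [warmup_lr(s, warmup_steps=100) for s in (0, 50, 100, 200)]
```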

Compute: 4 Nvidia A100 GPUs; Indexing ~2 hours; E-step ~2 GPU hours; M-step ~4 GPU hours

Comparison to Prior Work

vs. RA-VQA-v2: BootLM refines the retriever and generator jointly via EM, whereas RA-VQA-v2 uses a fixed pipeline. BootLM also uses a smaller base model (<10B parameters) than the BLIP2-based configurations of RA-VQA-v2.
vs. Prophet: BootLM integrates explicit retrieval rather than relying solely on prompting implicit knowledge.
vs. PaLM-E: BootLM trails by about 4 points (62.0 vs 66.1) while using <2% of the parameters (<10B vs 562B).

Limitations

Pseudo-label heuristic in E-step may overemphasize passages containing keywords while missing contextually relevant ones
Struggles with multi-hop logical reasoning where retrieved knowledge must be chained
Performance gap still exists compared to massive models (562B parameters)

Reproducibility

Code: https://github.com/Tsinghua-MI/BootLM

Uses public datasets (OK-VQA, FVQA) and a public base model (MiniGPT-v2). The external KB is the Google Search Corpus. The code URL listed above is not explicitly provided in the paper text.

📊 Experiments & Results

Evaluation Setup

Open-ended KB-VQA on standard benchmarks using external knowledge bases

Benchmarks:

OK-VQA (Knowledge-based VQA)
FVQA (Fact-based VQA)

Metrics:

VQA Score (soft accuracy)
PRRecall@K (Pseudo-Relevance Recall)
Statistical methodology: Not explicitly reported in the paper
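
The two metrics above can be sketched in a few lines. The VQA score below uses the common simplified form min(matches / 3, 1) over the human answers; official evaluation scripts additionally normalize answer strings and average over annotator subsets, which is omitted here. The PRRecall@K function follows the description in this summary (a hit if any of the top-K passages contains a ground-truth answer string) and is computed per question; the dataset-level figure would be the mean over questions.

```python
from typing import List


def vqa_soft_accuracy(prediction: str, human_answers: List[str]) -> float:
    """Simplified VQA score: full credit once three annotators agree with the prediction."""
    pred = prediction.strip().lower()
    matches = sum(1 for a in human_answers if a.strip().lower() == pred)
    return min(matches / 3.0, 1.0)


def pr_recall_at_k(retrieved: List[str], answers: List[str], k: int) -> float:
    """1.0 if any of the top-k passages contains a ground-truth answer string, else 0.0."""
    top_k = [doc.lower() for doc in retrieved[:k]]
    hit = any(ans.lower() in doc for doc in top_k for ans in answers)
    return 1.0 if hit else 0.0


# Toy annotations and passages (hypothetical).
humans = ["fair"] * 2 + ["zoo"] * 8
partial_credit = vqa_soft_accuracy("fair", humans)
full_credit = vqa_soft_accuracy("zoo", humans)

ranked = ["Zoos keep exotic animals.", "State Fair of Texas livestock show."]
hit_at_1 = pr_recall_at_k(ranked, ["fair"], k=1)
hit_at_2 = pr_recall_at_k(ranked, ["fair"], k=2)
```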

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Main results on OK-VQA: BootLM improves the base model by about 4.2 points and outperforms or closely matches the listed baselines (deltas range from +4.20 to -0.08).
OK-VQA	VQA Score	57.82	62.00	+4.18
OK-VQA	VQA Score	57.80	62.00	+4.20
OK-VQA	VQA Score	62.08	62.00	-0.08
Retrieval performance analysis showing the benefit of multi-vector queries.
Google Search Corpus (OK-VQA)	PRRecall@5	83.43	89.12	+5.69
Generalization to fact-based VQA.
FVQA	Top-1 Accuracy	52.3	63.9	+11.6

Main Takeaways

Bootstrapping mechanism works: Iterative EM training improves performance over 3-4 iterations (58.0 -> 61.9 avg score)
Rough answers are effective queries: Using the LM's own generated summaries and guesses as retrieval queries is superior to standard captions
Latent variable formulation effectively bridges the gap between implicit LM knowledge and explicit external KB knowledge
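
The alternation behind the first takeaway can be sketched as a simple loop. This is a structural illustration only: `e_step`, `m_step`, and `evaluate` are hypothetical callables, and the toy stand-ins track a counter rather than any real model, so the values in `trace` are not results.

```python
from typing import Callable, List


def run_bootstrapping(
    num_iterations: int,
    e_step: Callable[[], List[str]],
    m_step: Callable[[List[str]], None],
    evaluate: Callable[[], float],
) -> List[float]:
    """Alternate E-steps and M-steps, recording a validation score after each round."""
    scores = []
    for _ in range(num_iterations):
        # E-step: generate rough answers with the current LM, update the
        # retriever, and collect the retrieved evidence.
        evidence = e_step()
        # M-step: update the LM on the evidence gathered in the E-step.
        m_step(evidence)
        scores.append(evaluate())
    return scores


# Toy stand-ins (hypothetical): the "model" is just a counter of updates.
state = {"updates": 0}


def toy_e_step() -> List[str]:
    return [f"passage retrieved after {state['updates']} LM updates"]


def toy_m_step(evidence: List[str]) -> None:
    state["updates"] += len(evidence)


def toy_evaluate() -> float:
    return float(state["updates"])


trace = run_bootstrapping(3, toy_e_step, toy_m_step, toy_evaluate)
```

Because the E-step calls the current LM, each round's retrieval queries reflect the previous round's M-step update, which is the self-reinforcing cycle described under Key Novelty.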

📚 Prerequisite Knowledge

Prerequisites

Variational Expectation-Maximization (EM) algorithm
Retrieval-Augmented Generation (RAG)
Visual Question Answering (VQA) architectures

Key Terms

KB-VQA: Knowledge-based Visual Question Answering—VQA tasks requiring external world knowledge beyond just image recognition

Variational EM: An iterative optimization method where the E-step approximates a posterior distribution of latent variables and the M-step maximizes the expected log-likelihood

Latent Variable: Variables that are not directly observed but are inferred from the observed data (here, the 'rough answer' and 'retrieved knowledge')

Rough Answer: An intermediate output generated by the LM (e.g., a caption or initial guess) used to query the knowledge base

Late Interaction: A retrieval mechanism (as in ColBERT) that compares query and document encodings token by token rather than compressing each into a single vector

PRRecall: Pseudo-Relevance Recall—a metric measuring if retrieved documents contain the ground truth answer string

LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning technique that freezes pre-trained weights and injects trainable rank decomposition matrices

ColBERT: Contextualized Late Interaction over BERT—a neural retrieval model using late interaction