
Bootstrapping Large Language Models with Outside Knowledge for Knowledge-based Visual Question Answering

Y Min, Y Sun, Y Zhu, J Zhu, B Zhang
Department of Computer Science and Technology, Tsinghua University, State Key Laboratory for Novel Software Technology, Nanjing University, Alibaba Group
Machine Intelligence Research, 2026
MM RAG QA

📝 Paper Summary

Knowledge-Based Visual Question Answering (KB-VQA), Retrieval-Augmented Generation (RAG)
BootLM treats external knowledge as a latent variable, iteratively refining a Large Language Model using its own generated rough answers to guide retrieval and subsequent answer generation.
Core Problem
Large Models (LMs) often fail on KB-VQA tasks requiring fine-grained domain knowledge because their internal knowledge is too coarse, yet standard RAG pipelines struggle to align retrieved knowledge with the LM's reasoning.
Why it matters:
  • LMs hallucinate when lacking sufficient evidence for fine-grained questions (e.g., identifying specific events from visual cues)
  • Existing pipelines treat retrieval as a fixed pre-processing step, failing to adapt the retriever based on the LM's evolving understanding
  • Directly prompting LMs often misses domain-specific details (like specific animal breeds or local events) required for correct reasoning
Concrete Example: For the question 'What kind of event would these animals be at?' (the image shows sheep with yellow tags), the base LM guesses 'zoo'. BootLM first generates a rough answer ('sheep show'), uses it to retrieve 'State Fair of Texas', and then generates the correct, more specific answer 'fair'.
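The two-pass flow in the example above can be sketched in Python. This is a minimal illustration, not the authors' implementation: `bootstrap_answer`, `toy_generate`, `toy_retrieve`, and the two-passage corpus are hypothetical stand-ins, with a keyword-overlap retriever and a rule-based "LM" taking the place of the real models.

```python
import re
from typing import Callable, List, Tuple


def bootstrap_answer(
    question: str,
    generate: Callable[[str, List[str]], str],
    retrieve: Callable[[str, int], List[str]],
    top_k: int = 1,
) -> Tuple[str, List[str], str]:
    """Rough answer -> retrieval guided by the rough answer -> refined answer."""
    rough = generate(question, [])             # pass 1: no external knowledge
    query = f"{question} {rough}"              # the rough answer steers retrieval
    passages = retrieve(query, top_k)
    final = generate(question, passages)       # pass 2: conditioned on evidence
    return rough, passages, final


# Hypothetical toy components, for illustration only.
CORPUS = [
    "State Fair of Texas: livestock shows featuring sheep and cattle",
    "City zoo opening hours and ticket prices",
]


def _tokens(text: str) -> set:
    return set(re.findall(r"[a-z]+", text.lower()))


def toy_retrieve(query: str, top_k: int) -> List[str]:
    """Rank passages by the number of word types shared with the query."""
    q = _tokens(query)
    ranked = sorted(CORPUS, key=lambda p: -len(q & _tokens(p)))
    return ranked[:top_k]


def toy_generate(question: str, knowledge: List[str]) -> str:
    """Rule-based stand-in for the LM: refine only when evidence mentions a fair."""
    if any("fair" in passage.lower() for passage in knowledge):
        return "fair"
    return "sheep show"


rough, passages, final = bootstrap_answer(
    "What kind of event would these animals be at?", toy_generate, toy_retrieve
)
```

In this toy run the rough answer adds the word "sheep" to the query, which is what makes the fair passage outrank the zoo passage.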
Key Novelty
Bootstrapping LMs via Variational EM
  • Formulates KB-VQA as a probabilistic model where the 'rough answer' and 'retrieved knowledge' are latent variables
  • Uses the LM itself to approximate the posterior (generate rough answers) in the E-step, rather than a separate model
  • Updates the LM in the M-step using a likelihood objective derived from its own retrieved evidence, creating a self-reinforcing cycle
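The bullets above can be written as a generic variational-EM bound. This is a sketch under assumptions, not the paper's exact objective: the symbols x (image and question), a (answer), r (rough answer), k (retrieved knowledge), θ (LM parameters), φ (retriever parameters), and q (approximate posterior) are notation introduced here, and both the factorization of the joint and the joint update of the retriever are assumed for illustration.

```latex
% Marginal likelihood with latent rough answer r and latent knowledge k
p_{\theta,\phi}(a \mid x)
  = \sum_{r,k} p_\theta(r \mid x)\, p_\phi(k \mid x, r)\, p_\theta(a \mid x, r, k)

% Evidence lower bound, valid for any distribution q over (r, k)
\log p_{\theta,\phi}(a \mid x) \ge
  \mathbb{E}_{q(r,k \mid x,a)}\Big[
      \log p_\theta(r \mid x) + \log p_\phi(k \mid x, r)
    + \log p_\theta(a \mid x, r, k) - \log q(r,k \mid x,a)
  \Big]

% E-step (assumed form): approximate q with samples from the current LM and retriever
q^{(t)}(r,k \mid x,a) \approx p_{\theta^{(t)}}(r \mid x)\, p_{\phi^{(t)}}(k \mid x, r)

% M-step: maximize the expected complete-data log-likelihood under q^{(t)}
(\theta^{(t+1)}, \phi^{(t+1)}) = \arg\max_{\theta,\phi}\;
  \mathbb{E}_{q^{(t)}}\Big[
      \log p_\theta(r \mid x) + \log p_\phi(k \mid x, r) + \log p_\theta(a \mid x, r, k)
  \Big]
```

Alternating the two steps gives the self-reinforcing cycle described above: better rough answers yield better retrieved evidence, which in turn yields a better-trained LM.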
Evaluation Highlights
  • Achieves 62.00 VQA score on OK-VQA, outperforming standard fine-tuned MiniGPT-v2 (57.82) by 4.18 points
  • Surpasses much larger models such as Flamingo (80B) and PromptCap with GPT-3 (175B) while using a base model with fewer than 10B parameters
  • Retrieval module reaches 89.12% PRRecall@5 when using rough answers and ROI features, outperforming standard DPR and VRR baselines
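For readers unfamiliar with the two metrics quoted above, the following Python sketch shows how they are commonly computed. It rests on two assumptions: `vqa_score` uses the widely cited simplified VQA accuracy min(1, matches / 3), whereas the official evaluation script applies additional answer normalization; and `pr_recall_at_k` reads PRRecall@K as pseudo-relevance recall, where a question counts as a hit if any of its top-K passages contains a gold answer string. Both function names and the sample data are hypothetical.

```python
from typing import List, Sequence


def vqa_score(prediction: str, human_answers: Sequence[str]) -> float:
    """Simplified VQA accuracy: full credit once 3 annotators agree with the prediction."""
    pred = prediction.strip().lower()
    matches = sum(1 for ans in human_answers if ans.strip().lower() == pred)
    return min(1.0, matches / 3.0)


def pr_recall_at_k(
    retrieved: List[List[str]],
    gold_answers: List[List[str]],
    k: int = 5,
) -> float:
    """Fraction of questions whose top-k passages contain at least one gold answer string."""
    if not retrieved:
        return 0.0
    hits = 0
    for passages, answers in zip(retrieved, gold_answers):
        top = [p.lower() for p in passages[:k]]
        if any(ans.lower() in p for p in top for ans in answers):
            hits += 1
    return hits / len(retrieved)


# Hypothetical data: two annotators out of ten agree with the prediction.
partial = vqa_score("fair", ["fair", "fair"] + ["zoo"] * 8)
full = vqa_score("fair", ["fair"] * 5 + ["zoo"] * 5)

# Hypothetical data: one of the two questions has a passage containing its gold answer.
recall = pr_recall_at_k(
    retrieved=[["State Fair of Texas livestock show"], ["City zoo opening hours"]],
    gold_answers=[["fair"], ["farm"]],
    k=5,
)
```

A dataset-level VQA score such as the 62.00 reported above is the mean of the per-question scores, expressed as a percentage.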
Breakthrough Assessment
7/10
Strong methodological contribution applying Variational EM to RAG-VQA. Achieves SOTA-competitive results with smaller models, effectively bridging implicit and explicit knowledge.