Symphony: Towards Trustworthy Question Answering and Verification using RAG over Multimodal Data Lakes

📝 Paper Summary

Modularized RAG pipeline, Multimodal Data Lakes

Symphony is a multimodal RAG system that decomposes complex questions for reasoning and employs a loosely coupled verification module to cross-check answers against private or public data lakes.

Core Problem

LLMs often hallucinate inaccurate information, especially when dealing with complex queries over multimodal data lakes where factual correctness is critical for decision-making.

Why it matters:

In 2023, chatbots were estimated to hallucinate 27% of the time, with factual errors in 46% of generated texts, undermining trust in high-stakes applications.
Existing solutions focus on alignment or prompt engineering but lack robust, explicit verification mechanisms against reliable external data sources (like private enterprise data lakes).
Complex questions often require aggregating information from multiple heterogeneous sources (tables, text, images), which standard single-step retrieval frequently handles incorrectly.

Concrete Example: A user asks about a film's cast based on a Wikipedia table. The LLM might hallucinate that 'Meagan Good' did not appear in 'Stomp the Yard'. Symphony retrieves the specific cast table, identifies 'Meagan Good' in the 'April Palmer' role row, and refutes the LLM's claim with evidence.

Key Novelty

Decompose-Reason-Verify Framework for Multimodal RAG

Separates the Reasoning process (generating an answer via question decomposition and tool use) from the Verification process (checking that answer against data lakes).
Uses an iterative, prompt-based decomposition strategy where an LLM breaks complex queries into sub-questions targeting specific data items (tables or text).
Introduces a verification module that treats the generated answer as a hypothesis, retrieving supporting/refuting evidence from potentially different (private) data lakes to validate it.

Evaluation Highlights

On a multimodal data lake of 400K tables and 6M passages, Symphony achieves 77.8% Recall@20 for retrieving relevant data items.
In a verification task using TabFact, the task-specific PASTA model achieves 89% accuracy when relevant tables are retrieved, outperforming GPT-3.5 (75%).
Symphony's decomposition strategy successfully generates useful sub-queries for 77.8% of test cases (score of 2/2 by human evaluation).

Breakthrough Assessment

6/10

Proposes a solid architecture for trustworthy RAG with verification. While the components (decomposition, retrieval, verification) are known, integrating them into a unified multimodal system is valuable. Evaluation is preliminary (small sample sizes).

⚙️ Technical Details

Problem Definition

Setting: Question Answering and Verification over Multimodal Data Lake L

Inputs: Natural language question Q and Multimodal Data Lake L

Outputs: Answer A and Verification Result (Correct/Incorrect with explanation)

Pipeline Flow

Discovery Module: Retrieves relevant data items (text, tables, images)
Reasoning Module: Decomposes question → executes sub-queries → aggregates sub-answers
Verification Module: Takes generated answer → retrieves evidence → verifies correctness
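
The three-stage flow above can be sketched as plain function composition. This is a minimal illustration under stated assumptions, not Symphony's implementation: the names `discover`, `reason`, `verify`, and `answer_and_verify`, the toy data items, and the word-overlap ranking are all hypothetical stand-ins chosen for this example.

```python
import re
from dataclasses import dataclass

@dataclass
class DataItem:
    item_id: str
    modality: str   # "text", "table", or "image"
    content: str

def tokens(text):
    # Lowercased alphanumeric word set; a toy substitute for real indexing.
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def discover(query, lake, k=2):
    """Toy discovery: rank items by word overlap with the query."""
    q = tokens(query)
    ranked = sorted(lake, key=lambda it: len(q & tokens(it.content)), reverse=True)
    return ranked[:k]

def reason(question, items):
    """Placeholder for decompose -> execute -> aggregate: returns the top item's text."""
    return items[0].content if items else ""

def verify(answer, lake, k=2):
    """Toy verification: retrieve evidence for the answer from `lake` and
    report whether any retrieved item contains it."""
    for item in discover(answer, lake, k):
        if answer and answer in item.content:
            return True, f"supported by {item.item_id}"
    return False, "no supporting evidence retrieved"

def answer_and_verify(question, qa_lake, verification_lake):
    # The verification lake is a separate argument, so it may differ from
    # the lake used to generate the answer.
    answer = reason(question, discover(question, qa_lake))
    return answer, verify(answer, verification_lake)

# Hypothetical lakes for illustration only.
public_lake = [
    DataItem("t1", "table", "Stomp the Yard cast: Meagan Good as April Palmer"),
    DataItem("p1", "text", "A passage about an unrelated topic"),
]
private_lake = [DataItem("p2", "text", "Internal memo with no film data")]

question = "Who is in the cast of Stomp the Yard?"
answer, same_lake = answer_and_verify(question, public_lake, public_lake)
_, other_lake = answer_and_verify(question, public_lake, private_lake)
print(same_lake)
print(other_lake)
```

Passing the verification lake separately mirrors the loose coupling described in the summary: the same answer can be checked against a different repository without touching the reasoning step.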

System Modules

Discovery Engine

Identify relevant data files from multimodal data lakes efficiently

Model or implementation: Hybrid: BM25/TF-IDF (word-level) + Dense Encoders (embedding-based)
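
A hybrid ranker of this kind can be sketched as follows. The summary does not say how Symphony fuses the word-level and embedding-based scores, so the weighted sum with min-max normalization (`alpha`) is an assumption, and `trigram_vector` is only a character-trigram stand-in for a learned dense encoder. The BM25 function uses the common Okapi form with a non-negative IDF.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of every document for the query."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    avgdl = sum(len(t) for t in tokenized) / n
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in set(tokenize(query)):
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

def trigram_vector(text):
    """Stand-in for a dense encoder (assumption): character-trigram counts."""
    s = text.lower()
    return Counter(s[i:i + 3] for i in range(len(s) - 2))

def cosine(u, v):
    dot = sum(u[key] * v[key] for key in u if key in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def min_max(xs):
    lo, hi = min(xs), max(xs)
    return [0.0] * len(xs) if hi == lo else [(x - lo) / (hi - lo) for x in xs]

def hybrid_rank(query, docs, alpha=0.5, k=20):
    """Return indices of the top-k documents by fused score (fusion rule assumed)."""
    sparse = min_max(bm25_scores(query, docs))
    q_vec = trigram_vector(query)
    dense = min_max([cosine(q_vec, trigram_vector(d)) for d in docs])
    fused = [alpha * s + (1 - alpha) * d for s, d in zip(sparse, dense)]
    return sorted(range(len(docs)), key=lambda i: fused[i], reverse=True)[:k]

docs = [
    "Stomp the Yard cast table: Meagan Good, April Palmer",
    "History of the Eiffel Tower in Paris",
    "Recipe for tomato soup",
]
order = hybrid_rank("Meagan Good Stomp the Yard", docs)
print(order)
```

In a real deployment the trigram stand-in would be replaced by an actual embedding model, and the fusion weight would be tuned on held-out queries.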

Decomposition Agent (Reasoning)

Break down complex queries into manageable sub-questions targeted at specific data sources

Model or implementation: LLM (e.g., GPT-3) with iterative prompting
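
The iterative prompting loop might look like the sketch below. The paper's prompt templates are not published, so `build_prompt`, the `<item_id> | <sub-question>` reply format, the `DONE` stop word, and the scripted stand-in for the LLM are all hypothetical choices made for this example.

```python
def build_prompt(question, item_ids, history):
    """Hypothetical prompt layout; not the paper's template."""
    asked = "\n".join(f"- [{i}] {q}" for i, q in history) or "- (none yet)"
    return (
        f"Question: {question}\n"
        f"Available data items: {', '.join(item_ids)}\n"
        f"Sub-questions so far:\n{asked}\n"
        "Reply with '<item_id> | <next sub-question>' or DONE."
    )

def decompose(question, item_ids, llm, max_steps=5):
    """Ask the LLM for one sub-question per round until it replies DONE."""
    history = []
    for _ in range(max_steps):
        reply = llm(build_prompt(question, item_ids, history)).strip()
        if reply == "DONE":
            break
        item_id, sep, sub_q = reply.partition("|")
        if not sep or item_id.strip() not in item_ids:
            break  # malformed reply: stop rather than guess
        history.append((item_id.strip(), sub_q.strip()))
    return history

def scripted_llm(replies):
    """Test double that replays canned replies instead of calling a model."""
    it = iter(replies)
    return lambda prompt: next(it, "DONE")

llm = scripted_llm([
    "cast_table | Which role did Meagan Good play?",
    "film_article | When was the film released?",
    "DONE",
])
plan = decompose(
    "Which role did Meagan Good play in Stomp the Yard, and when was the film released?",
    ["cast_table", "film_article"],
    llm,
)
print(plan)
```

The `max_steps` cap is a defensive choice so that a model that never emits the stop word cannot loop indefinitely.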

Executor & Aggregator (Reasoning)

Execute sub-questions using tools (NL2SQL, TableQA) and combine results

Model or implementation: LLM or DBMS tools (NL2SQL)
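
As an illustration of the execute-then-aggregate step, the sketch below runs sub-questions against an in-memory SQLite table. The `toy_nl2sql` lookup is a hypothetical stand-in for a real NL2SQL model, the table name `cast_list` and the placeholder row are invented for the example, and the string-joining aggregator is only one possible way to combine sub-answers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cast_list (actor TEXT, role TEXT)")
conn.executemany(
    "INSERT INTO cast_list VALUES (?, ?)",
    [
        ("Meagan Good", "April Palmer"),            # row taken from the example above
        ("Placeholder Actor", "Placeholder Role"),  # invented filler row
    ],
)

def toy_nl2sql(sub_question):
    """Hypothetical NL2SQL stand-in: maps known sub-questions to parameterized SQL."""
    templates = {
        "Which role did Meagan Good play?": (
            "SELECT role FROM cast_list WHERE actor = ?",
            ("Meagan Good",),
        ),
        "How many cast members are listed?": (
            "SELECT COUNT(*) FROM cast_list",
            (),
        ),
    }
    return templates[sub_question]

def execute(conn, sub_question):
    """Translate one sub-question to SQL and return the first column of each row."""
    sql, params = toy_nl2sql(sub_question)
    return [row[0] for row in conn.execute(sql, params).fetchall()]

def aggregate(sub_answers):
    """Combine (sub-question, answers) pairs into one answer string."""
    return "; ".join(
        f"{q} -> {', '.join(str(a) for a in answers)}" for q, answers in sub_answers
    )

sub_questions = [
    "Which role did Meagan Good play?",
    "How many cast members are listed?",
]
sub_answers = [(q, execute(conn, q)) for q in sub_questions]
final_answer = aggregate(sub_answers)
print(final_answer)
```

Parameterized queries keep the generated SQL separate from user-supplied values, which matters when the SQL text comes from a model.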

Verifier

Assess the correctness of the provided answer A using retrieved evidence

Model or implementation: Dual approach: Generic LLM (GPT-3.5) OR Task-specific model (PASTA)
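
The idea of treating an answer as a hypothesis and checking it against retrieved evidence can be shown with a rule-based stand-in. This sketch does not reproduce PASTA or GPT-3.5; the structured `claim` dictionary, the `verify_claim` function, and the extra "unverifiable" outcome for missing evidence are assumptions introduced for the example.

```python
def verify_claim(claim, evidence_tables):
    """Check a structured claim against retrieved cast tables.

    claim: {"film": str, "actor": str, "appeared": bool}
    evidence_tables: {film title: list of {"actor": ..., "role": ...} rows}
    Returns (verdict, explanation).
    """
    table = evidence_tables.get(claim["film"])
    if table is None:
        # No relevant evidence retrieved: decline to judge (assumed behavior).
        return "unverifiable", "no table for this film was retrieved"

    row = next((r for r in table if r["actor"] == claim["actor"]), None)
    appeared = row is not None
    verdict = "correct" if appeared == claim["appeared"] else "incorrect"
    if row is not None:
        explanation = f"{row['actor']} is listed in the role '{row['role']}'"
    else:
        explanation = f"{claim['actor']} is not in the retrieved cast table"
    return verdict, explanation

# Evidence limited to the row mentioned in the concrete example.
evidence = {"Stomp the Yard": [{"actor": "Meagan Good", "role": "April Palmer"}]}

# The hallucinated claim: Meagan Good did NOT appear in the film.
hallucinated = {"film": "Stomp the Yard", "actor": "Meagan Good", "appeared": False}
print(verify_claim(hallucinated, evidence))
print(verify_claim(hallucinated, {}))
```

Returning an explanation alongside the verdict matches the output described in the problem definition (Correct/Incorrect with explanation).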

Novel Architectural Elements

Loose coupling of Reasoning and Verification modules, allowing verification against different (private) data sources than those used for generation
Iterative prompt-based decomposition loop that dynamically assigns sub-questions to specific retrieved data items

Modeling

Base Model: GPT-3 / GPT-3.5 for general reasoning and decomposition; PASTA for table verification

Training Method: Prompt Engineering and RAG (Evaluation-only paper)

Adaptation: None (Inference-only RAG)

Compute: Not reported in the paper

Comparison to Prior Work

vs. TrustLLM: Symphony focuses on post-hoc verification using external data lakes rather than intrinsic model evaluation
vs. Standard RAG (e.g., Atlas): Explicit decomposition step for complex queries and a dedicated verification loop after generation
vs. Self-RAG [not cited in paper]: Uses external data lakes for verification rather than self-reflection tokens
vs. Chain-of-Verification [not cited in paper]: Symphony retrieves new evidence specifically for verification, whereas CoVe typically relies on the model's internal knowledge or initial context

Limitations

Decomposition struggles with complex syntactic structures (e.g., misinterpreting subjects in passive voice sentences)
Cross-modal discovery is still preliminary; current work barely touches on modeling relationships across modalities
Task-specific verifiers (PASTA) fail to generalize when retrieved evidence is irrelevant (dropping accuracy from 0.89 to 0.72)
Evaluation is conducted on a small set of manually crafted queries (18 queries) for the reasoning module

Reproducibility

Code: Not reported in the paper

No code repository is provided. The data lake uses public resources (Wikipedia, TabFact), but the specific subset and index are not released. Prompt templates for decomposition are described conceptually, but their full text is not provided.

📊 Experiments & Results

Evaluation Setup

Two-part evaluation: (1) Reasoning over Wikipedia tables/text, (2) Verification using TabFact claims

Benchmarks:

Custom Wikipedia Subset (Multimodal QA (Tables + Text)) [New]
TabFact (Fact Verification)

Metrics:

Recall@K (R@K)
Decomposition Quality Score (Human Eval 0-2)
Verification Accuracy
Statistical methodology: Not explicitly reported in the paper
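
Recall@K, the first metric above, can be computed as in the sketch below. The summary does not state how the paper averages across queries, so the per-query macro-average used here is an assumption, and the query and item identifiers are invented.

```python
def recall_at_k(retrieved, relevant, k):
    """Macro-averaged Recall@K.

    retrieved: {query_id: ranked list of item ids}
    relevant:  {query_id: set of relevant item ids}
    For each query, the fraction of its relevant items that appear in the
    top-k retrieved results; the result is the mean over queries.
    """
    per_query = []
    for qid, gold in relevant.items():
        if not gold:
            continue  # skip queries with no relevant items
        top_k = set(retrieved.get(qid, [])[:k])
        per_query.append(len(top_k & gold) / len(gold))
    return sum(per_query) / len(per_query) if per_query else 0.0

# Invented toy data.
retrieved = {"q1": ["a", "b", "c", "d"], "q2": ["x", "y", "z"]}
relevant = {"q1": {"a", "d"}, "q2": {"z", "w"}}

print(recall_at_k(retrieved, relevant, k=2))  # q1: 1/2, q2: 0/2 -> 0.25
print(recall_at_k(retrieved, relevant, k=4))  # q1: 2/2, q2: 1/2 -> 0.75
```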

Key Results

Data Discovery performance on the custom Wikipedia subset containing 400K tables and 6M passages.

Benchmark	Metric	Baseline	This Paper	Δ
Custom Wikipedia Subset	Recall@5	Not reported in the paper	40.8%	Not reported in the paper
Custom Wikipedia Subset	Recall@20	Not reported in the paper	77.8%	Not reported in the paper

Verification performance on TabFact claims, where Baseline is the generic LLM (GPT-3.5) and This Paper is the task-specific model (PASTA).

Benchmark	Metric	Baseline	This Paper	Δ
TabFact	Accuracy (Relevant Evidence)	0.75	0.89	+0.14
TabFact	Accuracy (Irrelevant Evidence)	0.91	0.72	-0.19

Main Takeaways

Task-specific models (PASTA) are superior for verification when relevant evidence is found (0.89 vs. 0.75 accuracy, +14 percentage points), but generic LLMs (GPT-3.5) are more robust when the retrieved evidence is irrelevant (0.91 vs. 0.72).
Effective query decomposition allows answering complex questions that require joining information from multiple sources (e.g., text + table).
Discovery recall rises from 40.8% at K=5 to 77.8% at K=20 (+37.0 percentage points), so many relevant items are ranked below the top 5; this indicates the need for effective reranking or larger context windows.
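
The differences quoted in these takeaways can be re-derived from the reported numbers. The short check below separates absolute differences (percentage points) from the relative gain; the variable names are chosen for this example only.

```python
# Accuracies reported for TabFact verification.
pasta_relevant, gpt_relevant = 0.89, 0.75
pasta_irrelevant, gpt_irrelevant = 0.72, 0.91

# Discovery recall (in percent) on the custom Wikipedia subset.
recall_at_5, recall_at_20 = 40.8, 77.8

# Absolute differences, as in the Δ column of the results table.
delta_relevant = round(pasta_relevant - gpt_relevant, 2)
delta_irrelevant = round(pasta_irrelevant - gpt_irrelevant, 2)

# Relative gain of PASTA over GPT-3.5 with relevant evidence, in percent.
relative_gain = round(100 * (pasta_relevant - gpt_relevant) / gpt_relevant, 1)

# Recall gain from K=5 to K=20, in percentage points.
recall_gain = round(recall_at_20 - recall_at_5, 1)

print(delta_relevant, delta_irrelevant, relative_gain, recall_gain)
```

The 0.14 absolute difference corresponds to 14 percentage points, which is a relative improvement of about 18.7% over the GPT-3.5 accuracy.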

📚 Prerequisite Knowledge

Prerequisites

Retrieval-Augmented Generation (RAG)
Vector Embeddings
Database Management Systems (DBMS)
Natural Language to SQL (NL2SQL)

Key Terms

Multimodal Data Lake: A centralized repository that stores data in various formats (text, tables, images) at scale

RAG: Retrieval-Augmented Generation, an approach in which a system first retrieves relevant documents and then generates an answer conditioned on them

NL2SQL: Natural Language to SQL—converting human language questions into database queries

TabFact: A benchmark dataset for verifying factual claims based on tabular data

PASTA: A pre-trained model designed specifically for table-based fact verification tasks

Recall@K: The percentage of a query's relevant items that appear among the top-K retrieved results

TF-IDF: Term Frequency-Inverse Document Frequency—a statistical measure used to evaluate how important a word is to a document in a collection

BM25: Best Matching 25—a ranking function used by search engines to estimate the relevance of documents to a given search query

CLIP: Contrastive Language-Image Pre-training—a model that learns to associate images with text captions

Embedding: A dense vector representation of data (text, image, etc.) where similar items are close in vector space