RAG-based Question Answering over Heterogeneous Data and Text

📝 Paper Summary

Modularized RAG pipeline

Quasar integrates unstructured text, structured tables, and knowledge graphs into a unified RAG pipeline by converting all sources into verbalized evidence and refining that evidence through iterative GNN-based re-ranking.

Core Problem

LLMs struggle with long-tail entities and multi-hop questions requiring evidence from diverse modalities (text, tables, KGs), often failing to recall unpopular facts or aggregate dispersed information.

Why it matters:

Standard LLMs and RAG systems primarily rely on text (web pages/Wikipedia), neglecting the vast amount of high-quality structured data available in online tables and knowledge graphs
Questions about less popular entities (e.g., 'Lithuanian players in the German handball league') cause hallucinations in standard LLMs due to low frequency in pre-training data
Existing heterogeneous QA systems often lack robust components for question understanding and evidence filtering, leading to noise that confuses the answer generator

Concrete Example: For the question 'Which Chinese NBA player has the most matches?', a standard LLM might hallucinate based on popularity. Quasar solves this by retrieving from KGs (player lists), text (biographies), and tables (season stats), then aggregating these distinct modalities to find the correct answer.

Key Novelty

Unified Verbalization and GNN-based Filtering for Heterogeneous RAG

Converts all data types—Knowledge Graph triples, table rows, and text sentences—into a uniform 'verbalized' textual format, allowing a single downstream model to process them identically
Uses a Graph Neural Network (GNN) to model relationships between the question and retrieved evidence candidates, iteratively pruning the evidence pool from thousands to a high-quality top-k subset
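
A minimal sketch of the verbalization idea, assuming simple string templates. The function names, the "column is value" template, and the placeholder entities (Player A, Team X) are illustrative assumptions, not the paper's actual templates.

```python
from typing import Dict, Tuple


def verbalize_triple(triple: Tuple[str, str, str]) -> str:
    """Turn a KG triple (subject, predicate, object) into a pseudo-sentence."""
    subject, predicate, obj = triple
    return f"{subject} {predicate} {obj}"


def verbalize_table_row(row: Dict[str, str], context: str = "") -> str:
    """Turn one table row into a pseudo-sentence, optionally prefixed by page context."""
    cells = ", ".join(f"{column} is {value}" for column, value in row.items())
    return f"{context}, {cells}" if context else cells


def verbalize_sentence(sentence: str) -> str:
    """Text is already verbal; only normalize whitespace."""
    return " ".join(sentence.split())


# Placeholder data (hypothetical): one item per source type.
evidence = [
    verbalize_triple(("Player A", "member of sports team", "Team X")),
    verbalize_table_row({"Season": "2002-03", "Games played": "82"},
                        context="Player A career statistics"),
    verbalize_sentence("Player A   joined Team X\n in 2002."),
]
for item in evidence:
    print(item)
```

After this step every item is a plain string, so the ranking and generation stages do not need source-specific handling.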

Architecture

The four-stage pipeline of Quasar: Question Understanding, Evidence Retrieval, Re-ranking & Filtering, and Answer Generation.

Evaluation Highlights

Achieves comparable or better performance than GPT-4 on the CompMix benchmark while using orders of magnitude fewer parameters (8B vs. estimated >1T)
Establishes a new state-of-the-art on the TimeQuestions benchmark, significantly outperforming GPT-4 and Llama-3 baselines on temporal reasoning tasks
Demonstrates that combining all three sources (Text + KG + Tables) yields higher accuracy than any single or dual-source combination (e.g., Text+Tables)

Breakthrough Assessment

7/10

Strong integration of heterogeneous sources with a unified interface. While the individual components (GNNs, verbalization) exist in prior work, the end-to-end efficiency and SOTA results on TimeQuestions are significant.

⚙️ Technical Details

Problem Definition

Setting: Open-domain Question Answering over heterogeneous sources (Text, Knowledge Graphs, Tables)

Inputs: Natural language question Q

Outputs: Answer A (entity, date, or literal) and supporting evidence snippets

Pipeline Flow

Question Understanding: Input Question → BART (SI Generation) → Structured Intent
Retrieval: Structured Intent → Clocq (KG) + BM25 (Text/Tables) → Raw Evidence Pool
Re-ranking: Raw Evidence Pool → GNN/Cross-Encoder → Top-k Evidence
Generation: Top-k Evidence + SI → Llama-3 → Answer
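
The flow above can be sketched as a composition of four callables. This is only a structural sketch: the stage functions are toy stand-ins (hypothetical names and behavior) for BART, Clocq/BM25, the GNN re-ranker, and Llama-3, and the data is placeholder text.

```python
from typing import Callable, Dict, List

StructuredIntent = Dict[str, str]


def answer_question(
    question: str,
    understand: Callable[[str], StructuredIntent],
    retrieve: Callable[[StructuredIntent], List[str]],
    rerank: Callable[[StructuredIntent, List[str], int], List[str]],
    generate: Callable[[StructuredIntent, List[str]], str],
    top_k: int = 30,
) -> str:
    """Run the four stages in order and return the generated answer."""
    intent = understand(question)               # Question Understanding
    pool = retrieve(intent)                     # Evidence Retrieval
    top_evidence = rerank(intent, pool, top_k)  # Re-ranking and Filtering
    return generate(intent, top_evidence)       # Answer Generation


# Toy stand-ins (assumptions for illustration only).
def toy_understand(question: str) -> StructuredIntent:
    return {"question": question, "entities": "Player A"}


def toy_retrieve(intent: StructuredIntent) -> List[str]:
    return ["Player A played 82 games", "Team X won the title", "Player A joined Team X"]


def toy_rerank(intent: StructuredIntent, pool: List[str], k: int) -> List[str]:
    return [ev for ev in pool if intent["entities"] in ev][:k]


def toy_generate(intent: StructuredIntent, evidence: List[str]) -> str:
    return f"answer drawn from {len(evidence)} evidence snippets"


result = answer_question("How many games did Player A play?",
                         toy_understand, toy_retrieve, toy_rerank, toy_generate)
print(result)
```

Passing the stages as callables mirrors the modular design: any stage can be swapped without touching the others.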

System Modules

Question Understanding (QU)

Decompose natural language question into structured intent (SI) with slots (Ans-Type, Entities, Time, etc.)

Model or implementation: BART-base (140M params)
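
To make the Structured Intent concrete, here is a sketch of it as a small data structure. The slot names follow the summary (Ans-Type, Entities, Time, Relation); the class, the `linearize` format, and the slot values chosen for the example question are assumptions for illustration, not the paper's exact output format.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class StructuredIntent:
    """Hypothetical container for the slots a QU model would fill."""
    answer_type: str
    entities: List[str] = field(default_factory=list)
    relation: str = ""
    time: Optional[str] = None

    def linearize(self) -> str:
        """Flatten the slots into one string, e.g. as a seq2seq target."""
        parts = [
            f"Ans-Type: {self.answer_type}",
            f"Entities: {', '.join(self.entities)}",
            f"Relation: {self.relation}",
        ]
        if self.time:
            parts.append(f"Time: {self.time}")
        return " | ".join(parts)


# Illustrative slot values for "Which Chinese NBA player has the most matches?"
intent = StructuredIntent(
    answer_type="basketball player",
    entities=["China", "NBA"],
    relation="most matches played",
)
print(intent.linearize())
```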

Evidence Retrieval (ER)

Fetch candidate evidence from KG, Text, and Tables using the Structured Intent

Model or implementation: Clocq (for KG) + BM25 (for Text/Tables)
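
For readers unfamiliar with BM25, the following is a minimal, self-contained scoring sketch. It uses the common Okapi formulation with a smoothed IDF; the defaults `k1=1.5` and `b=0.75` are conventional values and an assumption here, since the summary does not report the settings used. The documents are placeholder pseudo-sentences.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: List[str], docs: List[List[str]],
                k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score each tokenized document against the tokenized query with BM25."""
    n_docs = len(docs)
    avg_len = sum(len(doc) for doc in docs) / n_docs
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        term_freq = Counter(doc)
        score = 0.0
        for term in query:
            if term not in term_freq:
                continue
            df = doc_freq[term]
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            tf = term_freq[term]
            length_norm = k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / (tf + length_norm)
        scores.append(score)
    return scores


docs = [
    "player a played 82 games".split(),
    "team x won the title".split(),
    "player b joined team x".split(),
]
scores = bm25_scores("player games".split(), docs)
print(scores)
```

The first document matches both query terms, the third matches one, and the second matches none, so the scores decrease in that order.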

Re-Ranking & Filtering (RF)

Iteratively prune the large evidence pool to a small, high-quality subset

Model or implementation: Graph Neural Network (GNN) initialized with Cross-Encoder embeddings
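
The intuition behind graph-based evidence scoring can be shown with one round of mean-aggregation message passing on an evidence/entity graph. This is a deliberately simplified stand-in, not the paper's GNN: there are no learned weights, the vectors are hand-picked toy values rather than Cross-Encoder embeddings, and all names are hypothetical.

```python
from typing import Dict, List

Vector = List[float]


def mean(vectors: List[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def message_passing_scores(question_vec: Vector,
                           evidence_vecs: Dict[str, Vector],
                           edges: Dict[str, List[str]]) -> Dict[str, float]:
    """One round: evidence -> entity -> evidence, then score against the question.

    `edges` maps each evidence id to the entity ids it mentions.
    """
    # Entity state = mean of the evidence vectors that mention the entity.
    members: Dict[str, List[Vector]] = {}
    for ev_id, entity_ids in edges.items():
        for entity_id in entity_ids:
            members.setdefault(entity_id, []).append(evidence_vecs[ev_id])
    entity_vecs = {entity_id: mean(vecs) for entity_id, vecs in members.items()}

    # Evidence update = mean of its own vector and its neighbouring entity states.
    scores = {}
    for ev_id, vec in evidence_vecs.items():
        neighbours = [entity_vecs[e] for e in edges.get(ev_id, [])]
        scores[ev_id] = dot(question_vec, mean([vec] + neighbours))
    return scores


gnn_scores = message_passing_scores(
    question_vec=[1.0, 0.0],
    evidence_vecs={"e1": [1.0, 0.0], "e2": [0.0, 1.0], "e3": [0.0, 1.0]},
    edges={"e1": ["A"], "e2": ["A"], "e3": ["B"]},
)
print(gnn_scores)
```

Here `e2` and `e3` start with identical vectors, but `e2` shares entity `A` with the question-relevant `e1` and therefore ends up with a higher score. That propagation of relevance through shared entities is the effect the graph structure is meant to capture.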

Answer Generation (AG)

Generate the final answer and optional explanations

Model or implementation: Llama-3.1-8B-Instruct

Novel Architectural Elements

Iterative GNN-based pruning pipeline that progressively reduces evidence candidates (1000 → 100 → 30) to make LLM inference computationally feasible
Unified verbalization layer that treats KG triples, table rows, and text sentences as indistinguishable 'pseudo-sentences' for the ranking and generation stages
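
The staged reduction (1000 → 100 → 30) can be sketched as a loop that re-scores the surviving pool at each stage and keeps the top candidates. The scorer below is a toy token-overlap function standing in for the GNN, and the function names and synthetic snippets are assumptions for illustration.

```python
from typing import Callable, List, Sequence


def iterative_prune(question: str,
                    evidence: Sequence[str],
                    score_fn: Callable[[str, str, List[str]], float],
                    stages: Sequence[int] = (100, 30)) -> List[str]:
    """Shrink the evidence pool stage by stage, re-scoring on each smaller pool."""
    pool = list(evidence)
    for keep in stages:
        current = pool  # the scorer may look at the whole surviving pool
        ranked = sorted(pool, key=lambda ev: score_fn(question, ev, current),
                        reverse=True)
        pool = ranked[:keep]
    return pool


def overlap_score(question: str, ev: str, pool: List[str]) -> float:
    """Toy stand-in for the GNN: number of tokens shared with the question."""
    return float(len(set(question.split()) & set(ev.split())))


# 1000 synthetic snippets; 20 of them are about "topic 7".
candidates = [f"snippet {i} about topic {i % 50}" for i in range(1000)]
top = iterative_prune("topic 7", candidates, overlap_score)
print(len(top), sum(ev.endswith("topic 7") for ev in top))
```

Pruning in stages keeps the expensive final step (LLM generation) limited to a small context while letting each re-scoring pass operate on a cleaner pool than the one before it.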

Modeling

Base Model: Llama-3.1-8B-Instruct (for generation), BART-base (for intent), MiniLM (for encoding)

Training Method: Supervised Fine-Tuning (SFT) and Weak Supervision for GNN

Objective Functions:

Purpose: Train GNN to identify relevant evidence.

Formally: Weak supervision where evidence nodes are labeled relevant if connected to a gold answer.
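
A minimal sketch of this weak labeling rule, assuming each evidence node comes with the set of entities it is connected to. The function name, the data layout, and the placeholder entities are illustrative assumptions.

```python
from typing import Dict, Set


def weak_labels(evidence_entities: Dict[str, Set[str]],
                gold_answers: Set[str]) -> Dict[str, int]:
    """Label an evidence node 1 if it is connected to any gold answer, else 0."""
    return {
        ev_id: int(bool(entities & gold_answers))
        for ev_id, entities in evidence_entities.items()
    }


labels = weak_labels(
    evidence_entities={
        "ev1": {"Player A", "Team X"},
        "ev2": {"Team X"},
        "ev3": {"Player B"},
    },
    gold_answers={"Player A"},
)
print(labels)
```

The labels are "weak" because a node connected to the gold answer is not guaranteed to actually support it, so some positives are noisy.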

Trainable Parameters: GNN weights, Llama-3.1 adapters (implied by fine-tuning description)

Training Data:

CompMix training data (generated silver pairs for BART)
Weak supervision labels for GNN derived from QA pairs

Key Hyperparameters:

learning_rate: Not reported in the paper
batch_size: 8 (AG stage)
epochs: 5 (GNN), 2 (AG)
warmup_ratio: 0.01
GNN_initial_evidence_nodes: 1000
GNN_initial_entity_nodes: 4000

Compute: Orders of magnitude lower cost than GPT-4 (inference mainly on small/moderate models)

Comparison to Prior Work

vs. UniK-QA: Quasar adds explicit Question Understanding and iterative GNN-based re-ranking
vs. Spaghetti: Quasar uses local/moderate models (8B) instead of relying solely on massive API-based models (GPT-4), achieving comparable accuracy at lower cost
vs. Binder [not cited in paper]: Quasar verbalizes tables rather than generating SQL/logic programs to query them

Limitations

Depends on the quality of entity disambiguation (Clocq); errors there propagate
Verbalization of very large tables might lose structural context compared to specialized table encoders
Focuses on one-shot questions; conversational capabilities are mentioned but are not the primary focus of the evaluation

Reproducibility

Code is based on the Explaignn project (https://explaignn.mpi-inf.mpg.de), but the specific Quasar repository is not yet released. Uses public benchmarks (CompMix, TimeQuestions, Crag).

📊 Experiments & Results

Evaluation Setup

QA over heterogeneous sources (Text, KG, Tables)

Benchmarks:

CompMix (Heterogeneous QA)
Crag (RAG-based QA, subset)
TimeQuestions (Temporal QA)

Metrics:

Precision at 1 (P@1)
Answer Presence (AP@k)
Mean Reciprocal Rank (MRR@k)
Statistical methodology: Not explicitly reported in the paper
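
The three metrics can be computed per question as sketched below, assuming their usual definitions: P@1 checks the top-ranked answer, AP@k checks whether any of the top-k evidence snippets contains a gold answer, and MRR@k is the reciprocal rank of the first such snippet. Matching by substring is a simplifying assumption, and the data is placeholder text.

```python
from typing import List, Set


def precision_at_1(ranked_answers: List[str], gold: Set[str]) -> float:
    """1.0 if the top-ranked answer is a gold answer, else 0.0."""
    return 1.0 if ranked_answers and ranked_answers[0] in gold else 0.0


def _contains_answer(snippet: str, gold: Set[str]) -> bool:
    return any(answer in snippet for answer in gold)


def answer_presence_at_k(ranked_evidence: List[str], gold: Set[str], k: int) -> float:
    """1.0 if any of the top-k snippets contains a gold answer, else 0.0."""
    return 1.0 if any(_contains_answer(ev, gold) for ev in ranked_evidence[:k]) else 0.0


def mrr_at_k(ranked_evidence: List[str], gold: Set[str], k: int) -> float:
    """Reciprocal rank of the first answer-bearing snippet in the top k, else 0.0."""
    for rank, ev in enumerate(ranked_evidence[:k], start=1):
        if _contains_answer(ev, gold):
            return 1.0 / rank
    return 0.0


ranked = ["Team X won the title", "Player A played 82 games", "Player B joined Team X"]
gold = {"Player A"}
print(precision_at_1(["Player A", "Player B"], gold),
      answer_presence_at_k(ranked, gold, 1),
      answer_presence_at_k(ranked, gold, 3),
      mrr_at_k(ranked, gold, 3))
```

Benchmark-level figures are the mean of these per-question values over all questions.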

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
CompMix	P@1	0.551	0.555	+0.004
CompMix	P@1	0.505	0.555	+0.050
TimeQuestions	P@1	0.490	0.570	+0.080
Crag	P@1	0.665	0.435	-0.230
CompMix	P@1	0.473	0.555	+0.082
CompMix	P@1	0.528	0.555	+0.027

Main Takeaways

Quasar matches or beats GPT-4 based systems on CompMix and TimeQuestions while using significantly smaller models (8B params vs proprietary API).
Unified retrieval (global ranking of mixed evidence) outperforms retrieving top-k separately per source type.
Question Understanding (Structured Intent) is critical; removing it drops performance noticeably.
Integrating all three sources (Text, KG, Tables) consistently yields the best performance compared to any subset of sources.
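
The difference between unified (global) ranking and per-source top-k selection can be illustrated with a toy evidence pool. The scores and snippets are made up for the illustration, and the sketch only shows how the two selection strategies differ, not why one yields better accuracy.

```python
from typing import List, Tuple

Evidence = Tuple[str, str, float]  # (source, text, relevance score)


def global_top_k(pool: List[Evidence], k: int) -> List[Evidence]:
    """Rank all evidence together and keep the k best, regardless of source."""
    return sorted(pool, key=lambda ev: ev[2], reverse=True)[:k]


def per_source_top_k(pool: List[Evidence], k_per_source: int) -> List[Evidence]:
    """Keep the k best items from each source separately."""
    selected: List[Evidence] = []
    for source in sorted({ev[0] for ev in pool}):
        candidates = [ev for ev in pool if ev[0] == source]
        selected.extend(global_top_k(candidates, k_per_source))
    return selected


pool = [
    ("kg", "Player A member of sports team Team X", 0.9),
    ("kg", "Player A position center", 0.8),
    ("kg", "Player A country of citizenship Country Y", 0.7),
    ("text", "Team X won the title", 0.4),
    ("text", "Team X plays in City Z", 0.3),
    ("table", "Season is 2002-03, Games played is 82", 0.2),
]
unified = global_top_k(pool, 3)
separate = per_source_top_k(pool, 1)
print([ev[0] for ev in unified], [ev[0] for ev in separate])
```

With the same budget of three snippets, global ranking spends it on the highest-scoring evidence wherever it comes from, while per-source selection is forced to include lower-scoring items from every source.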

📚 Prerequisite Knowledge

Prerequisites

Retrieval-Augmented Generation (RAG) architecture
Graph Neural Networks (GNNs) for ranking
Entity linking/disambiguation
BART and Llama model families

Key Terms

Structured Intent (SI): A semi-structured frame (key-value pairs) extracted from a natural language question, capturing intent facets like Answer Type, Entities, Time, and Relation

Verbalization: The process of converting structured data (KG triples, table rows) into natural language sentences so they can be processed uniformly with text

Clocq: A specific retrieval method for Knowledge Graphs that fetches relevant subgraphs and disambiguates entities in a single step

GNN (Graph Neural Network): A neural network designed to operate on graph structures; used here to score the relevance of evidence nodes based on their connections to entity nodes

Cross-Encoder: A transformer model that processes two inputs (query and document) simultaneously to output a relevance score, typically more accurate but slower than bi-encoders

BM25: A classical probabilistic information retrieval function used to rank documents based on term frequency and inverse document frequency

CompMix: A benchmark dataset specifically designed for evaluating QA systems that must operate over heterogeneous sources (text, tables, KG)

TimeQuestions: A benchmark dataset focusing on temporal question answering requiring understanding of time points and intervals

BART: A transformer encoder-decoder model used here for the specific sub-task of generating the Structured Intent from the question

DOM-tree: Document Object Model tree—the structural representation of a webpage; used here to extract context labels for table rows