LILaC: Late Interacting in Layered Component Graph for Open-domain Multimodal Multihop Retrieval

📝 Paper Summary

Multimodal Retrieval, Open-domain Question Answering

LILaC structures multimodal documents into a two-layer graph of coarse and fine components, using late interaction to accurately retrieve relevant subgraphs for complex queries without fine-tuning.

Core Problem

Current multimodal retrieval methods either lose visual information by converting everything to text (TextRAG) or retrieve fixed, coarse-grained screenshots (VisRAG) that include irrelevant noise and miss structural connections.

Why it matters:

Fixed-granularity retrieval (e.g., full pages) dilutes the signal of small but crucial details needed for precise answers
Independent retrieval of pages ignores explicit links and semantic relationships, failing at multihop reasoning required for complex open-domain questions
Text-only conversion loses critical visual cues (e.g., specific objects in images) that are essential for answering visual queries

Concrete Example: A query asks about 'minarets' around the Taj Mahal. A TextRAG summarizer might omit the word 'minarets,' causing retrieval failure. A VisRAG retriever pulls the whole page screenshot, where the relevant text is tiny compared to irrelevant content, diluting the embedding match. Furthermore, neither method connects the 'Taj Mahal' page to a 'Shah Jahan' page via hyperlinks to answer a multihop question about who built it.

Key Novelty

Layered Component Graph with Late Interaction (LILaC)

Constructs a graph with two layers: a coarse layer (paragraphs, tables, images) for context and traversing links, and a fine layer (sentences, rows, objects) for precise matching
Traverses this graph starting from coarse candidates, but scores edges dynamically by matching decomposed sub-queries against the fine-grained sub-components inside the nodes (late interaction)
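The two-layer structure described above can be sketched as a small data structure. This is an illustrative sketch only, not the authors' implementation: the class names, the undirected links, and the node ids (borrowed from the Taj Mahal example) are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Component:
    """Coarse-layer node: a paragraph, table, or image (hypothetical schema)."""
    cid: str
    modality: str  # "text", "table", or "image"
    subcomponents: list[str] = field(default_factory=list)  # fine-layer units
    neighbors: set[str] = field(default_factory=set)        # coarse-layer links


class LayeredGraph:
    """Coarse nodes hold their fine subcomponents and link to other coarse nodes."""

    def __init__(self) -> None:
        self.components: dict[str, Component] = {}

    def add_component(self, cid: str, modality: str, subcomponents: list[str]) -> None:
        self.components[cid] = Component(cid, modality, list(subcomponents))

    def link(self, a: str, b: str) -> None:
        # Assumption: links are stored as undirected edges.
        self.components[a].neighbors.add(b)
        self.components[b].neighbors.add(a)


graph = LayeredGraph()
graph.add_component("taj_mahal/p1", "text", ["sentence_1", "sentence_2"])
graph.add_component("taj_mahal/img1", "image", ["minaret", "dome"])
graph.add_component("shah_jahan/p1", "text", ["sentence_3"])
graph.link("taj_mahal/p1", "taj_mahal/img1")   # same-page link
graph.link("taj_mahal/p1", "shah_jahan/p1")    # hyperlink across pages
```

Keeping the fine units inside each coarse node means traversal only ever moves between coarse nodes, while scoring can still look at sentences, rows, and objects.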

Architecture

Overview of LILaC framework: (a) Graph Construction, (b) Query Decomposition, and (c) Late-interaction Retrieval

Evaluation Highlights

Achieves state-of-the-art retrieval accuracy on 5 benchmarks, outperforming VisRAG by 15.75% and ColPali by 11.74% on average MRR@10
Improves Recall@3 on multihop-heavy datasets (MultimodalQA) by ~60% compared to previous VisRAG approaches
Attains SOTA end-to-end QA performance with 52.56% Exact Match average, surpassing the best VisRAG setup by over 18%

Breakthrough Assessment

9/10

Significantly outperforms SOTA (including ColPali) without any training, solving key granularity and multihop issues in multimodal RAG via a clever graph structure.

⚙️ Technical Details

Problem Definition

Setting: Open-domain multimodal document retrieval where documents contain text, tables, and images linked by structural relationships

Inputs: Natural language query Q and a corpus of multimodal documents D

Outputs: Ranked list of multimodal components R relevant to the query

Pipeline Flow

Offline: Layered Component Graph Construction (Document Parsing → Subcomponent Extraction → Edge Creation)
Online: Query Decomposition (Query → Sub-queries + Modalities)
Online: Candidate Search (Coarse retrieval of seed nodes)
Online: Subgraph Retrieval (Beam search traversal with Late Interaction scoring)
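The online traversal step can be illustrated with a generic beam search over coarse nodes. This is a toy sketch under stated assumptions: the additive path score, the `beam_width` and `hops` parameters, and the hand-written edge scores are hypothetical, and the paper's exact scoring and pruning rules may differ.

```python
def beam_search(neighbors, seeds, edge_score, beam_width=2, hops=2):
    """Return the top-scoring paths of coarse nodes reachable from the seeds.

    neighbors:  dict mapping node id -> iterable of linked node ids
    seeds:      dict mapping seed node id -> initial coarse retrieval score
    edge_score: callable (u, v) -> float, e.g. a late-interaction score
    """
    def top(paths):
        ranked = sorted(paths.items(), key=lambda kv: kv[1], reverse=True)
        return dict(ranked[:beam_width])

    beam = top({(node,): score for node, score in seeds.items()})
    for _ in range(hops):
        candidates = dict(beam)  # keep unextended paths so a path may stop early
        for path, score in beam.items():
            for nxt in neighbors.get(path[-1], ()):
                if nxt not in path:  # do not revisit a node on the same path
                    candidates[path + (nxt,)] = score + edge_score(path[-1], nxt)
        beam = top(candidates)
    return list(beam.items())  # sorted best-first


# Toy graph and scores (all values invented for illustration).
neighbors = {"taj": ["shah", "agra"], "shah": ["taj"], "agra": ["taj"]}
seeds = {"taj": 0.9, "agra": 0.2}
toy_scores = {("taj", "shah"): 0.8, ("taj", "agra"): 0.1, ("agra", "taj"): 0.3}

result = beam_search(neighbors, seeds, lambda u, v: toy_scores.get((u, v), 0.0))
best_path = result[0][0]
```

In this toy run the best path follows the link from the seed node to the node it scores highest against, which is the behavior multihop questions rely on.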

System Modules

Layered Graph Constructor

Builds the retrieval index by parsing documents into coarse/fine nodes and connecting them

Model or implementation: SaT (text), an object detector (images), and a table parser

Query Decomposer

Breaks complex queries into modality-specific sub-queries for precise matching

Model or implementation: LLM (e.g., Qwen2.5-72B)
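The decomposer's output can be pictured as a list of sub-queries, each tagged with a target modality. The sketch below assumes the LLM is prompted to return JSON. The JSON schema, the `SubQuery` type, and `parse_decomposition` are hypothetical names for illustration, and the actual prompt and output format used by LILaC are not given in this summary.

```python
import json
from dataclasses import dataclass

MODALITIES = {"text", "table", "image"}


@dataclass(frozen=True)
class SubQuery:
    text: str
    modality: str


def parse_decomposition(llm_output: str) -> list[SubQuery]:
    """Parse a JSON list of {"text", "modality"} objects, dropping invalid items."""
    parsed = []
    for item in json.loads(llm_output):
        modality = str(item.get("modality", "")).lower()
        text = str(item.get("text", "")).strip()
        if text and modality in MODALITIES:
            parsed.append(SubQuery(text, modality))
    return parsed


# Hand-written stand-in for an LLM response (not real model output).
raw = (
    '[{"text": "minarets around the Taj Mahal", "modality": "image"},'
    ' {"text": "who built the Taj Mahal", "modality": "text"},'
    ' {"text": "", "modality": "audio"}]'
)
subqueries = parse_decomposition(raw)
```

Validating the modality tag before retrieval keeps a malformed LLM response from silently routing a sub-query to the wrong kind of subcomponent.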

Subgraph Retriever

Traverses the graph to find relevant component chains

Model or implementation: Pre-trained Multimodal Embedder (e.g., MM-Embed) + Late Interaction Scoring

Novel Architectural Elements

Dual-granularity graph structure explicitly linking coarse components (paragraphs/images) to fine subcomponents (sentences/objects)
Edge-based late interaction scoring that matches decomposed queries against subcomponent sets on-the-fly during graph traversal
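A minimal sketch of edge-based late interaction, assuming a MaxSim-style rule: each sub-query takes its best cosine match among the subcomponents of the edge's two endpoint nodes, and the per-sub-query maxima are summed. The pooling of both endpoints and the toy 2-D vectors are assumptions, and the paper's exact scoring function may differ.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def late_interaction_score(subquery_vecs, subcomponent_vecs):
    """Sum, over sub-queries, of the best cosine match among the subcomponents."""
    if not subcomponent_vecs:
        return 0.0
    return sum(max(cosine(q, s) for s in subcomponent_vecs) for q in subquery_vecs)


def edge_score(subquery_vecs, subs_u, subs_v):
    """Score an edge by pooling the subcomponent embeddings of both endpoints."""
    return late_interaction_score(subquery_vecs, subs_u + subs_v)


# Toy 2-D embeddings standing in for real multimodal embeddings.
subqueries = [[1, 0], [0, 1]]       # e.g. a visual sub-query and a textual one
taj_subs = [[1, 0], [0.6, 0.8]]     # subcomponents of one coarse node
shah_subs = [[0, 1]]                # subcomponents of a linked coarse node

single_node = late_interaction_score(subqueries, taj_subs)
edge = edge_score(subqueries, taj_subs, shah_subs)
```

Here the edge scores higher than the single node because the second sub-query finds its best match in the linked node, which is why scoring edges rather than isolated nodes helps on multihop queries.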

Modeling

Base Model: Qwen2.5-VL 7B (for generation), MM-Embed / UniME / mmE5 (for embeddings)

Training Method: Training-free framework using pre-trained models

Compute: Offline graph construction is embedding-heavy (approximately 2 hours for the full corpus); online retrieval takes ~48 ms for graph traversal plus ~1.4 s for LLM query decomposition

Comparison to Prior Work

vs. VisRAG: Uses dynamic fine-grained graph traversal instead of fixed-granularity dense retrieval
vs. ColPali: Explicitly models structural links (multihop) and uses object-level subcomponents instead of patch-level vectors
vs. TextRAG: Preserves native visual embeddings instead of lossy text summarization
vs. RAPTOR [not cited in paper]: RAPTOR builds hierarchical text summaries for retrieval; LILaC builds a hierarchical multimodal graph with explicit structural links

Limitations

Heavy reliance on the LLM-based query decomposition step: its quality bounds retrieval accuracy, and it is also the main online latency bottleneck
Dependent on the performance of off-the-shelf subcomponent extractors (object detectors, table parsers)
Graph construction is computationally expensive offline due to embedding all subcomponents
End-to-end generation still has room for improvement despite better retrieval

Reproducibility

Code: https://github.com/joohyung00/lilac

The repository is publicly available. The authors promise code, data, and artifacts. The framework uses open-source models (Qwen, MM-Embed).

📊 Experiments & Results

Evaluation Setup

Open-domain multimodal retrieval and QA

Benchmarks:

MP-DocVQA (Industrial document VQA)
SlideVQA (Presentation slide VQA, multihop)
InfoVQA (Infographic VQA)
MultimodalQA (Webpage retrieval, multihop)
MMCoQA (Conversational multimodal QA)

Metrics:

Recall@3 (R@3)
Mean Reciprocal Rank at 10 (MRR@10)
Exact Match (EM)
F1 Score
Statistical methodology: Not explicitly reported in the paper
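For reference, the four metrics can be computed as in the sketch below. These are the common textbook definitions. The whitespace-and-lowercase normalization is an assumption, since the benchmarks' official scripts may also strip punctuation or articles.

```python
from collections import Counter


def recall_at_k(ranked, gold, k=3):
    """Fraction of gold items that appear among the top-k retrieved items."""
    gold = set(gold)
    if not gold:
        return 0.0
    return len(set(ranked[:k]) & gold) / len(gold)


def mrr_at_k(ranked, gold, k=10):
    """Reciprocal rank of the first gold item within the top k, else 0."""
    gold = set(gold)
    for rank, item in enumerate(ranked[:k], start=1):
        if item in gold:
            return 1.0 / rank
    return 0.0


def normalize(text):
    return " ".join(text.lower().split())


def exact_match(prediction, answer):
    return float(normalize(prediction) == normalize(answer))


def token_f1(prediction, answer):
    """Harmonic mean of token-level precision and recall."""
    pred, gold = normalize(prediction).split(), normalize(answer).split()
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


r3 = recall_at_k(["a", "b", "c", "d"], ["b", "d"], k=3)  # one of two gold items in top 3
mrr = mrr_at_k(["a", "b", "c"], ["b"])                   # first hit at rank 2
em = exact_match("Shah  Jahan", "shah jahan")
f1 = token_f1("shah jahan", "emperor shah jahan")
```

Per-query values like these are averaged over the benchmark to give the reported scores.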

Key Results

Retrieval accuracy comparison showing LILaC (with MM-Embed) outperforming state-of-the-art baselines across all benchmarks.

Benchmark	Metric	Baseline	This Paper	Δ
MP-DocVQA	MRR@10	75.55	78.75	+3.20
MultimodalQA	Recall@3	58.73	69.07	+10.34
MMCoQA	MRR@10	41.45	50.77	+9.32

End-to-end generation results utilizing the retrieved components.

Benchmark	Metric	Baseline	This Paper	Δ
MultimodalQA	Exact Match (EM)	22.24	44.57	+22.33
MP-DocVQA	Exact Match (EM)	64.46	65.48	+1.02

Experiment Figures

Runtime comparison and breakdown. (a) Total execution time vs baselines. (b) LILaC internal breakdown.

Main Takeaways

Consistent SOTA performance across 5 diverse benchmarks without any fine-tuning, validating the robustness of the graph-based approach.
Massive gains in multihop datasets (MultimodalQA, MMCoQA) confirm the effectiveness of explicitly modeling structural links between components.
Ablation studies show that both the Layered Component Graph and Query Decomposition are essential; removing the graph structure drops performance significantly.
Outperforms computationally heavier 'multi-vector' approaches like ColPali in retrieval accuracy while being faster in the retrieval step (though slower in preprocessing).

📚 Prerequisite Knowledge

Prerequisites

Basic understanding of Retrieval-Augmented Generation (RAG)
Knowledge of graph data structures (nodes, edges)
Familiarity with multimodal embeddings (CLIP, SigLIP, etc.)

Key Terms

TextRAG: Retrieval methods that convert all visual content (images, tables) into text summaries before retrieval, often losing visual details

VisRAG: Retrieval methods that treat document pages as images (screenshots) and use vision-language models for embedding and retrieval

Late Interaction: A scoring mechanism where query terms interact with document terms (or sub-components) individually at retrieval time, rather than collapsing everything into a single vector

Layered Component Graph: A graph structure introduced by this paper with two levels: coarse nodes (whole images/paragraphs) and fine nodes (objects/sentences), linked by containment and semantic edges

Multihop reasoning: The ability to answer questions by connecting information from multiple distinct documents or components

Subcomponent: A finer-grained unit of information extracted from a larger component, such as a sentence from a paragraph, a row from a table, or a visual object from an image

ColPali: A state-of-the-art vision-language retrieval model that uses late interaction on multi-vector page embeddings