TextRAG: Retrieval methods that convert all visual content (images, tables) into text summaries before retrieval, often losing visual details
VisRAG: Retrieval methods that treat document pages as images (screenshots) and use vision-language models for embedding and retrieval
Late Interaction: A scoring mechanism where query terms interact with document terms (or sub-components) individually at retrieval time, rather than collapsing everything into a single vector
Layered Component Graph: A graph structure introduced by this paper with two levels: coarse nodes (whole images/paragraphs) and fine nodes (objects/sentences), linked by containment and semantic edges
Multihop reasoning: The ability to answer questions by connecting information from multiple distinct documents or components
Subcomponent: A finer-grained unit of information extracted from a larger component, such as a sentence from a paragraph, a row from a table, or a visual object from an image
ColPali: A state-of-the-art vision-language retrieval model that uses late interaction on multi-vector page embeddings
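
To make the Late Interaction entry concrete, here is a minimal sketch of the sum-of-max ("MaxSim") scoring commonly used by ColBERT-style multi-vector retrievers. It is an illustration under stated assumptions, not the exact scoring code of ColPali or of this paper: the function names, the 2-dimensional toy vectors, and the mean-pooled baseline are all hypothetical and chosen only to show why keeping per-token vectors can separate documents that a single pooled vector cannot.

```python
import math


def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vecs, doc_vecs):
    """Late interaction: for each query vector, take its best cosine match
    among the document vectors, then sum those maxima over the query."""
    q = [normalize(v) for v in query_vecs]
    d = [normalize(v) for v in doc_vecs]
    return sum(max(dot(qv, dv) for dv in d) for qv in q)


def pooled_score(query_vecs, doc_vecs):
    """Single-vector baseline (illustrative assumption): mean-pool each side
    into one vector and compare the two with cosine similarity."""
    def mean(vecs):
        return [sum(col) / len(vecs) for col in zip(*vecs)]
    return dot(normalize(mean(query_vecs)), normalize(mean(doc_vecs)))


# Toy data (hypothetical): two query token vectors, two candidate documents.
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0]]   # has a close match for each query vector
doc_b = [[1.0, 1.0], [1.0, 1.0]]   # only "average" content, no exact match

print("late interaction:", maxsim_score(query, doc_a), maxsim_score(query, doc_b))
print("pooled baseline: ", pooled_score(query, doc_a), pooled_score(query, doc_b))
```

In this toy setup the pooled baseline scores both documents essentially the same, while the late-interaction score ranks doc_a above doc_b because each query vector finds its own best match. In a real system the vectors would come from a trained encoder such as a vision-language model producing per-patch page embeddings.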