Multimodal Data Lake: A centralized repository that stores data in various formats (text, tables, images) at scale
RAG: Retrieval-Augmented Generation—AI systems that answer questions by first retrieving relevant documents and then generating an answer grounded in them
NL2SQL: Natural Language to SQL—converting human language questions into database queries
TabFact: A benchmark dataset for verifying factual claims based on tabular data
PASTA: A pre-trained model designed specifically for table-based fact verification tasks
Recall@K: The fraction of all relevant items that appear among the top-K retrieved results
TF-IDF: Term Frequency-Inverse Document Frequency—a statistical measure used to evaluate how important a word is to a document in a collection
BM25: Best Matching 25—a ranking function used by search engines to estimate the relevance of documents to a given search query
CLIP: Contrastive Language-Image Pre-training—a model that learns to associate images with text captions
Embedding: A dense vector representation of data (text, image, etc.) where similar items are close in vector space
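
To make the Recall@K entry concrete, here is a minimal sketch of the common set-based definition. The function name recall_at_k and the example document IDs are hypothetical, chosen only for illustration. Some papers instead report a hit-rate variant (1 if any relevant item appears in the top K, else 0); the glossary does not say which variant is meant.

```python
def recall_at_k(retrieved, relevant, k):
    """Return the fraction of relevant items found in the top-k retrieved results.

    retrieved: list of item IDs, ordered from most to least relevant.
    relevant:  collection of item IDs judged relevant (ground truth).
    """
    relevant = set(relevant)
    if not relevant:
        raise ValueError("relevant must contain at least one item")
    top_k = set(retrieved[:k])
    return len(top_k & relevant) / len(relevant)


# Hypothetical ranking and ground truth, for illustration only.
ranked = ["doc3", "doc7", "doc1", "doc9", "doc4"]
gold = {"doc1", "doc4", "doc8"}

print(recall_at_k(ranked, gold, 3))  # only doc1 is in the top 3
print(recall_at_k(ranked, gold, 5))  # doc1 and doc4 are in the top 5
```

Because the denominator is the total number of relevant items rather than K, the score can only stay the same or rise as K grows, and it reaches 1.0 only when every relevant item has been retrieved.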