MM-RAG: Multi-Modal Retrieval-Augmented Generation, i.e., systems that answer questions about images by retrieving external text or structured data
Egocentric images: First-person perspective images captured by wearable devices like smart glasses
Hallucination: When a model generates factually incorrect information not supported by the retrieved context or image
Mock API: Simulated search interfaces provided by the benchmark to access the knowledge graph and webpage corpus
Simple-recognition: Questions answerable directly from the image (e.g., a brand name visible on a product)
Simple-knowledge: Questions requiring external facts (e.g., the price of a product shown in the image)
Torso-to-tail entities: Entities with medium (torso) to low (tail) popularity, which are harder for models to recognize than popular (head) entities
LLM-as-a-judge: Using a strong language model to evaluate the correctness of answers generated by other models