MemexQA: Visual Memex QA

📝 Paper Summary

Visual Question Answering (VQA), Personal Photo/Video Retrieval, Memory Recall, Multimodal Reasoning

MemexQA introduces a task and architecture for answering natural language questions about personal photo collections by reasoning across multiple images, timestamps, and metadata simultaneously.

Core Problem

Standard VQA answers questions about single images, but personal memory recall requires reasoning over large, dynamic collections of photos/videos and multimodal metadata (time, GPS, titles).

Why it matters:

People accumulate thousands of photos/videos and use them to recover memories, but manual searching is tedious
Current VQA models lack the ability to localize answers across multiple media documents or leverage collective reasoning
Personal assistants (Siri, Alexa) need to answer questions like 'When did we last go hiking?' which requires understanding the user's past data

Concrete Example: A user asks 'Where did we last see the fireworks?'. A standard VQA model looking at one photo of fireworks might just say 'sky'. MemexQA must find the specific fireworks photo with the latest timestamp (Dec 11, 2010) and use its GPS metadata to answer 'Maryland'.
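The reasoning in this example amounts to a concept filter, a sort by timestamp, and a metadata lookup. The sketch below spells that out on a hand-made toy album. The `Photo` record, its field names, and every sample value other than the Dec 11, 2010 / Maryland photo are hypothetical, and MemexNet learns this behavior from data rather than hard-coding it.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Photo:
    concepts: set   # visual concepts detected in the photo (hypothetical field)
    taken: date     # timestamp metadata
    place: str      # place name derived from GPS metadata


def last_place_seen(album, concept):
    """Return the place of the most recent photo showing `concept`, or None."""
    matches = [p for p in album if concept in p.concepts]
    if not matches:
        return None
    return max(matches, key=lambda p: p.taken).place


# Toy album: only the second entry comes from the example in the text.
album = [
    Photo({"fireworks", "crowd"}, date(2009, 7, 4), "Pennsylvania"),
    Photo({"fireworks"}, date(2010, 12, 11), "Maryland"),
    Photo({"beach"}, date(2011, 6, 1), "Florida"),
]

print(last_place_seen(album, "fireworks"))  # Maryland
```

A single-image VQA model sees none of the `taken` or `place` fields, which is why it can only describe what is visible in one frame.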

Key Novelty

MemexNet with MultiModal Lookup Network (MMLookupNet)

Treats QA as a retrieval-then-inference problem over a dynamic collection, rather than just visual analysis of a single static image
Uses a unified end-to-end architecture that learns to attend to specific retrieved samples and specific modalities (time vs. text vs. image content) based on question type

Architecture

The overall framework of MemexNet

Evaluation Highlights

MemexNet achieves 48.4% overall accuracy on MemexQA, outperforming strong LSTM baselines (39.0%) and attention models (43.3%)
Human performance is significantly higher (92.7% with all data), indicating the task remains challenging for AI
Adapts to TextQA (SQuAD) yielding 0.767 F1, comparable to specialized text-only QA models like BiDAF (0.760)

Breakthrough Assessment

7/10

Pioneered the 'VQA over collections' task with a large dataset and a novel multimodal architecture. While accuracy is low compared to humans, it established a necessary extension of VQA to real-world personal data.

⚙️ Technical Details

Problem Definition

Setting: Given a collection of photos/videos X from a user and a natural language question Q, select answer y from candidate set

Inputs: Natural language question Q, collection of photos/videos X (images + timestamp + GPS + titles)

Outputs: Predicted answer y (classification over answer space)

Pipeline Flow

Question Understanding (Embeds Q into vector)
Content Search Engine (Retrieves top-k relevant photos/videos)
Answer Inference (MMLookupNet processes retrieved items to predict answer)
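The three stages above compose as a simple function chain. The following is a minimal sketch of that control flow only: `answer_question` and the three `toy_*` stand-ins are hypothetical names, and the stand-ins use word overlap and field lookup in place of the learned encoders, the search engine, and MMLookupNet.

```python
from typing import Any, Callable, Sequence


def answer_question(
    question: str,
    collection: Sequence[dict],
    encode: Callable[..., Any],
    search: Callable[..., Any],
    infer: Callable[..., Any],
    top_k: int = 2,
) -> str:
    query = encode(question)                      # 1. question understanding
    retrieved = search(query, collection, top_k)  # 2. content search
    return infer(question, retrieved)             # 3. answer inference


def toy_encode(question):
    # Stand-in for the learned encoders: a bag of lowercase words.
    return set(question.lower().strip("?").split())


def toy_search(query, collection, k):
    # Stand-in for the search engine: rank by word overlap with the title.
    def overlap(doc):
        return len(query & set(doc["title"].lower().split()))
    return sorted(collection, key=lambda d: -overlap(d))[:k]


def toy_infer(question, retrieved):
    # Stand-in for MMLookupNet + classifier: read one field off the top hit.
    field = "place" if question.lower().startswith("where") else "date"
    return retrieved[0][field]


collection = [
    {"title": "fireworks at the harbor", "place": "Maryland", "date": "2010-12-11"},
    {"title": "hiking trip", "place": "Colorado", "date": "2011-05-02"},
]

print(answer_question("Where did we see fireworks?", collection,
                      toy_encode, toy_search, toy_infer))  # Maryland
```

Passing the stages in as callables keeps the sketch agnostic about how each one is implemented, which mirrors the summary's point that the search engine is an off-the-shelf component.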

System Modules

Question Encoder (Question Understanding)

Encodes question content and category (e.g., is it asking for a date or place?)

Model or implementation: LSTM network

Query Encoder (Question Understanding)

Maps question to a query vector for the search engine

Model or implementation: SkipGram / VQE (Visual Query Embedding)

Content Search Engine

Retrieves relevant samples from the user's collection

Model or implementation: Off-the-shelf retrieval (E-Lamp Lite)
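The summary does not describe how E-Lamp Lite ranks items, so the sketch below illustrates only the generic idea of embedding-based top-k retrieval with cosine similarity. It is not the E-Lamp Lite API. The function names, file names, and vectors are hypothetical; `k=2` follows the `top_k_retrieved` value listed under Key Hyperparameters.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k(query_vec, item_vecs, k=2):
    """Return the ids of the k items most similar to the query, best first."""
    ranked = sorted(item_vecs,
                    key=lambda item_id: cosine(query_vec, item_vecs[item_id]),
                    reverse=True)
    return ranked[:k]


# Hypothetical 3-dimensional embeddings for three photos.
items = {
    "fireworks.jpg": [0.9, 0.1, 0.0],
    "hike.jpg": [0.0, 0.2, 0.9],
    "parade.jpg": [0.7, 0.6, 0.1],
}
query = [1.0, 0.2, 0.0]

print(top_k(query, items))  # ['fireworks.jpg', 'parade.jpg']
```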

MMLookupNet (Answer Inference)

Fuses multimodal information from retrieved samples and attends to relevant rank positions/modalities

Model or implementation: Custom neural network with modality embeddings and attention-based LSTM

Classifier (Answer Inference)

Predicts the final answer class

Model or implementation: Fully connected layers (Softmax)

Novel Architectural Elements

MMLookupNet: A specialized module that learns embeddings for different modalities (time, GPS, text) and matches them against answer candidates, integrating a 'pilot answer' mechanism to gate non-matching answers
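To make "attending to rank positions and modalities" concrete, here is a minimal sketch of soft attention over (rank, modality) slots. It assumes dot-product scoring and a plain softmax, which the summary does not confirm, and it leaves out the 'pilot answer' gating entirely. All names and vectors are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(question_vec, slots):
    """slots maps (rank, modality) -> embedding.

    Returns the attention weight per slot and the weighted sum of embeddings.
    """
    keys = list(slots)
    scores = [sum(q * x for q, x in zip(question_vec, slots[key])) for key in keys]
    weights = softmax(scores)
    dim = len(question_vec)
    fused = [sum(w * slots[key][d] for w, key in zip(weights, keys))
             for d in range(dim)]
    return dict(zip(keys, weights)), fused


# Hypothetical 2-d embeddings: axis 0 ~ "time-like", axis 1 ~ "place-like".
slots = {
    (1, "time"): [1.0, 0.0],
    (1, "gps"): [0.0, 1.0],
    (2, "time"): [0.5, 0.0],
    (2, "gps"): [0.0, 0.5],
}
when_question = [2.0, 0.0]  # a question vector aligned with the time axis

weights, fused = attend(when_question, slots)
print(max(weights, key=weights.get))  # (1, 'time')
```

In this toy setup a "when" question puts most of its weight on the time slot of the top-ranked item, which is the kind of question-dependent selection the module is described as learning.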

Modeling

Base Model: Custom architecture (MemexNet) using ResNet visual features and LSTM text encoders

Training Method: End-to-end supervised learning (classification)

Objective Functions:

Purpose: Minimize classification error on the correct answer.

Formally: Softmax cross-entropy loss.
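For reference, the standard form of this loss can be written as follows. The symbols are assumptions introduced here, not notation from the paper: N is the number of training QAs, C the number of answer classes, z^{(i)}_c the classifier's score (logit) for class c on example i, y_i the index of the correct answer, and θ the trainable parameters.

```latex
\mathcal{L}(\theta) = -\frac{1}{N} \sum_{i=1}^{N}
  \log \frac{\exp\big(z^{(i)}_{y_i}\big)}{\sum_{c=1}^{C} \exp\big(z^{(i)}_{c}\big)}
```

Minimizing this quantity raises the softmax probability assigned to the correct answer class for each question.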

Trainable Parameters: Question LSTM weights, MMLookupNet embeddings and attention weights, Classification layers

Training Data:

13,591 personal photos from 101 Flickr users
20,860 question-answer pairs crowdsourced via AMT
Split: 14,156 training QAs, 3,539 test QAs (the remaining 3,165 of the 20,860 QAs are not broken out here)

Key Hyperparameters:

top_k_retrieved: 2
image_feature_dim: 300

Compute: Inference takes 1.3 seconds per question over 800K videos on a single core of an Intel Xeon 2.53GHz CPU

Comparison to Prior Work

vs. VQA/Visual7W: MemexQA requires reasoning over a *collection* of images + metadata, not just one image
vs. Memory Networks: MemexNet handles multimodal data (images, GPS, time) rather than just text facts
vs. standard Retrieval-QA: MemexNet learns an end-to-end representation (MMLookupNet) that fuses retrieval results with question context rather than just pipelining them

Limitations

Assumes all questions are answerable by public/objective information in photos (privacy/subjectivity constraint)
Relies on off-the-shelf search engines; retrieval module is not fine-tuned end-to-end
Significant gap between human performance (92.7%) and model performance (48.4%)
Dataset size (20k QAs) is relatively small compared to generic VQA datasets

Reproducibility

Code: https://memexqa.cs.cmu.edu/

Dataset available at https://memexqa.cs.cmu.edu/. Search engine (E-Lamp Lite) and visual features (ResNet) are standard/public. Code link provided in paper.

📊 Experiments & Results

Evaluation Setup

Closed-set classification over frequent answers (top 7,236 answers cover 99% of choices)
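A closed answer set like this is typically built by keeping the most frequent answers until a target share of occurrences is covered. The sketch below shows one plausible way to do that; the paper's exact selection procedure is not described here, and the function name and toy answers are hypothetical.

```python
from collections import Counter


def frequent_answers(answers, coverage=0.99):
    """Most-frequent-first answer list covering at least `coverage` of occurrences."""
    counts = Counter(answers)
    total = sum(counts.values())
    vocab, covered = [], 0
    for answer, n in counts.most_common():
        if covered >= coverage * total:
            break
        vocab.append(answer)
        covered += n
    return vocab


# Toy data: 10 answer occurrences across 3 distinct answers.
toy_answers = ["maryland"] * 6 + ["2010"] * 3 + ["dog"]

print(frequent_answers(toy_answers, coverage=0.85))  # ['maryland', '2010']
```

Answers that fall outside the returned list cannot be predicted by the classifier, so the coverage level puts a ceiling on achievable accuracy.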

Benchmarks:

MemexQA Dataset (Multimodal QA over Photo Albums) [New]
SQuAD (Text QA, machine comprehension)
YFCC100M subset (large-scale VideoQA)

Metrics:

Accuracy (for MemexQA)
F1 Score (for SQuAD)
Statistical methodology: Differences are reported as statistically significant, but no p-values are stated in the text
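As a reminder of what the two metrics measure, the sketch below computes exact-match accuracy over a batch and a token-overlap F1 for a single prediction. It is a simplified illustration: the official SQuAD evaluation also normalizes answers (for example by stripping punctuation and articles) and takes the maximum over multiple reference answers, both of which are omitted here. Function names and sample strings are hypothetical.

```python
from collections import Counter


def accuracy(predictions, golds):
    """Fraction of predictions that exactly match the gold answers."""
    return sum(p == g for p, g in zip(predictions, golds)) / len(golds)


def token_f1(prediction, gold):
    """Harmonic mean of token precision and recall (whitespace tokens)."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


print(accuracy(["maryland", "2010", "dog", "beach"],
               ["maryland", "2010", "cat", "beach"]))  # 0.75
print(round(token_f1("in Maryland", "Maryland"), 3))   # 0.667
```

Accuracy suits MemexQA because answers are chosen from a closed set, while F1 gives partial credit for free-form text spans.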

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
MemexQA	Overall Accuracy	0.390 (LSTM)	0.484	+0.094
MemexQA	Overall Accuracy	0.433 (attention)	0.484	+0.051
MemexQA	Overall Accuracy	0.418	0.484	+0.066
SQuAD (Text QA)	F1	0.760 (BiDAF)	0.767	+0.007
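The Δ column is the absolute difference between the two scores. The short check below recomputes it from the table values and also derives the relative improvement, which the table does not report. The variable and function names are hypothetical.

```python
# (benchmark, baseline score, this paper's score), copied from the table above.
rows = [
    ("MemexQA", 0.390, 0.484),
    ("MemexQA", 0.433, 0.484),
    ("MemexQA", 0.418, 0.484),
    ("SQuAD", 0.760, 0.767),
]


def absolute_delta(baseline, ours):
    """Absolute gain, rounded to three decimals as in the table."""
    return round(ours - baseline, 3)


def relative_gain(baseline, ours):
    """Gain as a fraction of the baseline score."""
    return (ours - baseline) / baseline


for name, baseline, ours in rows:
    print(f"{name}: +{absolute_delta(baseline, ours):.3f} absolute, "
          f"{relative_gain(baseline, ours):.1%} relative")
```

Reading the gains in relative terms makes it clearer that the MemexQA improvements are substantial while the SQuAD result is closer to parity with the baseline.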

Experiment Figures

Detailed schematic of the MMLookupNet module

Main Takeaways

MemexNet consistently outperforms strong VQA baselines (LSTM, Attention, Multi-channel) on the MemexQA task
The 'when' and 'what' question types see the largest gains from MMLookupNet, which suggests that fusing time and concept metadata is what drives the improvement
Human evaluation shows a massive gap (92.7% vs 48.4%), highlighting that collective multimodal reasoning is far from solved
Scales to large video collections (YFCC100M), processing queries in ~1.3 seconds over 800K videos

📚 Prerequisite Knowledge

Prerequisites

Visual Question Answering (VQA) basics
Recurrent Neural Networks (LSTMs)
Information Retrieval concepts (embedding-based search)
Multimodal fusion strategies

Key Terms

Memex: A hypothetical device described by Vannevar Bush in 1945 for storing and quickly consulting all of an individual's information and memories

MMLookupNet: MultiModal Lookup Network—the core module in MemexNet that represents retrieved samples by fusing embeddings of their various modalities (time, location, vision)

VQE: Visual Query Embedding—a method to map natural language queries into a visual concept space for retrieval

SkipGram: A word embedding model (Word2Vec) used here to encode questions and metadata text

BiDAF: Bi-Directional Attention Flow—a specific neural network architecture originally designed for text comprehension, used here as a comparison point

SIND: Sequential Image Narrative Dataset, a visual storytelling dataset and the source of the Flickr albums used to construct the MemexQA dataset