Evaluation Setup
Closed-set classification over frequent answers (the 7,236 most frequent answers cover 99% of choices)
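
A closed answer set of this kind is typically built by ranking answers by frequency and keeping the smallest prefix that reaches a target coverage. The sketch below illustrates that idea only; the function name `build_answer_vocab`, the toy answers, and the 0.9 threshold used for the toy data are assumptions for illustration, not the paper's actual procedure or data.

```python
from collections import Counter

def build_answer_vocab(answers, coverage=0.99):
    """Return the most frequent answers, stopping once their
    cumulative share of all answers reaches `coverage`."""
    counts = Counter(answers)
    total = sum(counts.values())
    vocab, covered = [], 0
    for answer, n in counts.most_common():
        if covered / total >= coverage:
            break
        vocab.append(answer)
        covered += n
    return vocab

# Toy data (hypothetical): 100 answers with a long-tailed distribution.
answers = ["yes"] * 50 + ["no"] * 30 + ["2015"] * 15 + ["tokyo"] * 3 + ["blue"] * 2

# A lower threshold than 99% is used here so the toy example drops the tail.
vocab = build_answer_vocab(answers, coverage=0.9)

# Each kept answer becomes one class index for the classifier.
answer_to_class = {a: i for i, a in enumerate(vocab)}
print(vocab)
```

Answers outside the vocabulary cannot be predicted by a closed-set classifier, which is why the coverage figure matters: it bounds the best accuracy the classifier can reach.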
Benchmarks:
- MemexQA Dataset (multimodal QA over photo albums) [New]
- SQuAD (text QA, machine comprehension)
- YFCC100M subset (large-scale VideoQA)
Metrics:
- Accuracy (for MemexQA)
- F1 Score (for SQuAD)
- Statistical methodology: differences are reported as statistically significant, but the text does not state p-values
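
For reference, the two metrics above can be sketched as follows. This is a simplified illustration under stated assumptions: the function names are hypothetical, and `token_f1` only lowercases and splits on whitespace, whereas the official SQuAD evaluation also strips punctuation and articles and takes the maximum over several reference answers.

```python
from collections import Counter

def accuracy(predictions, references):
    """Fraction of predictions that exactly match their reference."""
    if not references:
        return 0.0
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)

def token_f1(prediction, reference):
    """Token-overlap F1 between one predicted and one reference answer
    (simplified normalization: lowercase + whitespace split)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# Hypothetical examples, not taken from either dataset.
acc = accuracy(["a", "b", "c", "d"], ["a", "b", "x", "d"])
f1 = token_f1("the Eiffel Tower", "Eiffel Tower")
print(acc, f1)
```

Accuracy suits the closed-set MemexQA setting, where each answer is a single class; token-level F1 gives partial credit for span answers that overlap the reference, as in SQuAD.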
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| MemexQA | Overall Accuracy | 0.390 | 0.484 | +0.094 |
| MemexQA | Overall Accuracy | 0.433 | 0.484 | +0.051 |
| MemexQA | Overall Accuracy | 0.418 | 0.484 | +0.066 |
| SQuAD (Text QA) | F1 | 0.760 | 0.767 | +0.007 |
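
The Δ column is the absolute difference between the paper's score and each baseline. The short sketch below recomputes it from the reported values and adds the relative improvement for context; the relative figures are derived here and are not reported in the source, and the variable names are illustrative.

```python
# (benchmark, metric, baseline score, this paper's score), as reported above.
rows = [
    ("MemexQA", "Overall Accuracy", 0.390, 0.484),
    ("MemexQA", "Overall Accuracy", 0.433, 0.484),
    ("MemexQA", "Overall Accuracy", 0.418, 0.484),
    ("SQuAD (Text QA)", "F1", 0.760, 0.767),
]

deltas = []
for benchmark, metric, baseline, ours in rows:
    delta = round(ours - baseline, 3)   # absolute gain, as in the Δ column
    relative = delta / baseline         # derived here, not from the source
    deltas.append(delta)
    print(f"{benchmark:16s} {metric:16s} {delta:+.3f} ({relative:+.1%} relative)")
```

Seen this way, the MemexQA gains are large relative to their baselines, while the SQuAD gain is under one F1 point.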
Main Takeaways
- MemexNet consistently outperforms strong VQA baselines (LSTM, Attention, Multi-channel) on the MemexQA task
- The 'when' and 'what' question types see the largest gains from MMLookupNet, suggesting that fusing time and concept metadata is valuable
- Human evaluation shows a large gap (92.7% human accuracy vs 48.4% for the model), indicating that collective multimodal reasoning is far from solved
- Scales to large video collections (YFCC100M), answering queries over 800k videos in ~1.3 seconds