Evaluation Setup
RAP is evaluated on three personalized tasks (captioning, question answering, and visual recognition) using a held-out test set derived from CustomConcept101 and other sources.
Benchmarks:
- CustomConcept101 (Test Split) (Personalized Image Captioning)
- CustomConcept101 (QA Split) (Personalized Question Answering) [New]
- Visual Recognition Benchmark (Identity Recognition / Grounding) [New]
Metrics:
- CIDEr (Captioning)
- BLEU-4 (Captioning)
- METEOR (Captioning)
- Accuracy (QA, Recognition)
- Statistical methodology: Not explicitly reported in the paper
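As a rough illustration of the accuracy metric used for QA and recognition, the sketch below scores predictions by exact match against reference answers. The normalization (lowercasing and trimming whitespace) and the function name `accuracy` are assumptions made for this example; the paper's exact matching rule is not stated here, and the toy answers are not data from the paper.

```python
def accuracy(predictions, references):
    """Percentage of predictions that match their reference answer exactly."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("cannot compute accuracy on an empty set")
    # Assumption: compare after lowercasing and stripping surrounding whitespace.
    correct = sum(
        p.strip().lower() == r.strip().lower()
        for p, r in zip(predictions, references)
    )
    return 100.0 * correct / len(references)


# Toy answers for illustration only (not taken from the paper).
toy_predictions = ["yes", "No", "yes", "no"]
toy_references = ["yes", "no", "no", "no"]
toy_score = accuracy(toy_predictions, toy_references)
```

Three of the four toy answers match, so `toy_score` is a percentage on the same 0 to 100 scale as the accuracy figures reported for the QA benchmark.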
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Personalized Image Captioning results comparing RAP against fine-tuning baselines on CustomConcept101. |
| CustomConcept101 | CIDEr | 76.8 | 84.1 | +7.3 |
| CustomConcept101 | CIDEr | 73.5 | 84.1 | +10.6 |
| CustomConcept101 | BLEU-4 | 46.5 | 49.2 | +2.7 |
| Personalized Question Answering performance showing improvements over baselines. |
| Personalized VQA (Custom) | Accuracy | 53.4 | 61.5 | +8.1 |
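The Δ column is the absolute difference between the "This Paper" and "Baseline" scores. A minimal sketch that recomputes it from the table values is shown below; the variable and function names are chosen for this example, and rounding to one decimal place is an assumption that mirrors the precision of the reported numbers.

```python
# (benchmark, metric, baseline, this_paper, reported_delta), copied from the table.
rows = [
    ("CustomConcept101", "CIDEr", 76.8, 84.1, 7.3),
    ("CustomConcept101", "CIDEr", 73.5, 84.1, 10.6),
    ("CustomConcept101", "BLEU-4", 46.5, 49.2, 2.7),
    ("Personalized VQA (Custom)", "Accuracy", 53.4, 61.5, 8.1),
]


def delta(baseline, ours):
    """Absolute improvement, rounded to the table's one-decimal precision."""
    return round(ours - baseline, 1)


# Collect any row whose recomputed delta disagrees with the reported one.
mismatches = [row for row in rows if delta(row[2], row[3]) != row[4]]
```

An empty `mismatches` list means every reported Δ agrees with the difference of its two scores.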
Main Takeaways
- RAP outperforms fine-tuning-based methods (MyVLM, Yo'LLaVA) across captioning and QA metrics, even though it does not update model parameters for new concepts.
- The method demonstrates strong 'few-shot' (1-shot) capability, effectively identifying concepts from a single reference image.
- Real-time editing is possible: users can change a concept's name in the database and the model updates its output immediately (demonstrated qualitatively in Table 12).
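The real-time editing behaviour follows from keeping concept information in an external database rather than in model weights. The sketch below illustrates the idea under stated assumptions: `ConceptDatabase`, `ConceptRecord`, `build_prompt`, the prompt layout, and the toy concept are all hypothetical names invented for this example and are not the paper's implementation. The retrieval step is stubbed out by passing the retrieved concept IDs directly.

```python
from dataclasses import dataclass


@dataclass
class ConceptRecord:
    name: str
    description: str


class ConceptDatabase:
    """Hypothetical in-memory store of user concepts, keyed by concept ID."""

    def __init__(self):
        self._records = {}

    def add(self, concept_id, name, description):
        self._records[concept_id] = ConceptRecord(name, description)

    def rename(self, concept_id, new_name):
        # Editing the record is the only step; no model parameters change.
        self._records[concept_id].name = new_name

    def get(self, concept_id):
        return self._records[concept_id]


def build_prompt(db, retrieved_ids, question):
    """Prepend the retrieved concept entries to the question (assumed layout)."""
    context = []
    for concept_id in retrieved_ids:
        record = db.get(concept_id)
        context.append(f"<{record.name}>: {record.description}")
    return "\n".join(context + [question])


# Toy concept for illustration only.
db = ConceptDatabase()
db.add("c1", "toy1", "a plush toy")

prompt_before = build_prompt(db, ["c1"], "Describe the image.")
db.rename("c1", "Bobo")
prompt_after = build_prompt(db, ["c1"], "Describe the image.")
```

Because the prompt is rebuilt from the database on every query, `prompt_after` carries the new name immediately, which is the property the takeaway describes.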