
LLaVA-Phi: Efficient Multi-Modal Assistant with Small Language Model

Yichen Zhu, Minjie Zhu, Ning Liu, Zhiyuan Xu, Yaxin Peng
Proceedings of the 1st International Workshop on Efficient Multimedia Computing under Limited (2024)
MM · Reasoning · Benchmark · Pretraining

📝 Paper Summary

Small Vision-Language Models · Efficient Multi-Modal Learning
LLaVA-Phi is a compact 3B-parameter multi-modal model that pairs the Phi-2 small language model with the LLaVA visual instruction tuning recipe, matching or outperforming several larger 7B+ models on standard benchmarks.
Core Problem
Existing high-performance vision-language models (VLMs) rely on large language models (7B+ parameters), making them too slow and computationally expensive for real-time applications on edge devices like mobile phones and robots.
Why it matters:
  • Time-sensitive applications such as autonomous driving and robotics require real-time interaction speeds that large models cannot provide
  • Deployment on edge devices such as smartphones is restricted by the memory and compute requirements of 7B+ parameter models
  • Proprietary small models such as Gemini-Nano are closed-source, hindering open research into efficient multi-modal systems
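To make the memory constraint concrete, here is a rough back-of-the-envelope sketch of weight storage for the backbone sizes mentioned above. It counts language-model weights only and ignores activations, the KV cache, and the vision encoder. The helper name `weight_memory_gb` and the choice of precisions are illustrative assumptions, not something taken from the paper.

```python
# Rough weight-only memory estimate for a language backbone.
# Assumption: storage = parameter count x bytes per parameter; nothing else is counted.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params: float, dtype: str = "fp16") -> float:
    """Return approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    # Parameter counts as quoted in this summary (Phi-2 at 2.7B, baselines at 7B and 13B).
    backbones = {"Phi-2 (2.7B)": 2.7e9, "7B backbone": 7e9, "13B backbone": 13e9}
    for name, n in backbones.items():
        fp16 = weight_memory_gb(n, "fp16")
        int4 = weight_memory_gb(n, "int4")
        print(f"{name}: {fp16:.1f} GB in fp16, {int4:.1f} GB in int4")
```

Under these assumptions, a 2.7B backbone needs roughly 5.4 GB for fp16 weights versus about 14 GB for a 7B backbone, which is the kind of gap that decides whether a model fits on a phone-class device at all.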
Concrete Example: When asked to write Python code to plot a bar chart from an image of an Excel table, LLaVA-1.5-13B (a larger model) fails to follow instructions and only prints the data, whereas LLaVA-Phi (3B) correctly generates matplotlib code to render the plot.
Key Novelty
High-Performance Small VLM via Phi-2 Integration
  • Leverages Phi-2 (2.7B), a small language model highly optimized for reasoning and coding, as the language backbone instead of the standard LLaMA/Vicuna (7B/13B)
  • Combines the compact backbone with the proven LLaVA-1.5 training recipe (connector pre-training + visual instruction tuning) to unlock multi-modal capabilities at a fraction of the size
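The two-stage recipe named above can be sketched as a small freezing schedule. This summary does not say which modules are trainable in each stage, so the schedule below assumes the usual LLaVA-1.5 convention (connector only in stage 1, connector plus language model in stage 2, vision encoder frozen throughout). The module and stage names are hypothetical labels chosen for illustration.

```python
from dataclasses import dataclass

# Hypothetical module names for the three parts of a LLaVA-style model.
MODULES = ("vision_encoder", "connector", "language_model")


@dataclass(frozen=True)
class Stage:
    name: str
    trainable: frozenset  # modules whose weights are updated in this stage


# Assumed schedule, following the LLaVA-1.5 convention (not confirmed by this summary).
STAGES = (
    Stage("connector_pretraining", frozenset({"connector"})),
    Stage("visual_instruction_tuning", frozenset({"connector", "language_model"})),
)


def frozen_modules(stage: Stage) -> list:
    """Return the modules kept frozen in a stage, in MODULES order."""
    return [m for m in MODULES if m not in stage.trainable]


if __name__ == "__main__":
    for stage in STAGES:
        print(f"{stage.name}: train {sorted(stage.trainable)}, freeze {frozen_modules(stage)}")
```

In a real training script the same schedule would typically be applied by toggling gradient tracking on each module's parameters before building the optimizer for that stage.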
Evaluation Highlights
  • Outperforms larger 7B+ models (IDEFICS-9B, InstructBLIP-7B) on ScienceQA with 71.4% accuracy despite having only 3B parameters
  • Achieves comparable performance to LLaVA-1.5-13B on visual reasoning benchmarks like VQAv2 and POPE
  • Surpasses concurrent efficient model MobileVLM on all five reported benchmarks, including a significant lead on ScienceQA
Breakthrough Assessment
8/10
Demonstrates that the quality of the language backbone (Phi-2) can matter more than sheer size for VLM performance, making capable multi-modal agents on edge devices feasible. Outperforms models up to 3x its size, such as IDEFICS-9B on ScienceQA.