
Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs

A Abouelenin, A Ashfaq, A Atkinson, H Awadalla…
Microsoft
arXiv, 3/2025
MM Speech Pretraining Reasoning RL

📝 Paper Summary

Small Language Models (SLMs) · Multimodal Large Language Models (MLLMs)
Phi-4-Mini is a compact 3.8B-parameter language model, and Phi-4-Multimodal extends the same backbone to vision and speech. Together they reach state-of-the-art performance for their size class by combining curated synthetic data with a mixture-of-LoRAs architecture that unifies text, vision, and speech without cross-modality interference.
Core Problem
Adding modalities to a language model typically means either fine-tuning the base model, which degrades text performance, or maintaining a separate model per modality, which is inefficient on resource-constrained devices.
Why it matters:
  • Deploying multiple specialized models on edge devices is computationally expensive and memory-intensive
  • Fine-tuning a base model for vision or audio often causes 'catastrophic forgetting' of its original reasoning and language capabilities
  • Existing solutions like cross-attention layers (e.g., Flamingo) often lag behind fully fine-tuned models in performance
Concrete Example: When a standard multimodal model is fine-tuned to understand images, its ability to solve complex text-only math problems often drops significantly. Phi-4-Multimodal avoids this by keeping the base text model frozen and using specialized adapters.
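The frozen-backbone idea in the example above can be illustrated with a toy single-layer sketch. It is not the report's implementation: the class name `FrozenLinearWithLoRAs`, the rank, the initialization, and selecting an adapter by a modality label are all assumptions made for illustration. What it shows is that a text-only input goes through the frozen weights alone, so changes to a modality adapter cannot affect text outputs.

```python
import random


def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


class FrozenLinearWithLoRAs:
    """Toy linear layer: frozen weight W plus one low-rank update per modality.

    Hypothetical illustration only; a real model applies adapters across
    many layers and trains A and B by gradient descent.
    """

    def __init__(self, W, rank=2, modalities=("vision", "speech"), seed=0):
        rng = random.Random(seed)
        self.W = W  # frozen: never modified after construction
        d_out, d_in = len(W), len(W[0])
        self.adapters = {}
        for name in modalities:
            # A: rank x d_in (small random), B: d_out x rank (zeros),
            # so each adapter starts as a no-op.
            A = [[rng.gauss(0, 0.1) for _ in range(d_in)] for _ in range(rank)]
            B = [[0.0] * rank for _ in range(d_out)]
            self.adapters[name] = (A, B)

    def forward(self, x, modality=None):
        y = matvec(self.W, x)          # frozen text path
        if modality is None:
            return y                   # text-only input: adapters are skipped
        A, B = self.adapters[modality]
        delta = matvec(B, matvec(A, x))  # low-rank update B(Ax)
        return [yi + di for yi, di in zip(y, delta)]
```

Because the text path never reads `self.adapters`, training or swapping the vision or speech adapter leaves `forward(x)` unchanged, which is the property the summary attributes to the frozen backbone.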
Key Novelty
Unified Multimodal SLM via Mixture of LoRAs
  • Integrates vision, speech, and text into a single model by attaching modality-specific LoRA (Low-Rank Adaptation) adapters to a frozen language backbone
  • Uses a dynamic multi-crop strategy for images that calculates crops based on size rather than just aspect ratio, avoiding unreasonable resizing of small images
  • Incorporates a dedicated speech post-training stage that unlocks speech summarization and translation, unlike models that only perform recognition (ASR)
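The size-based crop selection mentioned in the list above might look roughly like the following sketch. The summary gives no formula, so everything here is an assumption: the function name `crop_grid`, the crop size of 448 pixels, the budget of 16 crops, and the aspect-ratio fallback used when the image is too large for the budget.

```python
import math


def crop_grid(width, height, crop_size=448, max_crops=16):
    """Return (rows, cols) of crops for an image (illustrative sketch).

    Size-based rule: count how many crop_size tiles each side needs.
    Small images therefore get a small grid instead of being upscaled
    to fill a grid chosen purely by aspect ratio.
    """
    cols = math.ceil(width / crop_size)
    rows = math.ceil(height / crop_size)
    if rows * cols <= max_crops:
        return rows, cols

    # Assumed fallback: the image exceeds the crop budget, so pick the
    # grid within the budget whose aspect ratio is closest to the image's,
    # preferring more crops on ties.
    target = width / height
    best = None  # (error, rows, cols)
    for r in range(1, max_crops + 1):
        for c in range(1, max_crops // r + 1):
            err = abs(c / r - target)
            if (best is None or err < best[0]
                    or (err == best[0] and r * c > best[1] * best[2])):
                best = (err, r, c)
    return best[1], best[2]
```

Under these assumed defaults, `crop_grid(300, 200)` returns `(1, 1)`, so a small image stays a single crop, while `crop_grid(1000, 500)` returns `(2, 3)`.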
Evaluation Highlights
  • Ranks first on the OpenASR leaderboard as of the report's publication, even though the speech LoRA component has only 460 million parameters
  • Matches the performance of models twice its size on math and coding tasks requiring complex reasoning
  • Achieves reasoning performance on par with significantly larger models like DeepSeek-R1-Distill-Qwen-7B (in the experimental reasoning-enhanced version)
Breakthrough Assessment
9/10
Achieves SOTA performance for its size class (3.8B) across text, vision, and speech while solving the modality interference problem via mixture-of-LoRAs. Strong practical value for edge deployment.