Evaluation Setup
Tasks: instruction following on text-only instructions, and multi-modal reasoning on image-text tasks
Benchmarks:
- Alpaca Self-Instruct (Instruction Following)
- ScienceQA (Multi-modal Question Answering)
- MME (Multi-modal Evaluation)
- MMBench (Multi-modal Evaluation)
Metrics:
- Training Time
- Parameter Count
- Statistical methodology: Not explicitly reported in the paper
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Training Resources | Learnable Parameters | 7,000,000,000 | 1,200,000 | -6,998,800,000 |
| Training Resources | Training Time (Hours) | 3 | 1 | -2 |
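As a quick sanity check, the relative savings implied by the two rows above can be recomputed from the raw figures. This is a minimal sketch; the variable names are illustrative and the values are copied directly from the table.

```python
# Values copied from the Key Results table; variable names are illustrative.
baseline_params = 7_000_000_000  # learnable parameters, baseline
adapter_params = 1_200_000       # learnable parameters, this paper
baseline_hours = 3               # training time, baseline
adapter_hours = 1                # training time, this paper

# Share of the baseline's learnable parameters, as a percentage.
param_fraction_pct = 100 * adapter_params / baseline_params

# How many times shorter the training run is.
speedup = baseline_hours / adapter_hours

print(f"learnable parameters: {param_fraction_pct:.3f}% of baseline")
print(f"training-time speedup: {speedup:.0f}x")
```

The parameter share works out to roughly 0.017%, which rounds to 0.02%.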
Main Takeaways
- Achieves instruction-following capability comparable to fully fine-tuned Alpaca at a fraction of the cost: training takes about one third of the time (1 hour vs. 3 hours) and updates about 0.02% as many parameters (1.2M vs. 7B).
- Zero-initialized gating stabilizes training: because the gate starts at zero, the newly added prompts contribute nothing at the start of training, which avoids the noise from random initialization that typically hampers prompt and adapter tuning.
- The method extends to multi-modal settings (image understanding) without architectural changes to the pre-trained LLM, whereas LoRA implementations are text-only.
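The zero-initialized gating idea in the takeaways can be illustrated with a toy single-query attention step. This is a sketch under stated assumptions, not the paper's implementation: the function names, the scalar `gate`, and the choice to apply softmax separately to the prompt scores and the token scores are all assumptions made for illustration. It shows only the property the takeaway relies on: with the gate at zero, the added prompts leave the frozen model's output unchanged.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_sum(weights, values):
    """Combine value vectors using the given attention weights."""
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

def plain_attention(token_scores, token_values):
    """Attention over the original tokens only (frozen-model behaviour)."""
    return weighted_sum(softmax(token_scores), token_values)

def gated_attention(prompt_scores, token_scores, prompt_values, token_values, gate):
    """Attention over [prompts; tokens] with prompt weights scaled by `gate`.

    Assumption: softmax is applied to the two score groups separately, so
    the token weights do not depend on the (initially random) prompts.
    """
    prompt_w = [gate * w for w in softmax(prompt_scores)]
    token_w = softmax(token_scores)
    return weighted_sum(prompt_w + token_w, prompt_values + token_values)

# Hypothetical toy inputs: two "randomly initialized" prompts, three tokens.
prompt_scores = [5.0, -3.0]
prompt_values = [[9.0, -9.0], [4.0, 7.0]]
token_scores = [0.2, 1.0, -0.5]
token_values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

baseline = plain_attention(token_scores, token_values)
out_zero = gated_attention(prompt_scores, token_scores,
                           prompt_values, token_values, gate=0.0)
out_open = gated_attention(prompt_scores, token_scores,
                           prompt_values, token_values, gate=0.5)

print("baseline:", baseline)
print("gate=0.0:", out_zero)
print("gate=0.5:", out_open)
```

With `gate=0.0` the output matches the prompt-free baseline, so training starts from the pre-trained model's behaviour; as the gate moves away from zero, the prompts gradually begin to influence the output.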