Evaluation Setup
The evaluation covers multimodal benchmarks across the vision, audio, and molecular domains.
Benchmarks:
- LLaVA Benchmarks (VQA and Multimodal Reasoning)
- LTU Benchmarks (Audio Classification and Captioning)
- MolCA Benchmarks (Molecular Captioning)
Metrics:
- Accuracy
- mAP (Audio)
- SPICE (Audio/Molecule Captioning)
- BLEU (Molecule Captioning)
- METEOR (Molecule Captioning)
Statistical methodology: not explicitly reported in the paper.
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
| --- | --- | --- | --- | --- |
| LLaVA-1.5-7B performance remains stable up to DeepInsert-8, with significant drops at DeepInsert-12. | | | | |
| LLaVA-Avg (7B) | Average Score | 66.5 | 66.7 | +0.2 |
| LLaVA-Avg (7B) | Average Score | 66.5 | 65.5 | -1.0 |
| LTU (Audio) shows high redundancy, maintaining performance even when skipping 12-24 layers. | | | | |
| LTU Classification Avg | Accuracy/mAP | 49.0 | 49.6 | +0.6 |
| LTU Captioning Avg | SPICE | 15.9 | 16.4 | +0.5 |
| MolCA (Molecular) results show parity or improvement with deep insertion layers. | | | | |
| ChEBI-20 (MolCA) | BLEU-2 | 0.459 | 0.467 | +0.008 |
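The Δ column is the reported score minus the baseline. A minimal sketch that recomputes it from the values in the table is given here as a sanity check. The relative-change figure is an added illustration and is not reported in the source; the function names are hypothetical.

```python
# Rows copied from the Key Results table: (benchmark, metric, baseline, reported).
ROWS = [
    ("LLaVA-Avg (7B)", "Average Score", 66.5, 66.7),
    ("LLaVA-Avg (7B)", "Average Score", 66.5, 65.5),
    ("LTU Classification Avg", "Accuracy/mAP", 49.0, 49.6),
    ("LTU Captioning Avg", "SPICE", 15.9, 16.4),
    ("ChEBI-20 (MolCA)", "BLEU-2", 0.459, 0.467),
]


def delta(baseline: float, reported: float, ndigits: int = 3) -> float:
    """Absolute change (reported - baseline), rounded to hide float noise."""
    return round(reported - baseline, ndigits)


def relative_change_pct(baseline: float, reported: float) -> float:
    """Change as a percentage of the baseline (illustrative, not in the source)."""
    return round(100.0 * (reported - baseline) / baseline, 2)


for name, metric, base, rep in ROWS:
    print(
        f"{name:24s} {metric:14s} "
        f"delta={delta(base, rep):+.3f} rel={relative_change_pct(base, rep):+.2f}%"
    )
```

Because the benchmarks use different scales (for example BLEU-2 in [0, 1] versus average scores near 66), the relative change can make rows easier to compare than the raw Δ.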
Main Takeaways
- Multimodal tokens do not need to pass through all LLM layers; functional redundancy exists in early layers.
- Vision (LLaVA) tolerates skipping roughly 25% of layers (DI-8) with minimal loss (~1%), likely because it uses many tokens (576).
- Audio and molecular modalities tolerate skipping up to 50% of layers (DI-12 to DI-16) at parity or with a small gain, possibly because they use fewer tokens (32 Q-Tokens) or have higher training-data-to-token ratios.
- Efficiency gains (FLOPs/Latency) are achieved by simply training with the DeepInsert architecture, requiring no complex dynamic routing.
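The idea behind these takeaways can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the authors' implementation: it assumes DI-k means the multimodal tokens first enter the model at layer k, that they are prepended to the text tokens, and that cost is counted as token-layer visits (ignoring the quadratic attention term). The 32-layer depth and the 64 text tokens are assumed values; only the 576 vision tokens come from the summary above. All function names are hypothetical.

```python
from typing import Callable, List, Sequence

Layer = Callable[[List[str]], List[str]]


def deep_insert_forward(
    layers: Sequence[Layer],
    text_tokens: List[str],
    mm_tokens: List[str],
    insert_layer: int,
) -> List[str]:
    """Run text tokens through every layer; add multimodal tokens at `insert_layer`."""
    if not 0 <= insert_layer < len(layers):
        raise ValueError("insert_layer must be in [0, len(layers))")
    hidden = list(text_tokens)
    for index, layer in enumerate(layers):
        if index == insert_layer:
            # Assumption: multimodal tokens are prepended to the text tokens.
            hidden = list(mm_tokens) + hidden
        hidden = layer(hidden)
    return hidden


def make_counting_layers(num_layers: int, counter: List[int]) -> List[Layer]:
    """Identity layers that record how many tokens each layer processes."""

    def layer(hidden: List[str]) -> List[str]:
        counter[0] += len(hidden)
        return hidden

    return [layer] * num_layers


def token_layer_savings(num_layers: int, insert_layer: int, n_text: int, n_mm: int) -> float:
    """Fraction of token-layer visits avoided relative to inserting at layer 0."""
    baseline = num_layers * (n_text + n_mm)
    saved = insert_layer * n_mm  # multimodal tokens skip the first `insert_layer` layers
    return saved / baseline


# Assumed setup: 32 layers, 64 text tokens, 576 vision tokens, insertion at layer 8.
counter = [0]
layers = make_counting_layers(32, counter)
text = [f"t{i}" for i in range(64)]
vision = [f"v{i}" for i in range(576)]
output = deep_insert_forward(layers, text, vision, insert_layer=8)
print(counter[0], len(output), token_layer_savings(32, 8, 64, 576))
```

Under this simplified cost model the saving grows with both the insertion depth and the share of multimodal tokens in the sequence, so the benefit per skipped layer is largest for token-heavy inputs such as the 576 vision tokens.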