| Benchmark | Metric | Baseline (Dense) | This Paper (MoT) | Δ (MoT - Dense) |
|---|---|---|---|---|
| 7B Scale Chameleon Setting (Text + Image): MoT matches dense baseline performance with substantially fewer FLOPs. | ||||
| Chameleon Pre-training | FLOPs to match dense performance, image (% of dense FLOPs) | 100.0 | 34.8 | -65.2 |
| Chameleon Pre-training | FLOPs to match dense performance, text (% of dense FLOPs) | 100.0 | 55.8 | -44.2 |
| 7B Scale Speech Extension (Text + Image + Speech): MoT shows similarly large gains for the new modality. | ||||
| Speech Pre-training (LibriLight/SpiRit-LM) | FLOPs to match dense performance, speech (% of dense FLOPs) | 100.0 | 37.2 | -62.8 |
| Transfusion Setting (Text Autoregressive + Image Diffusion): MoT outperforms Dense on image generation metrics. | ||||
| Transfusion Image Generation | Validation loss (lower is better) | 0.126 | 0.120 | -0.006 |
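
The Δ column can be recomputed from the other two value columns. The sketch below does this in Python, using only the numbers from the table. The row labels and the helper names `delta` and `flop_reduction_factor` are illustrative assumptions, not taken from the paper. The reduction factor is a derived quantity (dense FLOPs divided by MoT FLOPs) that the table does not report, and it only makes sense for the relative-FLOPs rows.

```python
# Rows copied from the table above: (label, dense baseline, MoT).
# Labels are shorthand chosen for this sketch.
rows = [
    ("chameleon_image_flops_pct", 100.0, 34.8),
    ("chameleon_text_flops_pct", 100.0, 55.8),
    ("speech_flops_pct", 100.0, 37.2),
    ("transfusion_val_loss", 0.126, 0.120),
]


def delta(baseline: float, mot: float, ndigits: int = 3) -> float:
    """Difference MoT - baseline, rounded to hide floating-point noise."""
    return round(mot - baseline, ndigits)


def flop_reduction_factor(baseline_pct: float, mot_pct: float) -> float:
    """How many times fewer FLOPs MoT needs than the dense baseline.

    Only meaningful for the relative-FLOPs rows, not for the loss row.
    """
    return baseline_pct / mot_pct


deltas = [delta(b, m) for _, b, m in rows]

for (label, b, m), d in zip(rows, deltas):
    line = f"{label}: delta = {d}"
    if label.endswith("flops_pct"):
        line += f", reduction factor = {flop_reduction_factor(b, m):.2f}x"
    print(line)
```

For the FLOPs rows, Δ is a difference in percentage points of the dense FLOP budget. For the Transfusion row, it is an absolute difference in validation loss.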