| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| FrozenLake results highlight the failure of CoT in complex environments and MVoT's robustness. | ||||
| FrozenLake | Accuracy | 78.60 | 85.60 | +7.00 |
| FrozenLake (6x6 Grid) | Accuracy | 39.11 | 83.00 | +43.89 |
| On simpler tasks where text descriptions are sufficient, MVoT remains competitive but does not exceed CoT. | ||||
| Maze | Accuracy | 95.00 | 92.95 | -2.05 |
| MiniBehavior | Accuracy | 95.00 | 95.14 | +0.14 |
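
The Δ column is the accuracy under "This Paper" minus the accuracy under "Baseline", in percentage points. A minimal sketch for rechecking that arithmetic follows; the values are copied from the table, and the names `rows` and `delta` are illustrative, not taken from the paper.

```python
# Values copied from the table above: (benchmark, baseline, this paper).
# The variable and function names are hypothetical, chosen for illustration.
rows = [
    ("FrozenLake", 78.60, 85.60),
    ("FrozenLake (6x6 Grid)", 39.11, 83.00),
    ("Maze", 95.00, 92.95),
    ("MiniBehavior", 95.00, 95.14),
]


def delta(baseline: float, ours: float) -> float:
    """Difference in accuracy points, rounded to two decimals."""
    return round(ours - baseline, 2)


# Map each benchmark to its recomputed difference.
deltas = {name: delta(baseline, ours) for name, baseline, ours in rows}

for name, d in deltas.items():
    # The "+" flag prints an explicit sign, matching the table's Δ column.
    print(f"{name}: {d:+.2f}")
```

Rounding to two decimals keeps floating-point noise out of the comparison with the printed Δ values.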