Analysis of the benchmark composition reveals significant variance in difficulty across languages and model-specific failure modes.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| HalloMTBench | Instance Count (Portuguese) | Not reported in the paper | 1025 | N/A |
| HalloMTBench | Instance Count (Chinese) | Not reported in the paper | 51 | N/A |
| HalloMTBench | Extraneous Addition Rate (Qwen3-Max) | Not reported in the paper | 68.8% | N/A |
| HalloMTBench | Incorrect Language Rate (GPT-4o-mini) | Not reported in the paper | 69.2% | N/A |