| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Initial analysis reveals a large performance gap between disambiguated and ambiguous contexts across all models, suggesting that the 'bias' is largely a failure to handle ambiguity. | ||||
| BBQ | EMO | 82.21 | 13.62 | -68.59 |
| BBQ | EMO | 80.17 | 6.88 | -73.29 |
| BBQ (Ambiguous) | Bias Reinforcement | 100.00 | 18.50 | -81.50 |
| Generalizability checks on SQuAD-v2 confirm that the flaw is task-specific (ambiguity handling), not identity-specific. | ||||
| SQuAD-v2 | EMO | 50.36 | 7.90 | -42.46 |
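
The Δ column appears to be the "This Paper" value minus the "Baseline" value. A minimal sketch that recomputes it from the rows above is given here as a consistency check; the tuple layout, the `delta` helper, and the 0.005 tolerance are illustrative assumptions, not part of the original evaluation code.

```python
# Rows copied from the table above:
# (benchmark, metric, baseline, this_paper, reported_delta)
rows = [
    ("BBQ", "EMO", 82.21, 13.62, -68.59),
    ("BBQ", "EMO", 80.17, 6.88, -73.29),
    ("BBQ (Ambiguous)", "Bias Reinforcement", 100.00, 18.50, -81.50),
    ("SQuAD-v2", "EMO", 50.36, 7.90, -42.46),
]


def delta(baseline: float, this_paper: float) -> float:
    """Assumed definition of the delta column: this_paper - baseline."""
    return round(this_paper - baseline, 2)


# Collect rows whose reported delta differs from the recomputed one
# by more than half a unit in the second decimal place.
mismatches = [
    row for row in rows
    if abs(delta(row[2], row[3]) - row[4]) >= 0.005
]

print(mismatches)
```

An empty `mismatches` list indicates that the reported Δ values agree with the assumed definition to two decimal places.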