| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| GuardReasoner 8B consistently outperforms baselines on Prompt Harmfulness tasks, especially on adversarial benchmarks. | | | | |
| Average Prompt Harmfulness (6 benchmarks) | F1 | 63.20 | 81.09 | +17.89 |
| ToxicChat (Adversarial) | F1 | 73.91 | 79.27 | +5.36 |
| In Response Harmfulness tasks, GuardReasoner 8B achieves state-of-the-art performance. | | | | |
| Average Response Harmfulness (5 benchmarks) | F1 | 74.45 | 81.22 | +6.77 |
| Ablation studies confirm the value of both R-SFT and HS-DPO. | | | | |
| HarmBench Prompt | F1 | 75.05 | 81.39 | +6.34 |
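
The Δ column appears to be the plain difference between the "This Paper" and "Baseline" F1 scores, in F1 points. The sketch below is one way to sanity-check that arithmetic. The numbers are copied from the table above; the names `ROWS` and `check_deltas` and the rounding tolerance are illustrative assumptions, not part of the paper.

```python
# Rows copied from the table above:
# (benchmark, baseline F1, this-paper F1, reported delta)
ROWS = [
    ("Average Prompt Harmfulness (6 benchmarks)", 63.20, 81.09, 17.89),
    ("ToxicChat (Adversarial)", 73.91, 79.27, 5.36),
    ("Average Response Harmfulness (5 benchmarks)", 74.45, 81.22, 6.77),
    ("HarmBench Prompt", 75.05, 81.39, 6.34),
]


def check_deltas(rows, tol=0.005):
    """Return the names of rows whose reported delta is not (ours - baseline).

    tol is an assumed tolerance of half of the last reported decimal place,
    which absorbs floating-point noise in the subtraction.
    """
    mismatches = []
    for name, baseline, ours, reported in rows:
        if abs((ours - baseline) - reported) > tol:
            mismatches.append(name)
    return mismatches


if __name__ == "__main__":
    bad = check_deltas(ROWS)
    print("all deltas consistent" if not bad else f"mismatched rows: {bad}")
```

An empty list from `check_deltas` means every reported Δ matches the subtraction to two decimal places.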