Self-MOA significantly improves safety scores across multiple attack datasets compared to both the unaligned base model and the PKU-RLHF baseline.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Attack Datasets (Average) | Safety Improvement % | 0.0 | 41.2 | +41.2 |
| SaladBench | Safety Improvement % | 0.0 | 35.0 | +35.0 |
| Attack Datasets (Average) | Safety Score Improvement vs Baseline | 0.0 | 17.1 | +17.1 |
| SaladBench | Safety Score Improvement vs Baseline | 0.0 | 12.3 | +12.3 |
| Training Data Usage | Dataset Size Factor | 11 | 1 | -10 |