*Comparison of the final Beaver-v3 model against the base Alpaca-7B model, showing improvements in both helpfulness and safety.*

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Evaluation Prompt Set (Human Eval) | Harmful Response Ratio (%) | 53.08 | 2.45 | -50.63 |
| Evaluation Prompt Set (GPT-4 Eval) | Helpfulness Elo | 1000 | 1244.91 | +244.91 |
| Evaluation Prompt Set (GPT-4 Eval) | Harmlessness Elo | 1000 | 1268.31 | +268.31 |

*Ablation study comparing Safe RLHF's dynamic optimization against Reward Shaping (static weighting).*

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Evaluation Prompt Set | Harmlessness Win Rate vs SFT (%) | 56.0 | 62.0 | +6.0 |
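The Elo numbers above are relative ratings derived from pairwise preference judgments, with the baseline anchored at 1000. A minimal sketch of the standard Elo update rule is shown below; this is an illustrative assumption, not the paper's exact aggregation procedure, and the `K = 32` factor and the 8-of-10 outcome sequence are hypothetical.

```python
# Minimal Elo-update sketch (standard Elo formula; the paper's exact
# evaluation pipeline may differ). Both models start at 1000 and ratings
# are updated from a sequence of pairwise preference judgments.

def expected_score(r_a: float, r_b: float) -> float:
    """Expected win probability of A against B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """One Elo update; score_a is 1.0 (A wins), 0.5 (tie), or 0.0 (A loses)."""
    e_a = expected_score(r_a, r_b)
    r_a_new = r_a + k * (score_a - e_a)
    r_b_new = r_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return r_a_new, r_b_new

# Hypothetical run: model A wins 8 of 10 pairwise comparisons.
ra, rb = 1000.0, 1000.0
for score in [1.0] * 8 + [0.0] * 2:
    ra, rb = elo_update(ra, rb, score)
```

Because the update is symmetric, the two ratings always sum to the same total; a model that wins the majority of comparisons ends above 1000 and its opponent correspondingly below.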