Evaluation Setup
The paper fine-tunes aligned models on mixtures of benign and potentially unsafe data, then evaluates both safety and utility of the resulting models.
Benchmarks:
- Anthropic HH (Safety and Helpfulness Dialogue)
- Orca (Instruction Tuning)
- HEx-PHI (Safety Evaluation)
Metrics:
- Win Rate (judged vs. baseline)
- Statistical methodology: Not explicitly reported in the paper
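Since the paper does not spell out its scoring procedure, here is a minimal sketch of how a win-rate metric is commonly computed from pairwise judge verdicts; the `judgments` list and the half-credit treatment of ties are illustrative assumptions, not details from the paper.

```python
# Hedged sketch: computing a win rate from pairwise judge verdicts.
# `judgments` is a hypothetical list of outcomes ("win"/"tie"/"loss")
# from a judge comparing the fine-tuned model against the baseline.

def win_rate(judgments):
    """Fraction of comparisons the candidate wins; ties count as half."""
    if not judgments:
        return 0.0
    score = sum(
        1.0 if j == "win" else 0.5 if j == "tie" else 0.0
        for j in judgments
    )
    return score / len(judgments)

print(win_rate(["win", "win", "tie", "loss"]))  # → 0.625
```

A win rate above 0.5 under this convention means the candidate beats the baseline more often than not.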
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Anthropic HH / Orca (Combined) | Win Rate Increase | 0.0 | 8.5 | +8.5 |
| Anthropic HH / Orca (Combined) | Win Rate Increase | 0.0 | 9.7 | +9.7 |

SEAL consistently improves win rates against random data selection baselines across different model architectures.
Main Takeaways
- SEAL effectively filters out harmful data that conflicts with safety alignment, leading to higher win rates compared to random selection.
- The method is robust across different model architectures (Llama-3, Merlinite, Pythia).
- Data selection weights learned by SEAL are interpretable: selected data shows qualitatively superior safety compared to filtered-out data.
- SEAL exhibits transferability: a selector trained with a smaller proxy model works effectively for fine-tuning larger models.
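To make the data-selection idea concrete, the following is a minimal sketch (not the paper's implementation) of filtering a fine-tuning mixture by learned selection weights; the function name, the `keep_frac` parameter, and the toy scores are all illustrative assumptions.

```python
# Hedged sketch: keeping the top-weighted fraction of a fine-tuning
# mixture. `weights[i]` is assumed to be a selector's score for
# example `data[i]`; higher is taken to mean safer / more useful.

def select_top_fraction(data, weights, keep_frac=0.8):
    """Keep the top `keep_frac` of examples ranked by selection weight."""
    ranked = sorted(zip(data, weights), key=lambda dw: dw[1], reverse=True)
    k = max(1, int(len(ranked) * keep_frac))
    return [d for d, _ in ranked[:k]]

examples = ["safe_a", "safe_b", "harmful_c", "safe_d"]
scores = [0.9, 0.8, 0.1, 0.7]
print(select_top_fraction(examples, scores, keep_frac=0.75))
# → ['safe_a', 'safe_b', 'safe_d']
```

In this toy case the low-scoring `harmful_c` example is dropped before fine-tuning, mirroring the interpretability claim that filtered-out data is qualitatively less safe.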