Evaluation Setup
Controlled generation on toxicity and sentiment tasks
Benchmarks:
- RealToxicityPrompts (Detoxification)
- OpenWebText (Sentiment steering)
Metrics:
- Average Max Toxicity
- Toxic Rate
- Diversity (distinct n-grams)
- Fluency (Perplexity)
- Positive Rate
- Statistical methodology: Not explicitly reported in the paper
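Of the metrics above, diversity via distinct n-grams is simple enough to sketch directly. The following is a minimal illustration (the function name `distinct_n` and exact tokenization by whitespace are assumptions, not the paper's implementation): it counts the ratio of unique n-grams to total n-grams across a set of generations.

```python
from typing import List

def distinct_n(texts: List[str], n: int) -> float:
    """Diversity metric: unique n-grams / total n-grams across all generations."""
    total = 0
    unique = set()
    for text in texts:
        tokens = text.split()  # assumption: simple whitespace tokenization
        ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(ngrams)
        unique.update(ngrams)
    return len(unique) / total if total else 0.0
```

Higher values indicate less repetitive output; the metric is typically reported for n = 1, 2, 3.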
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| LLaMA-65B decoding | Relative Computational Overhead | 0.00 | 0.03 | +0.03 |
| Jigsaw Unintended Bias | Mean Squared Error (MSE) | 0.0000 | 0.0147 | +0.0147 |
Main Takeaways
- RAD achieves the lowest Average Max Toxicity among all evaluated methods, including those that require expensive re-training (e.g., PPO, Quark)
- Tuning the beta parameter allows trading off attribute alignment (toxicity/sentiment) against fluency
- Computational overhead becomes negligible (~3%) when the base language model (e.g., LLaMA-65B) is much larger than the reward model (GPT-2 Small)
- The reward model's unidirectionality is critical for scaling: cached activations for the shared prefix reduce scoring complexity from quadratic to linear in sequence length
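The decoding scheme in the takeaways above can be sketched as a logit adjustment at each step: score the top-k candidate continuations with the reward model and shift their logits by beta times the reward. This is a simplified sketch, not the paper's code; the function names (`rad_adjust_logits`, `reward_fn`) and the plain list-based prefix are assumptions, and the real method caches the unidirectional reward model's activations so each step adds only k incremental forward passes.

```python
import numpy as np

def rad_adjust_logits(logits: np.ndarray, reward_fn, prefix_ids: list,
                      beta: float, k: int = 20) -> np.ndarray:
    """Reward-augmented decoding step (sketch): boost top-k logits by beta * reward."""
    top_k = np.argsort(logits)[-k:]  # indices of the k highest-logit tokens
    adjusted = logits.copy()
    for tok in top_k:
        # Score the candidate continuation with the reward model.
        # With a unidirectional RM, states for prefix_ids can be cached,
        # making the per-step cost constant and the total cost linear.
        r = reward_fn(prefix_ids + [int(tok)])
        adjusted[tok] = logits[tok] + beta * r
    return adjusted  # sample or argmax from these adjusted logits
```

Larger beta pushes generation toward high-reward (e.g., low-toxicity) tokens at some cost in fluency, which is exactly the trade-off the beta parameter controls.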