Evaluation Setup
Text-to-image generation, evaluated on three distinct tasks: aesthetics, fairness, and compositionality
Benchmarks:
- PartiPrompts: general text-to-image generation (challenging prompts)
- DiffusionDB (test split): human preference evaluation
- HRSBench: fairness/bias evaluation
- Custom Composition Set [New]: object composition (spatial relationships)
Metrics:
- Human Preference (Head-to-head win rate)
- ImageReward Score
- Aesthetic Score
- Statistical Parity (L2 distance from uniform)
- Object Detection Confidence
- Statistical methodology: Not explicitly reported in the paper
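The Statistical Parity metric listed above is described only as an L2 distance from uniform. A minimal sketch of one way to compute such a quantity is shown here. The function name `statistical_parity`, the category labels, and the choice to leave the distance unnormalized are all assumptions for illustration; the paper's exact formulation may differ.

```python
import math
from collections import Counter


def statistical_parity(labels, categories):
    """L2 distance between the empirical category distribution and uniform.

    `labels` holds one predicted category per generated image (for example,
    the output of some attribute classifier, which is assumed here).
    `categories` is the full list of possible categories.
    Returns 0.0 for a perfectly balanced sample; lower is better.
    """
    if not labels:
        raise ValueError("labels must be non-empty")
    unknown = set(labels) - set(categories)
    if unknown:
        raise ValueError(f"labels outside the category list: {sorted(unknown)}")

    counts = Counter(labels)
    n = len(labels)
    uniform = 1.0 / len(categories)
    # Sum squared deviations from the uniform share over every category,
    # including categories that never appear in the sample.
    squared = sum((counts.get(c, 0) / n - uniform) ** 2 for c in categories)
    return math.sqrt(squared)


if __name__ == "__main__":
    cats = ["a", "b", "c", "d"]  # hypothetical category names
    print(statistical_parity(["a", "b", "c", "d"], cats))  # balanced sample
    print(statistical_parity(["a", "a", "a", "a"], cats))  # fully skewed sample
```

Under this definition a balanced sample scores 0, and the score grows as generations concentrate in fewer categories, which matches the "lower is better" reading used in the results.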
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Performance on Human Preference and Aesthetics (PartiPrompts). The proposed method outperforms baselines in human evaluation. | | | | |
| PartiPrompts | Human Preference Win Rate (vs SDv2) | 59.3 | 80.3 | +21.0 |
| PartiPrompts | Aesthetic Score | 6.05 | 6.24 | +0.19 |
| PartiPrompts | ImageReward Score | 1.23 | 1.13 | -0.10 |
| Performance on Compositionality. The method improves adherence to object relationships. | | | | |
| Composition Test Set (Seen Objects) | Object Detection Confidence | 0.456 | 0.781 | +0.325 |
| Composition Test Set (Unseen Objects) | Object Detection Confidence | 0.432 | 0.720 | +0.288 |
| Multi-Task Joint Training. Shows the model can improve all metrics simultaneously. | | | | |
| Multi-task Evaluation | ImageReward | 0.36 | 0.85 | +0.49 |
| Multi-task Evaluation | Statistical Parity (lower is better) | 0.334 | 0.082 | -0.252 |

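
The Δ column in the results above can be re-derived from the Baseline and This Paper values, and the sign needs to be read against each metric's direction. The sketch below does both. The values are transcribed from the table; the `higher_is_better` flags are an assumption for every metric except Statistical Parity, which the table itself marks as lower-is-better.

```python
# Rows transcribed from the results table:
# (benchmark, metric, baseline, this_paper, higher_is_better)
# The higher_is_better flag is assumed, except for Statistical Parity,
# which the table labels "lower is better".
ROWS = [
    ("PartiPrompts", "Human Preference Win Rate (vs SDv2)", 59.3, 80.3, True),
    ("PartiPrompts", "Aesthetic Score", 6.05, 6.24, True),
    ("PartiPrompts", "ImageReward Score", 1.23, 1.13, True),
    ("Composition Test Set (Seen Objects)", "Object Detection Confidence", 0.456, 0.781, True),
    ("Composition Test Set (Unseen Objects)", "Object Detection Confidence", 0.432, 0.720, True),
    ("Multi-task Evaluation", "ImageReward", 0.36, 0.85, True),
    ("Multi-task Evaluation", "Statistical Parity", 0.334, 0.082, False),
]


def delta(baseline, proposed):
    """Signed difference, rounded to the table's maximum precision."""
    return round(proposed - baseline, 3)


def improved(baseline, proposed, higher_is_better):
    """True when the change moves in the metric's preferred direction."""
    return proposed > baseline if higher_is_better else proposed < baseline


def summarize(rows):
    """Return (benchmark, metric, delta, improved) for each row."""
    return [
        (bench, metric, delta(base, prop), improved(base, prop, hib))
        for bench, metric, base, prop, hib in rows
    ]


if __name__ == "__main__":
    for bench, metric, d, ok in summarize(ROWS):
        status = "improved" if ok else "regressed"
        print(f"{bench} | {metric}: {d:+.3f} ({status})")
```

Under these assumptions, the only row flagged as a regression is ImageReward Score on PartiPrompts; the negative Δ for Statistical Parity counts as an improvement because lower is better for that metric.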
Main Takeaways
- RL fine-tuning scales effectively to millions of prompts and converges faster (~1k steps) than gradient-based methods such as DRaFT (~4k steps).
- Distribution-based rewards successfully mitigate skintone bias without needing curated balanced datasets.
- Multi-objective training largely avoids the 'alignment tax': the joint model retains >80% of the performance of the single-task specialists while improving on every metric relative to the base model.
- ReFL and other direct reward optimization methods are prone to 'reward hacking', generating high-scoring but repetitive or low-quality images, whereas RL (PPO) is more robust.