| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| GPT-4o evaluation on the Real-Edit benchmark: Multi-Reward conditioning improves instruction following (47.7 to 57.1 accuracy) and detail preservation (4.21 to 4.50) over the InsPix2Pix baseline. | ||||
| Real-Edit | Following (Accuracy) | 47.7 | 57.1 | +9.4 |
| Real-Edit | Preserving (Score 0-5) | 4.21 | 4.50 | +0.29 |
| Real-Edit | Quality (Score 0-5) | 4.33 | 4.41 | +0.08 |
| Human evaluation confirms the improvements seen in automated metrics, with the proposed method achieving the best scores across all three categories. | ||||
| Real-Edit (Subset) | Following (Score 0-5) | 3.37 | 4.08 | +0.71 |
| Real-Edit (Subset) | Preserving (Score 0-5) | 3.67 | 4.00 | +0.33 |
| Compatibility experiments: with SmartEdit as the baseline, which is stronger than InsPix2Pix, the Multi-Reward framework still improves instruction following. | ||||
| Real-Edit | Following (Accuracy) | 61.3 | 63.0 | +1.7 |
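
The Δ column is the proposed score minus the baseline score. A minimal sketch for re-checking that arithmetic follows. The values are copied from the table above, while the names `ROWS`, `delta`, and `find_mismatches` are illustrative and do not come from the paper.

```python
import math

# (benchmark, metric, baseline, proposed, reported_delta), copied from the table.
ROWS = [
    ("Real-Edit", "Following (Accuracy)", 47.7, 57.1, 9.4),
    ("Real-Edit", "Preserving (Score 0-5)", 4.21, 4.50, 0.29),
    ("Real-Edit", "Quality (Score 0-5)", 4.33, 4.41, 0.08),
    ("Real-Edit (Subset)", "Following (Score 0-5)", 3.37, 4.08, 0.71),
    ("Real-Edit (Subset)", "Preserving (Score 0-5)", 3.67, 4.00, 0.33),
    ("Real-Edit", "Following (Accuracy), SmartEdit baseline", 61.3, 63.0, 1.7),
]


def delta(baseline, proposed, ndigits=2):
    """Return proposed minus baseline, rounded to the table's precision."""
    return round(proposed - baseline, ndigits)


def find_mismatches(rows, tol=1e-9):
    """Return rows whose recomputed delta differs from the reported one."""
    bad = []
    for benchmark, metric, baseline, proposed, reported in rows:
        computed = delta(baseline, proposed)
        if not math.isclose(computed, reported, abs_tol=tol):
            bad.append((benchmark, metric, computed, reported))
    return bad


mismatches = find_mismatches(ROWS)
for benchmark, metric, baseline, proposed, _ in ROWS:
    print(f"{benchmark} | {metric}: {delta(baseline, proposed):+.2f}")
print("mismatches:", mismatches)
```

Rounding to two decimals before comparing avoids spurious failures from floating-point subtraction. Note that the accuracy rows and the 0-5 score rows are on different scales, so their deltas are not directly comparable.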