| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| JRM significantly outperforms state-of-the-art baselines, including GPT-5, on standard reward modeling benchmarks. | | | | |
| EditReward-Bench | Overall Accuracy | 75.5 | 85.1 | +9.6 |
| EditReward-Bench | Prompt Following Accuracy | Not reported in the paper | 85.4 | Not reported in the paper |
| MMRB2 | Composite Score | 61.9 | 69.3 | +7.4 |
| Representation analysis shows joint training prevents feature collapse. | | | | |
| Representation Space Analysis | Effective Feature Space Rank | 46.86 | 91.77 | +44.91 |
| Downstream RL experiments demonstrate JRM guides generation models better than GPT-4.1. | | | | |
| GEdit-Bench | Performance Gain | 0.45 | 1.00 | +0.55 |
| ImageEdit-Bench | Performance Gain | 0.26 | 0.50 | +0.24 |
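
The Δ column is the "This Paper" value minus the "Baseline" value. As a quick sanity check, the sketch below recomputes it from the rows above. The names `rows`, `delta`, and `check_rows` are illustrative and not from the paper, and the Prompt Following Accuracy row is left out because its baseline is not reported.

```python
# Rows copied from the table: (benchmark, metric, baseline, this_paper, reported_delta).
# The Prompt Following Accuracy row is omitted: its baseline is not reported.
rows = [
    ("EditReward-Bench", "Overall Accuracy", 75.5, 85.1, 9.6),
    ("MMRB2", "Composite Score", 61.9, 69.3, 7.4),
    ("Representation Space Analysis", "Effective Feature Space Rank", 46.86, 91.77, 44.91),
    ("GEdit-Bench", "Performance Gain", 0.45, 1.00, 0.55),
    ("ImageEdit-Bench", "Performance Gain", 0.26, 0.50, 0.24),
]


def delta(baseline, ours, ndigits=2):
    """Return ours - baseline, rounded to suppress floating-point noise."""
    return round(ours - baseline, ndigits)


def check_rows(table):
    """Return (benchmark, metric, computed_delta, matches_reported) per row."""
    results = []
    for benchmark, metric, baseline, ours, reported in table:
        computed = delta(baseline, ours)
        results.append((benchmark, metric, computed, computed == reported))
    return results


for benchmark, metric, computed, ok in check_rows(rows):
    status = "ok" if ok else "MISMATCH"
    print(f"{benchmark} / {metric}: {computed:+.2f} ({status})")
```

Every reported Δ in the table matches the recomputed difference at two decimal places.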