Evaluation Setup
Decoding performance is evaluated on text-generation tasks using both synthetic and real-world datasets.
Benchmarks:
- Synthetic and Real Datasets (Text Generation / Instruction Following)
Metrics:
- Average Reward
- Win-Tie Rate (GPT-4 evaluation)
- Coherence
- Diversity
- Quality
- Statistical methodology: Not explicitly reported in the paper
Key Results
Aggregate performance metrics comparing Transfer Q* (TQ*) against the Controlled Decoding (CD) baseline across tested datasets.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Aggregate (All Datasets) | Average Reward Improvement (Ratio) | 1.00 | 1.45 | +0.45 |
| Aggregate (All Datasets) | Win-Tie Rate (GPT-4) | 32.66 | 67.34 | +34.68 |
Main Takeaways
- TQ* significantly outperforms Controlled Decoding (CD) by using a better estimator of the optimal value function, derived from already-aligned baseline models.
- The method is effective even when the baseline model is aligned to a different reward than the target, validating the 'Indirect Transfer' capability.
- TQ* consistently produces higher quality, more coherent, and more diverse responses compared to baselines like ARGS and CD.
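To make the first takeaway concrete, here is a minimal, illustrative sketch of a Transfer-Q*-style decoding step. It is an assumption-laden simplification, not the paper's implementation: it treats the log-probability gap between an aligned model and the base model as a proxy for the optimal value function Q*, and adds that proxy (scaled by a hypothetical weight `alpha`) to the base model's scores when picking the next token.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a logit vector.
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def transfer_q_star_step(base_logits, aligned_logits, alpha=1.0):
    """One token-selection step of a Transfer-Q*-style rule (sketch only).

    Assumption: the aligned model's implicit reward, log p_aligned - log p_base,
    stands in for the optimal value estimate that guides decoding.
    `alpha` is a hypothetical guidance weight, not a parameter from the paper.
    """
    log_p_base = np.log(softmax(base_logits))
    log_p_aligned = np.log(softmax(aligned_logits))
    q_estimate = log_p_aligned - log_p_base   # implicit-reward proxy for Q*
    scores = log_p_base + alpha * q_estimate  # value-guided token scores
    return int(np.argmax(scores))

# Toy vocabulary of 4 tokens: the aligned model strongly prefers token 2,
# so the transferred value estimate steers decoding toward it.
base = np.array([2.0, 1.0, 0.5, 0.0])
aligned = np.array([1.0, 1.0, 3.0, 0.0])
print(transfer_q_star_step(base, aligned, alpha=1.0))  # -> 2
```

With `alpha=1.0` the base log-probability cancels and the rule follows the aligned model; intermediate values interpolate between the two, which is the intuition behind transferring alignment from one model's value estimate to another's decoding.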