Evaluation Setup
Controlled fine-tuning on reasoning tasks followed by broad benchmarking.
Benchmarks:
- HumanEval (coding: Python generation)
- MATH (high-school competition mathematics)
- GSM-Plus (harder variant of GSM8K math reasoning)
- TheoremQA (STEM theorem application)
- LeetCode (interview-level programming)
Metrics:
- Pass@1 accuracy
- Statistical methodology: not explicitly reported in the paper
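Pass@1 is the standard functional-correctness metric for code and reasoning benchmarks. A minimal sketch of the unbiased pass@k estimator commonly used for it (the function name and exact formulation here follow the widely used estimator from the HumanEval literature, not anything specific to this paper):

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: the probability that at least one
    of k samples, drawn without replacement from n generations of
    which c are correct, passes the tests.

    For k = 1 this reduces to the fraction of correct samples, c / n.
    """
    if n - c < k:
        # Every possible draw of k samples contains a correct one.
        return 1.0
    return 1.0 - math.prod((n - c - i) / (n - i) for i in range(k))
```

For example, with 10 generations of which 3 are correct, `pass_at_k(10, 3, 1)` evaluates to 0.3, i.e. plain per-sample accuracy.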
Key Results
GRAPE significantly outperforms standard baselines and stronger teacher models on aggregated benchmarks.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Average across benchmarks | Accuracy gain | Not reported as a single aggregate number | Not reported as a single aggregate number | +13.8% |
| Average across benchmarks | Accuracy gain | Not reported as a single aggregate number | Not reported as a single aggregate number | +17.3% |
| Tulu3 benchmark suite | Average performance | Not reported as an exact number in text | Not reported as an exact number in text | +3.5% |
| General-domain instruction tuning | Average performance | Not reported as an exact number in text | Not reported as an exact number in text | +3.9% |
Main Takeaways
- Alignment with the base model's pre-trained distribution is a critical, often overlooked factor in SFT data selection.
- Data quantity has diminishing returns: selecting 'fitting' data outperforms simply scaling data volume by 3×.
- The 'strongest' teacher (e.g., 405B model) does not necessarily produce the best training data for smaller models; the gap can be significant.
- GRAPE is efficient: it requires only inference (forward passes) to select data, with no complex iterative training loops.
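The last two takeaways point at the core mechanic: scoring candidate responses with forward passes of the base model and keeping the ones it already assigns high likelihood. The paper does not spell out an implementation here, so the following is only a minimal sketch under that assumption, with hypothetical names; the per-response log-likelihoods are presumed to come from a single forward pass of the base model:

```python
from typing import Dict, List, Tuple

def select_by_base_likelihood(
    candidates: Dict[str, List[Tuple[str, float]]],
) -> Dict[str, str]:
    """For each prompt, keep the candidate response whose mean
    per-token log-likelihood under the BASE model is highest.

    `candidates` maps prompt -> list of (response, avg_logprob) pairs,
    where avg_logprob is assumed to be precomputed by running the
    frozen base model forward over the response (inference only,
    no gradient updates or iterative training loops).
    """
    return {
        prompt: max(responses, key=lambda r: r[1])[0]
        for prompt, responses in candidates.items()
    }
```

For instance, given two teacher responses for one prompt with average log-probs -2.0 and -1.0, the function keeps the -1.0 response, i.e. the one the base model finds more natural.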