Evaluation Setup
The model is evaluated on standard LLM benchmarks covering chat, reasoning, and truthfulness.
Benchmarks:
- HuggingFace Open LLM Leaderboard (general capabilities: ARC, HellaSwag, MMLU, TruthfulQA, etc.)
- MT-Bench (Multi-turn conversation quality)
- Big-Bench Hard (BBH) (Challenging reasoning tasks)
Metrics:
- Average score
- Accuracy
- Statistical methodology: Not explicitly reported in the paper
Key Results
Performance on the HuggingFace Open LLM Leaderboard, showing iterative improvement:

| Benchmark | Metric | Baseline | This Paper | Δ |
|-----------|--------|----------|------------|---|
| TruthfulQA | Accuracy | 44.35 | 52.54 | +8.19 |
| GSM8k | Accuracy | 49.81 | 57.54 | +7.73 |

The paper also reports a comparison against DPO training on MT-Bench.
Main Takeaways
- SPIN consistently improves model performance across successive iterations (0 -> 1 -> 2 -> 3), avoiding the plateau seen in standard iterative SFT.
- The method is data-efficient, using only a 50k-example subset of the original SFT data to match or exceed models trained on large external preference datasets.
- SPIN effectively leverages the LLM's own generative capabilities to create a 'stronger' opponent, driving the main model to align closer to the target distribution.
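The self-play dynamic in the takeaways above can be sketched as a DPO-style logistic loss in which the "rejected" response is the previous iteration's own generation and the "chosen" response is the ground-truth SFT answer. The function and argument names below are illustrative, not taken from the paper's codebase:

```python
import math

def spin_loss(logp_new_real, logp_old_real, logp_new_gen, logp_old_gen, beta=0.1):
    """Logistic self-play loss for one (prompt, real response, generated response) pair.

    The main player (the model being updated) is trained to assign higher
    likelihood to the ground-truth SFT response than to the opponent's
    response, i.e. the output sampled from the previous iteration's model.
    """
    # Reward margin: how much more the new model prefers the real response
    # over the self-generated one, relative to the previous-iteration model.
    margin = beta * ((logp_new_real - logp_old_real) -
                     (logp_new_gen - logp_old_gen))
    # -log sigmoid(margin): small when the real response is clearly preferred.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# At each SPIN iteration t: sample a response y' from model_t for every SFT
# prompt, then minimize this loss over (y_real, y') pairs to obtain model_{t+1}.
```

When the new model prefers the real response more strongly than the old model did, the margin is positive and the loss falls below log 2; as iterations proceed, the opponent's generations become harder to distinguish from the SFT data, which is what drives the main model toward the target distribution.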