Evaluation Setup
The work spans three distinct evaluation settings across chapters: semi-supervised classification, efficient fine-tuning, and instruction following.
Benchmarks:
- GLUE / SuperGLUE (NLU Classification)
- AlpacaEval (Open-ended Instruction Following)
- StepGame (Multi-hop Spatial Reasoning) [New]
Metrics:
- Accuracy
- Win Rate (AlpacaEval)
- Inference Latency / Memory Usage
- Statistical methodology: Not explicitly reported in the paper
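As a concrete illustration of the win-rate metric listed above, here is a minimal sketch of how a pairwise win rate (as in AlpacaEval-style evaluation) can be computed. The verdict labels and the tie-counts-as-half convention are assumptions for illustration, not details taken from the paper:

```python
def win_rate(verdicts):
    """Fraction of pairwise comparisons in which the candidate model's
    output was preferred over the baseline's. One common convention,
    assumed here, counts a tie as half a win."""
    wins = sum(1.0 for v in verdicts if v == "candidate")
    ties = sum(0.5 for v in verdicts if v == "tie")
    return (wins + ties) / len(verdicts)

# Hypothetical judge verdicts over 5 prompts.
verdicts = ["candidate", "baseline", "candidate", "tie", "candidate"]
print(win_rate(verdicts))  # 0.7
```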
Key Results
- Instruction Modelling (IM) significantly improves win rates on open-ended generation benchmarks compared to standard training, especially in low-data regimes.
- Decomposed Prompt Tuning (DePT) achieves efficiency gains over vanilla Prompt Tuning.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| AlpacaEval 1.0 | Win Rate | Not reported in the paper | Not reported in the paper | Not reported in the paper |
| Efficiency Metrics (DePT vs. Prompt Tuning) | Relative Memory Cost | 1.0 | 0.8 | -0.2 |
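A rough sketch of where DePT's efficiency gain comes from, assuming the decomposition the paper describes (a shorter soft prompt plus a low-rank update to the frozen input embeddings). All dimension values below are illustrative, not the paper's configuration:

```python
def param_counts(embed_dim, seq_len, prompt_len, short_prompt_len, rank):
    """Compare trainable parameter counts: vanilla Prompt Tuning vs. DePT.
    DePT replaces one long soft prompt with a shorter soft prompt plus a
    low-rank pair (A: seq_len x rank, B: rank x embed_dim) added to the
    frozen input embeddings."""
    vanilla = prompt_len * embed_dim                    # full-length soft prompt
    dept = (short_prompt_len * embed_dim                # shorter soft prompt
            + seq_len * rank + rank * embed_dim)        # low-rank pair A, B
    return vanilla, dept

vanilla, dept = param_counts(embed_dim=768, seq_len=256, prompt_len=100,
                             short_prompt_len=40, rank=30)
print(vanilla, dept)  # 76800 61440
```

Note that the memory/latency saving reported in the table comes mainly from the shorter input sequence (40 prompt tokens instead of 100 are prepended at every step), not just from the parameter count.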
Main Takeaways
- Task-Adaptive Pre-training (TAPT) is often a more robust semi-supervised baseline than complex Self-Training methods.
- Prompt-based Continued Pre-training (PCP) is essential for prompt-based fine-tuning; standard continued pre-training can actually hurt performance in these setups.
- Instruction Modelling (IM) is highly effective for reducing overfitting when training data has long instructions and short outputs (e.g., logic puzzles, classification tasks posed as chat).
- DePT successfully breaks the trade-off between prompt expressivity (length) and computational efficiency.
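The IM takeaway above can be sketched as a change in the loss mask (a minimal illustration, not the authors' implementation): standard instruction tuning computes the language-modelling loss only on response tokens, whereas IM also includes the instruction tokens, which matters most when instructions are long and outputs are short.

```python
def build_loss_mask(instruction_len, response_len, instruction_modelling):
    """Per-token loss mask: 1 = token contributes to the LM loss, 0 = ignored.
    Standard SFT masks out the instruction; IM keeps it in the loss."""
    instr = [1 if instruction_modelling else 0] * instruction_len
    resp = [1] * response_len
    return instr + resp

# Long instruction, short output: the regime where IM helps most.
standard = build_loss_mask(instruction_len=50, response_len=5,
                           instruction_modelling=False)
im = build_loss_mask(instruction_len=50, response_len=5,
                     instruction_modelling=True)
print(sum(standard), sum(im))  # 5 55
```

With standard training, only 5 of 55 tokens supervise the model, which encourages memorising the short outputs; IM spreads the loss over all 55 tokens.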