Evaluation Setup
Pre-fine-tuning on selected data, followed by targeted fine-tuning (or zero-shot evaluation)
Benchmarks:
- GPT-2 Toxicity Reduction (Safety/NLG)
- Domain Adaptation Tasks (NLU; 8 tasks from Gururangan et al. 2020)
- Zero-shot Evaluation (General Capabilities)
Metrics:
- Toxicity Level
- Task Performance (Accuracy/F1)
- Zero-shot Performance
- Statistical methodology: Not explicitly reported in the paper
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| GPT-2 Toxicity | Toxicity Reduction | 0 | 30 | 30 |
| 8 Domain-specific tasks | Average Performance | Not reported in the paper | Not reported in the paper | 1.13 |
| Zero-shot tasks (models up to 2.7B) | Task Performance | Not reported in the paper | Not reported in the paper | 13.9 |
Main Takeaways
- GOT-D consistently outperforms existing selection methods, particularly in low-budget regimes (e.g., 10k-50k samples).
- The method is computationally efficient, scaling to millions of candidate samples in minutes using GPU acceleration.
- Visualizations in the paper show that GOT-D selects samples that are underrepresented in pre-training but important for the target domain, confirming the theoretical intuition.
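The selection principle behind these takeaways can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes candidate and target examples are already embedded as vectors, and uses a small log-domain Sinkhorn solver for entropic optimal transport. The source-side dual (Kantorovich) potential acts as the gradient of the OT distance with respect to a sample's weight, so selecting the samples with the lowest potential moves the data mixture toward the target distribution. All names and the toy data below are illustrative.

```python
import numpy as np

def logsumexp(x, axis):
    """Numerically stable log-sum-exp along an axis."""
    m = np.max(x, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(x - m), axis=axis))

def ot_dual_potential(C, reg=0.05, n_iter=500):
    """Log-domain Sinkhorn iterations for entropic OT with uniform marginals.

    C: (n, m) cost matrix between n candidate and m target points.
    Returns the source-side dual potential f (one value per candidate),
    which is, up to a constant, the sensitivity of the OT distance to
    up-weighting that candidate.
    """
    n, m = C.shape
    log_a = np.full(n, -np.log(n))  # uniform weights on candidates
    log_b = np.full(m, -np.log(m))  # uniform weights on targets
    f = np.zeros(n)
    g = np.zeros(m)
    for _ in range(n_iter):
        f = -reg * logsumexp((g[None, :] - C) / reg + log_b[None, :], axis=1)
        g = -reg * logsumexp((f[:, None] - C) / reg + log_a[:, None], axis=0)
    return f

def select_by_ot_gradient(cand_emb, target_emb, k):
    """Pick the k candidates whose dual potential is lowest, i.e. whose
    up-weighting most decreases the OT distance to the target set."""
    # squared Euclidean cost between every candidate/target embedding pair
    C = ((cand_emb[:, None, :] - target_emb[None, :, :]) ** 2).sum(-1)
    f = ot_dual_potential(C)
    return np.argsort(f)[:k]

# Toy demo with hypothetical 2-D "embeddings": two candidate clusters,
# a target distribution matching the second cluster.
rng = np.random.default_rng(0)
cand = np.vstack([rng.normal(0, 1, (50, 2)),   # indices 0-49: off-target
                  rng.normal(4, 1, (50, 2))])  # indices 50-99: on-target
target = rng.normal(4, 1, (20, 2))
idx = select_by_ot_gradient(cand, target, k=10)
```

In the demo, the selected indices concentrate in the on-target cluster (indices 50-99), mirroring the claim that the method surfaces samples aligned with the target distribution; the full method additionally calibrates the gradient against the pre-training distribution, which this sketch omits.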