Evaluation Setup
Instruction fine-tuning is evaluated on 8 text-editing datasets spanning 6 tasks (Simplification, Coherence, Clarity, Fluency, Grammar Correction, Neutralization).
Benchmarks:
- TurkCorpus (Simplification)
- Asset (Simplification)
- Iterator (Text Improvement: Coherence, Clarity, Fluency, Global)
- JFLEG (Grammar Correction)
- WNC (Neutralization)
Metrics:
- SARI
- ROUGE-L
- Perceived Accuracy (Human Eval)
- Statistical methodology: inter-rater reliability for the human evaluation is reported as Fleiss' kappa (κ = 0.44), i.e. moderate agreement.
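The reported κ = 0.44 can be recomputed from a subjects-by-categories count matrix. A minimal stdlib-only sketch; the function name `fleiss_kappa` and the input layout are illustrative assumptions, not taken from the paper:

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for a ratings matrix.

    ratings[i][j] = number of raters who placed subject i in category j.
    Assumes every subject is rated by the same number of raters.
    """
    n_subjects = len(ratings)
    n_raters = sum(ratings[0])

    # Observed agreement: mean per-subject pairwise agreement P_i.
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in ratings
    ]
    p_bar = sum(p_i) / n_subjects

    # Expected agreement: sum of squared marginal category proportions.
    n_categories = len(ratings[0])
    totals = [sum(row[j] for row in ratings) for j in range(n_categories)]
    p_j = [t / (n_subjects * n_raters) for t in totals]
    p_e = sum(p * p for p in p_j)

    return (p_bar - p_e) / (1 - p_e)
```

For instance, a matrix where all raters always agree yields κ = 1.0, while systematic disagreement drives κ below 0.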
Key Results
Comparison of DEFT-UCS (using 32.5% of the data) against the full CoEDIT model (100% of the data): DEFT-UCS outperforms or matches the baseline.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| TurkCorpus | SARI | 43.7 | 46.6 | +2.9 |
| Asset | SARI | 44.7 | 46.8 | +2.1 |
| Iterator Fluency | SARI | 64.7 | 64.7 | 0.0 |
| Iterator Clarity | SARI | 61.3 | 61.8 | +0.5 |
| WNC | SARI | 80.2 | 79.0 | -1.2 |

Comparison against LIMA-style sampling (1k random diverse samples):

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| TurkCorpus | SARI | 23.8 | 46.6 | +22.8 |
Main Takeaways
- Hard sampling (selecting examples farthest from cluster centroids) is more effective than random or easy sampling when the initial base dataset size is small.
- Sentence-T5 embeddings provide better cluster separation for text-editing tasks compared to BART CLS or Flan-T5 average word embeddings.
- Subjective tasks like Neutralization (WNC) require more data (>80%) to match baseline performance than Simplification tasks (Asset), which need only ~12%.
- DEFT-UCS models generate edits perceived as accurate by humans 83.8% of the time, surpassing CoEDIT's 70.5%.
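The hard-sampling rule from the first takeaway (select the examples farthest from their assigned cluster centroid) can be sketched as follows. This is an illustrative stdlib-only version: `hard_sample`, the flat-list inputs, and plain Euclidean distance are assumptions, not the paper's exact implementation, and the embeddings and cluster assignments are presumed precomputed (e.g. with Sentence-T5 and k-means):

```python
import math

def hard_sample(embeddings, labels, centroids, k):
    """Return indices of the k points farthest from their cluster centroid.

    embeddings[i] -- embedding vector of example i (sequence of floats)
    labels[i]     -- cluster index assigned to example i
    centroids[c]  -- centroid vector of cluster c
    """
    def dist(a, b):
        # Euclidean distance between two equal-length vectors.
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    # Rank all examples by distance to their own centroid, farthest first,
    # and keep the top k ("hard" examples near cluster boundaries).
    ranked = sorted(
        range(len(embeddings)),
        key=lambda i: dist(embeddings[i], centroids[labels[i]]),
        reverse=True,
    )
    return ranked[:k]
```

Easy sampling would be the same ranking with `reverse=False` (points closest to the centroid), and random sampling draws uniformly; the takeaway above is that the farthest-first variant wins at small base-dataset sizes.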