Evaluation Setup
Instruction fine-tuning is evaluated on 8 text-editing datasets spanning 6 tasks (Simplification, Coherence, Clarity, Fluency, Grammar Correction, Neutralization).
Benchmarks:
- TurkCorpus (Simplification)
- Asset (Simplification)
- Iterator (Text Improvement: Coherence, Clarity, Fluency, Global)
- JFLEG (Grammar Correction)
- WNC (Neutralization)
Metrics:
- SARI
- ROUGE-L
- Perceived Accuracy (Human Eval)
- Statistical methodology: inter-rater reliability for the human evaluation is reported as Fleiss' kappa (κ = 0.44), i.e. moderate agreement.
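The reported κ = 0.44 can be recomputed from a subjects-by-categories count matrix. A minimal stdlib-only sketch; the function name `fleiss_kappa` and the input layout are illustrative assumptions, not taken from the paper:

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for a ratings matrix.

    ratings[i][j] = number of raters who placed subject i in category j.
    Assumes every subject is rated by the same number of raters.
    """
    n_subjects = len(ratings)
    n_raters = sum(ratings[0])

    # Observed agreement: mean per-subject pairwise agreement P_i.
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in ratings
    ]
    p_bar = sum(p_i) / n_subjects

    # Expected agreement: sum of squared marginal category proportions.
    n_categories = len(ratings[0])
    totals = [sum(row[j] for row in ratings) for j in range(n_categories)]
    p_j = [t / (n_subjects * n_raters) for t in totals]
    p_e = sum(p * p for p in p_j)

    return (p_bar - p_e) / (1 - p_e)
```

For instance, a matrix where all raters always agree yields κ = 1.0, while systematic disagreement drives κ below 0.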
Key Results
Comparison of DEFT-UCS (using 32.5% of the data) against the full CoEDIT model (100% of the data): DEFT-UCS outperforms or matches the baseline.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| TurkCorpus | SARI | 43.7 | 46.6 | +2.9 |
| Asset | SARI | 44.7 | 46.8 | +2.1 |
| Iterator Fluency | SARI | 64.7 | 64.7 | 0.0 |
| Iterator Clarity | SARI | 61.3 | 61.8 | +0.5 |
| WNC | SARI | 80.2 | 79.0 | -1.2 |

Comparison against LIMA-style sampling (1k random diverse samples):

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| TurkCorpus | SARI | 23.8 | 46.6 | +22.8 |
Main Takeaways
- Hard sampling (selecting examples farthest from cluster centroids) is more effective than random or easy sampling when the initial base dataset size is small.
- Sentence-T5 embeddings provide better cluster separation for text-editing tasks compared to BART CLS or Flan-T5 average word embeddings.
- Subjective tasks like Neutralization (WNC) require more data (>80%) to match baseline performance than Simplification tasks (Asset), which need only ~12%.
- DEFT-UCS models generate edits perceived as accurate by humans 83.8% of the time, surpassing CoEDIT's 70.5%.
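The hard-sampling rule from the first takeaway (select the examples farthest from their assigned cluster centroid) can be sketched as follows. This is an illustrative stdlib-only version: `hard_sample`, the flat-list inputs, and plain Euclidean distance are assumptions, not the paper's exact implementation, and the embeddings and cluster assignments are presumed precomputed (e.g. with Sentence-T5 and k-means):

```python
import math

def hard_sample(embeddings, labels, centroids, k):
    """Return indices of the k points farthest from their cluster centroid.

    embeddings[i] -- embedding vector of example i (sequence of floats)
    labels[i]     -- cluster index assigned to example i
    centroids[c]  -- centroid vector of cluster c
    """
    def dist(a, b):
        # Euclidean distance between two equal-length vectors.
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    # Rank all examples by distance to their own centroid, farthest first,
    # and keep the top k ("hard" examples near cluster boundaries).
    ranked = sorted(
        range(len(embeddings)),
        key=lambda i: dist(embeddings[i], centroids[labels[i]]),
        reverse=True,
    )
    return ranked[:k]
```

Easy sampling would be the same ranking with `reverse=False` (points closest to the centroid), and random sampling draws uniformly; the takeaway above is that the farthest-first variant wins at small base-dataset sizes.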