Evaluation Setup
Multi-task user-specific classification
Benchmarks:
- PEFT-U Benchmark (personalized classification: hate speech, sentiment, humor) [New]
Metrics:
- Average per-user accuracy per task
- Average accuracy across all tasks
- Statistical methodology: Multiple runs with varied random seeds (reported in the methodology, though specific confidence intervals are not given in the text)
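The two accuracy metrics above can be read as a two-level average: accuracy is first computed for each user, then averaged over users within a task, then averaged over tasks. The sketch below illustrates that reading. It assumes macro-averaging at both levels, which the text does not spell out, and the record layout `(task, user, gold_label, predicted_label)` and all function names are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def per_user_accuracy(records):
    """Accuracy for each (task, user) pair.

    records: iterable of (task, user, gold_label, predicted_label) tuples.
    This layout is an assumption made for illustration.
    """
    hits = defaultdict(list)
    for task, user, gold, pred in records:
        hits[(task, user)].append(1.0 if gold == pred else 0.0)
    return {key: mean(vals) for key, vals in hits.items()}


def task_accuracy(records):
    """Average per-user accuracy within each task (each user weighted equally)."""
    by_task = defaultdict(list)
    for (task, _user), acc in per_user_accuracy(records).items():
        by_task[task].append(acc)
    return {task: mean(accs) for task, accs in by_task.items()}


def overall_accuracy(records):
    """Average of the task-level accuracies (each task weighted equally)."""
    return mean(task_accuracy(records).values())
```

Under this reading, a user with few examples counts as much as a user with many, and a small task counts as much as a large one. If the benchmark instead pools examples before averaging, the numbers would differ.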
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| PEFT-U (All 13 Tasks) | Average Accuracy | 59.5 | 64.4 | +4.9 |
| PEFT-U (Task Count) | Number of tasks won | 1 | 12 | +11 |
Main Takeaways
- Personalized fine-tuning methods (Adapters, LoRA) consistently outperform zero-shot and few-shot prompting, confirming that prompt context alone is insufficient for capturing complex user subjectivity.
- Adapters (bottleneck layers) generally outperform LoRA (rank decomposition) in this specific benchmark setting, achieving the highest accuracy on 12/13 tasks.
- Performance varies significantly across tasks (e.g., Subjective Discourse vs. MeasuringHateSpeech), indicating the benchmark presents a multifaceted challenge.
- Parameter efficiency trade-off: While Adapters perform best overall, LoRA can outperform Adapters if the number of trainable parameters is equalized (as shown in the TweetEval ablation).
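The parameter-efficiency point in the last takeaway depends on how trainable parameters are counted for each method. The sketch below uses the standard formulations: a bottleneck adapter adds a down-projection and an up-projection, while LoRA adds a low-rank pair of matrices to an existing weight. The hidden size, bottleneck width, and function names are hypothetical and do not reflect the configurations used in the paper.

```python
def adapter_params(d_model: int, bottleneck: int, bias: bool = True) -> int:
    """Trainable parameters in one bottleneck adapter (down- then up-projection)."""
    weights = 2 * d_model * bottleneck
    biases = (bottleneck + d_model) if bias else 0
    return weights + biases


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters in one LoRA update B @ A (A: rank x d_in, B: d_out x rank)."""
    return rank * (d_in + d_out)


def lora_rank_for_budget(budget: int, d_in: int, d_out: int) -> int:
    """Largest LoRA rank whose parameter count stays within the given budget."""
    return budget // (d_in + d_out)


# Hypothetical sizes, chosen only to make the arithmetic concrete.
D_MODEL = 768
BOTTLENECK = 64

budget = adapter_params(D_MODEL, BOTTLENECK)
matched_rank = lora_rank_for_budget(budget, D_MODEL, D_MODEL)
```

With these illustrative sizes, one adapter has 99,136 trainable parameters, and a single LoRA module on a 768 x 768 weight would need rank 64 to approach that budget. In practice LoRA is often run at much lower ranks, so an uncontrolled comparison may favor the method with more trainable parameters, which is what an equalized ablation is meant to rule out.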