Performance comparison against baseline demonstration selection strategies across different task families.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| GLUE (Avg) | Accuracy | 51.8 | 59.2 | +7.4 |
| Ethos (Avg) | Accuracy | 66.5 | 67.4 | +0.9 |
| TweetEval (Avg) | Accuracy | 60.4 | 67.8 | +7.4 |
| HateSpeech18 | Accuracy | 51.3 | 63.0 | +11.7 |
| Poem Sentiment | Accuracy | 66.0 | 73.2 | +7.2 |
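
The Δ column appears to be the simple difference between the "This Paper" and "Baseline" accuracies. The following minimal sketch recomputes that difference from the values in the table and compares it with the reported Δ. The `rows` list and the `delta` helper are illustrative names, not part of the original work, and rounding to one decimal place is an assumption based on the precision shown in the table.

```python
# Values copied from the table: (benchmark, baseline, this paper, reported delta).
rows = [
    ("GLUE (Avg)", 51.8, 59.2, 7.4),
    ("Ethos (Avg)", 66.5, 67.4, 0.9),
    ("TweetEval (Avg)", 60.4, 67.8, 7.4),
    ("HateSpeech18", 51.3, 63.0, 11.7),
    ("Poem Sentiment", 66.0, 73.2, 7.2),
]


def delta(baseline: float, ours: float) -> float:
    """Accuracy difference, rounded to the table's one-decimal precision (assumed)."""
    return round(ours - baseline, 1)


for name, baseline, ours, reported in rows:
    computed = delta(baseline, ours)
    status = "ok" if computed == reported else "MISMATCH"
    print(f"{name:<16} {baseline:5.1f} -> {ours:5.1f}  delta {computed:+.1f}  ({status})")
```

Rounding before comparing avoids spurious mismatches from floating-point subtraction.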