Evaluation Setup
Counterspeech generation for toxic comments in political Reddit threads, evaluated with both human ratings and automated metrics
Benchmarks:
- Custom Reddit Political Dataset (Counterspeech Generation) [New]
Metrics:
- Human: Relevance
- Human: Adequacy
- Human: Truthfulness
- Human: Artificiality
- Human: Persuasiveness (Civil Re-engagement & Steering)
- Automated: ROUGE (Relevance/Diversity/Personalization)
- Automated: FRES (Readability)
- Automated: Toxicity (Perspective API)
- Statistical methodology: Friedman tests for within-subjects differences; paired Wilcoxon signed-rank tests with Bonferroni correction for pairwise comparisons; Mann-Whitney U tests for between-subjects comparisons.
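The tests named in the statistical methodology are available in SciPy as `scipy.stats.friedmanchisquare`, `scipy.stats.wilcoxon`, and `scipy.stats.mannwhitneyu`. As a dependency-free illustration of two pieces of that pipeline, the sketch below computes a Friedman statistic and applies a Bonferroni adjustment. The function names and the toy ratings are hypothetical and not taken from the paper, and the statistic omits the tie correction that library implementations typically apply.

```python
def average_ranks(values):
    """1-based ascending ranks; tied values share the mean of their ranks."""
    ordered = sorted(values)
    return [ordered.index(v) + (ordered.count(v) + 1) / 2 for v in values]


def friedman_statistic(ratings):
    """Friedman chi-square for ratings[rater][condition], without tie correction."""
    n = len(ratings)       # number of raters (blocks)
    k = len(ratings[0])    # number of conditions compared within each rater
    rank_sums = [0.0] * k
    for row in ratings:
        # Rank the conditions within each rater, then accumulate per condition.
        for condition, rank in enumerate(average_ranks(row)):
            rank_sums[condition] += rank
    sum_sq = sum(s * s for s in rank_sums)
    return 12.0 / (n * k * (k + 1)) * sum_sq - 3.0 * n * (k + 1)


def bonferroni(p_values):
    """Multiply each p-value by the number of comparisons, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Toy data (assumed): 4 raters scoring 3 hypothetical model variants.
toy_ratings = [[1, 2, 3], [1, 2, 3], [1, 2, 3], [1, 2, 3]]
statistic = friedman_statistic(toy_ratings)

# Hypothetical raw p-values from three pairwise Wilcoxon tests.
adjusted = bonferroni([0.01, 0.04, 0.5])
```

The Friedman statistic is compared against a chi-square distribution with k - 1 degrees of freedom; only when that omnibus test is significant do the Bonferroni-adjusted pairwise tests need interpreting.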
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Human evaluation results showing the superiority of contextualized models over the baseline. |
| Reddit Political Dataset | Adequacy (Rank Biserial Correlation vs Baseline) | 0.00 | 0.59 | +0.59 |
| Reddit Political Dataset | Persuasiveness-Steering (Rank Biserial Correlation vs Baseline) | 0.00 | 0.38 | +0.38 |
| Reddit Political Dataset | Relevance (Rank Biserial Correlation vs Baseline) | 0.00 | 0.58 | +0.58 |
| Automated metrics results often contradicted human findings, showing high performance for models humans rated poorly. |
| Reddit Political Dataset | Diversity (1 - ROUGE) | 0.58 | 0.76 | +0.18 |
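The human-evaluation rows report a rank-biserial correlation against the baseline as the effect size. The summary does not say which variant was used; the sketch below assumes the matched-pairs form that accompanies a Wilcoxon signed-rank test, r = (W+ - W-) / (W+ + W-). The function names and toy scores are hypothetical.

```python
def average_ranks(values):
    """1-based ascending ranks; tied values share the mean of their ranks."""
    ordered = sorted(values)
    return [ordered.index(v) + (ordered.count(v) + 1) / 2 for v in values]


def rank_biserial_paired(treatment, baseline):
    """Matched-pairs rank-biserial correlation (assumed variant).

    Returns a value in [-1, 1]; positive values mean the treatment scores
    tend to exceed the paired baseline scores.
    """
    # Zero differences carry no direction and are dropped, as in Wilcoxon's test.
    diffs = [t - b for t, b in zip(treatment, baseline) if t != b]
    if not diffs:
        return 0.0
    ranks = average_ranks([abs(d) for d in diffs])
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return (w_plus - w_minus) / (w_plus + w_minus)


# Toy paired ratings (assumed): the same 5 items rated under two conditions.
contextualized = [4, 5, 3, 5, 2]
generic = [3, 3, 3, 4, 3]
effect = rank_biserial_paired(contextualized, generic)
```

Under this reading, the baseline's own value of 0.00 in the table is the trivial comparison of the baseline with itself, so the Δ column equals the reported effect size.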
Main Takeaways
- Contextualized counterspeech (using user summaries and conversation history) is perceived by humans as significantly more adequate and persuasive than generic baselines.
- Providing the model with a summary of the user's history [Su] is more effective for personalization than feeding raw past comments [Hi].
- There is a distinct lack of correlation between standard automated metrics (ROUGE, Toxicity) and human quality judgments, suggesting current metrics are insufficient for evaluating persuasive counterspeech.
- Combining adaptation (conversation context) and personalization (user context) yields better results than using either strategy in isolation.
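One way to quantify the gap between automated metrics and human judgments noted in the takeaways is a rank correlation between the two sets of scores. The sketch below computes Spearman's rho in plain Python; the choice of Spearman, the function names, and the toy numbers are assumptions for illustration, since the summary does not state how the paper measured this relationship.

```python
from math import sqrt


def average_ranks(values):
    """1-based ascending ranks; tied values share the mean of their ranks."""
    ordered = sorted(values)
    return [ordered.index(v) + (ordered.count(v) + 1) / 2 for v in values]


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in rx))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in ry))
    if sd_x == 0 or sd_y == 0:
        return float("nan")  # undefined when one variable is constant
    return cov / (sd_x * sd_y)


# Toy per-response scores (assumed, not from the paper).
automated_metric = [0.10, 0.20, 0.30, 0.40]
human_rating_agree = [1, 2, 3, 4]      # same ordering as the metric
human_rating_disagree = [4, 3, 2, 1]   # reversed ordering

rho_agree = spearman_rho(automated_metric, human_rating_agree)
rho_disagree = spearman_rho(automated_metric, human_rating_disagree)
```

A rho near zero across real responses would match the reported lack of correlation, while a strongly negative value would indicate that the metric actively favors responses humans rate poorly.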