Evaluation Setup
5-class classification (CT+, CT-, PS+, PS-, Uu) and supporting word extraction
Benchmarks:
- Maven-Fact Test Set (Event Factuality Detection) [New]
Metrics:
- Precision, Recall, F1 (per class)
- Macro-F1
- Statistical methodology: Not explicitly reported in the paper
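The metrics listed above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's evaluation script. The helper names (`per_class_prf`, `macro_f1`), the choice to score an undefined precision or recall as 0.0, and the choice to average over all five labels are assumptions made for this example.

```python
from collections import Counter

# The five factuality labels named in the task description.
LABELS = ["CT+", "CT-", "PS+", "PS-", "Uu"]


def per_class_prf(gold, pred, labels=LABELS):
    """Return {label: (precision, recall, f1)} for single-label predictions.

    Assumption: a metric with a zero denominator is scored as 0.0.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label p, but it was wrong
            fn[g] += 1  # gold label g was missed

    scores = {}
    for label in labels:
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        precision = tp[label] / p_den if p_den else 0.0
        recall = tp[label] / r_den if r_den else 0.0
        pr_sum = precision + recall
        f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
        scores[label] = (precision, recall, f1)
    return scores


def macro_f1(scores):
    """Unweighted mean of the per-class F1 values."""
    return sum(f1 for _, _, f1 in scores.values()) / len(scores)
```

Because Macro-F1 weights every class equally, a class that is rare or never predicted correctly pulls the average down as much as a frequent one does.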
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Maven-Fact | Macro-F1 | 46.1 | 47.6 | +1.5 |
| Maven-Fact | Macro-F1 | 47.6 | 40.2 | -7.4 |
| Maven-Fact | Macro-F1 | 40.2 | 42.8 | +2.6 |
| Maven-Fact | F1 | 24.4 | 27.0 | +2.6 |
Main Takeaways
- Maven-Fact is challenging: The best Macro-F1 is only 47.6%, well below typical results on FactBank, likely because of the dataset's diversity and scale.
- LLMs struggle with fine-grained factuality: GPT-4 trails fine-tuned BERT models by ~5-7 points.
- Arguments and Relations help fine-tuned models: Adding these features improves performance for DMRoBERTa/GenEFD but, counterintuitively, hurts the in-context learning performance of LLMs.
- Supporting evidence is hard to find: Models often predict the correct label but fail to identify the correct supporting words (e.g., 'might').