← Back to Paper List

MAVEN-Fact: A Large-scale Event Factuality Detection Dataset

Chunyang Li, Hao Peng, Xiaozhi Wang, Yunjia Qi, Lei Hou, Bin Xu, Juanzi Li
Tsinghua University, Beijing, China
arXiv (2024)
Factuality Benchmark KG

📝 Paper Summary

Event Understanding Event Factuality Detection (EFD) Hallucination Mitigation
Maven-Fact is a large-scale dataset for event factuality detection constructed via an LLM-then-human pipeline, enabling comprehensive event understanding and analysis of factuality in large language models.
Core Problem
Event Factuality Detection (EFD) is under-explored due to the lack of large-scale, high-quality datasets; existing datasets are small (e.g., FactBank has <10k events) and lack comprehensive annotations for arguments and relations.
Why it matters:
  • Mistaking a mere possibility (e.g., 'might celebrate') for a fact leads to erroneous judgments in downstream applications
  • Current small-scale datasets are insufficient for training robust models or evaluating Large Language Models (LLMs) effectively
  • Lack of supporting evidence annotations prevents models from explaining why an event is classified as non-factual, reducing interpretability
Concrete Example: In the sentence 'They might celebrate the victory,' the event 'celebrate' is a possibility, not a fact. If an application ignores the modal word 'might' and treats 'celebrate' as a fact, it generates false information.
Key Novelty
Maven-Fact: A Large-scale, Comprehensive EFD Dataset
  • Constructs the largest EFD dataset (112,276 events) by extending the MAVEN dataset, integrating factuality labels with existing event types, arguments, and relations
  • Introduces a cost-effective 'LLM-then-human' annotation pipeline where GPT-3.5 pre-filters factual events (CT+) and humans meticulously annotate non-factual/complex cases
  • Includes annotations for 'supporting evidence' (specific words like 'may' or 'not') that justify the factuality label, enabling explainability research
Evaluation Highlights
  • Fine-tuned GenEFD achieves 45.1% Macro-F1, significantly outperforming GPT-4 (40.2% Macro-F1) on the 5-class factuality task
  • Chain-of-Thought (CoT) prompting improves GPT-4's performance to 42.8% Macro-F1, but it still lags behind fine-tuned baselines
  • Supporting word prediction is challenging: even the best model (LLAMA 3 + CoT) achieves only 27.0% F1, indicating models struggle to identify evidence even when classifying factuality correctly
Breakthrough Assessment
8/10
Significantly scales up EFD data availability (10x larger than FactBank) and integrates it with structural event knowledge. The LLM-then-human construction methodology is practical and effective.
×