MAVEN-Fact: A Large-scale Event Factuality Detection Dataset

📝 Paper Summary

Event Understanding · Event Factuality Detection (EFD) · Hallucination Mitigation

MAVEN-Fact is a large-scale dataset for event factuality detection, constructed via an LLM-then-human pipeline, that supports comprehensive event understanding and analysis of factuality in large language models.

Core Problem

Event Factuality Detection (EFD) is under-explored due to the lack of large-scale, high-quality datasets; existing datasets are small (e.g., FactBank has <10k events) and lack comprehensive annotations for arguments and relations.

Why it matters:

Mistaking a mere possibility (e.g., 'might celebrate') for a fact leads to erroneous judgments in downstream applications
Current small-scale datasets are insufficient for training robust models or evaluating Large Language Models (LLMs) effectively
Lack of supporting evidence annotations prevents models from explaining why an event is classified as non-factual, reducing interpretability

Concrete Example: In the sentence 'They might celebrate the victory,' the event 'celebrate' is a possibility, not a fact. If an application ignores the modal word 'might' and treats 'celebrate' as a fact, it generates false information.

Key Novelty

MAVEN-Fact: A Large-scale, Comprehensive EFD Dataset

Constructs the largest EFD dataset (112,276 events) by extending the MAVEN dataset, integrating factuality labels with existing event types, arguments, and relations
Introduces a cost-effective 'LLM-then-human' annotation pipeline where GPT-3.5 pre-filters factual events (CT+) and humans meticulously annotate non-factual/complex cases
Includes annotations for 'supporting evidence' (specific words like 'may' or 'not') that justify the factuality label, enabling explainability research

Architecture

The Chain-of-Thought (CoT) prompt used for the LLM-based pre-annotation step.

Evaluation Highlights

Fine-tuned GenEFD achieves 45.1% Macro-F1, significantly outperforming GPT-4 (40.2% Macro-F1) on the 5-class factuality task
Chain-of-Thought (CoT) prompting improves GPT-4's performance to 42.8% Macro-F1, but it still lags behind fine-tuned baselines
Supporting word prediction is challenging: even the best model (LLAMA 3 + CoT) achieves only 27.0% F1, indicating models struggle to identify evidence even when classifying factuality correctly

Breakthrough Assessment

8/10

Significantly scales up EFD data availability (10x larger than FactBank) and integrates it with structural event knowledge. The LLM-then-human construction methodology is practical and effective.

⚙️ Technical Details

Problem Definition

Setting: Multi-class classification of event factuality and extraction of supporting evidence spans

Inputs: A document containing marked event triggers

Outputs: A factuality label from {CT+, CT-, PS+, PS-, Uu} and supporting words (if non-factual)
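
As a minimal sketch of this input/output contract, the Python below models one annotated event using the example sentence from the Core Problem section. The class and field names (`Factuality`, `EventInstance`, `trigger`, `supporting_words`) are illustrative assumptions and not the schema of the released dataset.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Factuality(Enum):
    """The five factuality labels named in the task definition."""
    CT_PLUS = "CT+"
    CT_MINUS = "CT-"
    PS_PLUS = "PS+"
    PS_MINUS = "PS-"
    UU = "Uu"


@dataclass
class EventInstance:
    """One marked event trigger with its label (hypothetical schema)."""
    sentence: str
    trigger: str
    label: Factuality
    # Supporting words are only expected for non-CT+ events.
    supporting_words: List[str] = field(default_factory=list)

    def is_factual(self) -> bool:
        """True only for events labeled as certainly having happened."""
        return self.label is Factuality.CT_PLUS


example = EventInstance(
    sentence="They might celebrate the victory.",
    trigger="celebrate",
    label=Factuality.PS_PLUS,
    supporting_words=["might"],
)
```

Keeping the label as an enum rather than a free string makes it harder to silently introduce a sixth class through a typo.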

Pipeline Flow

LLM Pre-annotation (GPT-3.5 binary classification)
Human Annotation (verify/refine labels & extract evidence)
Model Training/Evaluation (Fine-tuning & In-context Learning)
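
The routing idea behind the first two steps can be sketched as follows. This is an assumption-laden illustration: `keyword_pre_annotator` is a trivial keyword heuristic standing in for the GPT-3.5 call, and the cue list and function names are hypothetical. Only the control flow (likely-CT+ events are set aside, everything else goes to human annotators) reflects the pipeline described above.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical modal/negation cues used only by the stand-in classifier below.
CUES = {"might", "may", "could", "not", "never", "if"}


def keyword_pre_annotator(sentence: str) -> bool:
    """Stand-in for the LLM step: returns True when the event looks like CT+."""
    tokens = {tok.strip(".,!?'\"").lower() for tok in sentence.split()}
    return tokens.isdisjoint(CUES)


def route(
    sentences: Iterable[str],
    pre_annotate: Callable[[str], bool] = keyword_pre_annotator,
) -> Tuple[List[str], List[str]]:
    """Split a batch into likely-CT+ items and items queued for human annotation."""
    likely_ct_plus: List[str] = []
    human_queue: List[str] = []
    for sentence in sentences:
        if pre_annotate(sentence):
            likely_ct_plus.append(sentence)
        else:
            human_queue.append(sentence)
    return likely_ct_plus, human_queue


batch = [
    "They celebrated the victory.",
    "They might celebrate the victory.",
    "They did not celebrate.",
]
auto, queue = route(batch)
```

Passing the pre-annotator as a parameter keeps the routing logic independent of whichever model performs the binary decision.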

System Modules

LLM Pre-annotator (Data Construction)

Filter out clear factual events (CT+) to save human effort

Model or implementation: GPT-3.5

Human Annotator (Data Construction)

Assign fine-grained 5-class labels and extract supporting words for non-CT+ candidates

Model or implementation: Human (Commercial Annotation Team)

Novel Architectural Elements

LLM-then-human annotation workflow designed for highly imbalanced tasks (here, more than 80% of events are factual, i.e., CT+)

Modeling

Base Models: BERT, RoBERTa, GenEFD (FLAN-T5 based), Mistral 7B, LLAMA 3 (8B), GPT-3.5, and GPT-4 were evaluated

Training Method: Supervised Fine-Tuning (for BERT/RoBERTa/GenEFD) and In-Context Learning (for LLMs)

Objective Functions:

Purpose: Classification.
Formally: Cross-entropy loss for encoder-only models.

Purpose: Generation.
Formally: Language modeling loss (seq2seq) for GenEFD.
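
In standard textbook notation (not necessarily the exact formulation used in the paper), the two objectives can be written as below. The symbols are assumptions introduced here: x_i is the input text with its marked trigger, y_i is the gold label (classification) or the target token sequence of length T_i (generation), N is the number of training events, and θ denotes the model parameters.

```latex
\mathcal{L}_{\text{CE}}(\theta)
  = -\sum_{i=1}^{N} \log p_\theta\left(y_i \mid x_i\right)
\qquad
\mathcal{L}_{\text{LM}}(\theta)
  = -\sum_{i=1}^{N} \sum_{t=1}^{T_i} \log p_\theta\left(y_{i,t} \mid y_{i,<t},\, x_i\right)
```

The second loss reduces to the first when every target sequence is a single label token, which is one way to see why a generative model can be trained on the same classification data.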

Training Data:

Training set: 16,950 CT+ and 56,989 non-CT+ (from LLM pre-filtering + human verification)
Validation/Test sets: Fully human-annotated

Key Hyperparameters:

shots: 5-shot for In-Context Learning
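
A 5-shot prompt for in-context learning could be assembled roughly as follows. The template wording, the `build_prompt` helper, and the five demonstrations are all invented for illustration; the prompt actually used in the paper is not reproduced in this summary.

```python
from typing import List, Tuple

LABELS = ["CT+", "CT-", "PS+", "PS-", "Uu"]

# (sentence, trigger, label) demonstrations, invented for this sketch.
DEMOS: List[Tuple[str, str, str]] = [
    ("The team won the final.", "won", "CT+"),
    ("The team did not win the final.", "win", "CT-"),
    ("The team may win the final.", "win", "PS+"),
    ("The team may not win the final.", "win", "PS-"),
    ("Fans asked whether the team won.", "won", "Uu"),
]


def build_prompt(
    demos: List[Tuple[str, str, str]],
    sentence: str,
    trigger: str,
    shots: int = 5,
) -> str:
    """Concatenate an instruction, `shots` solved examples, and one open query."""
    if len(demos) < shots:
        raise ValueError(f"need at least {shots} demonstrations, got {len(demos)}")
    header = (
        "Classify the factuality of the marked event as one of: "
        + ", ".join(LABELS)
        + "."
    )
    blocks = [header]
    for demo_sentence, demo_trigger, demo_label in demos[:shots]:
        blocks.append(
            f"Sentence: {demo_sentence}\nEvent: {demo_trigger}\nLabel: {demo_label}"
        )
    # The final block leaves the label empty for the model to complete.
    blocks.append(f"Sentence: {sentence}\nEvent: {trigger}\nLabel:")
    return "\n\n".join(blocks)


prompt = build_prompt(DEMOS, "They might celebrate the victory.", "celebrate")
```

Covering each of the five labels once in the demonstrations is a design choice of this sketch, not a documented property of the paper's setup.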

Compute: Not reported in the paper

Comparison to Prior Work

vs. FactBank: MAVEN-Fact is ~10x larger and includes supporting evidence and linked argument/relation annotations
vs. Standard Annotations: Uses LLM-filtering to reduce cost by 15% while maintaining high recall [not cited in paper]

Limitations

Macro F1 scores are generally low (<50%), indicating the task is very difficult for current models
LLMs perform significantly worse than fine-tuned smaller models on this task
Supporting word prediction performance is very low, limiting interpretability

Reproducibility

Code: https://github.com/THU-KEG/MAVEN-FACT

Dataset and code are publicly available at https://github.com/THU-KEG/MAVEN-FACT. The prompt used for annotation is detailed in Figure 3. The LLM families (GPT-3.5 and GPT-4) are named, but exact API versions and access dates are not.

📊 Experiments & Results

Evaluation Setup

5-class classification (CT+, CT-, PS+, PS-, Uu) and supporting word extraction

Benchmarks:

MAVEN-Fact Test Set (Event Factuality Detection) [New]

Metrics:

Precision, Recall, F1 (per class)
Macro-F1
Statistical methodology: Not explicitly reported in the paper
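
To make the metrics concrete, here is a small pure-Python sketch of per-class precision/recall/F1 and their unweighted average (Macro-F1). Scoring a class with no predictions or no gold instances as 0.0 is an assumption of this sketch; the paper's evaluation script may handle that case differently.

```python
from typing import Dict, Sequence

LABELS = ["CT+", "CT-", "PS+", "PS-", "Uu"]


def per_class_prf(gold: Sequence[str], pred: Sequence[str], label: str) -> Dict[str, float]:
    """Precision, recall, and F1 for a single label, treated one-vs-rest."""
    tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
    fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def macro_f1(gold: Sequence[str], pred: Sequence[str], labels: Sequence[str] = LABELS) -> float:
    """Unweighted mean of per-class F1, so rare classes count as much as CT+."""
    return sum(per_class_prf(gold, pred, lab)["f1"] for lab in labels) / len(labels)


# A degenerate classifier that always answers CT+ on an imbalanced toy sample.
gold = ["CT+", "CT+", "CT+", "CT+", "PS+", "CT-"]
pred = ["CT+"] * len(gold)
score = macro_f1(gold, pred)
```

On this toy sample the always-CT+ predictor is correct on four of six events, yet only the CT+ class receives a non-zero F1 (0.8), so the macro average over five classes is 0.16. This illustrates why Macro-F1 is a demanding metric when most events are factual.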

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
MAVEN-Fact	Macro-F1	46.1	47.6	+1.5
MAVEN-Fact	Macro-F1	47.6	40.2	-7.4
MAVEN-Fact	Macro-F1	40.2	42.8	+2.6
MAVEN-Fact	F1	24.4	27.0	+2.6

Main Takeaways

MAVEN-Fact is challenging: The best Macro-F1 is only 47.6%, much lower than typical results on FactBank, likely due to data diversity and scale.
LLMs struggle with fine-grained factuality: GPT-4 trails fine-tuned BERT models by ~5-7 points.
Arguments and relations help fine-tuned models: Adding these features improves performance for DMRoBERTa/GenEFD but, counterintuitively, hurts the in-context learning performance of LLMs.
Supporting evidence is hard to find: Models often predict the correct label but fail to identify the correct supporting words (e.g., 'might').
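
One plausible way to score supporting-word prediction is a set-overlap F1 between predicted and gold words, sketched below. The matching scheme (case-insensitive exact word match, with a score of 1.0 when both sets are empty) is an assumption of this sketch; the exact protocol used in the paper is not stated in this summary.

```python
from typing import Iterable


def supporting_word_f1(gold: Iterable[str], pred: Iterable[str]) -> float:
    """F1 over the sets of gold and predicted supporting words (assumed metric)."""
    gold_set = {word.lower() for word in gold}
    pred_set = {word.lower() for word in pred}
    if not gold_set and not pred_set:
        # Nothing to find and nothing predicted: counted as a perfect match here.
        return 1.0
    overlap = len(gold_set & pred_set)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_set)
    recall = overlap / len(gold_set)
    return 2 * precision * recall / (precision + recall)


# The model finds 'might' but also returns the trigger itself as evidence.
partial = supporting_word_f1(["might"], ["might", "celebrate"])
# The model labels the event correctly but points at the wrong word.
miss = supporting_word_f1(["might"], ["victory"])
```

In the first case precision is 0.5 and recall is 1.0, giving an F1 of about 0.67; in the second the score is 0.0 even if the factuality label was right, which mirrors the gap between label accuracy and evidence identification noted above.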

📚 Prerequisite Knowledge

Prerequisites

Event Extraction (specifically definitions of triggers and arguments)
Classification metrics (Precision, Recall, F1)

Key Terms

EFD: Event Factuality Detection, the task of determining the factuality status (fact, possibility, etc.) of an event mentioned in text

CT+: Certain Positive (the event certainly happened)

CT-: Certain Negative (the event certainly did not happen)

PS+: Possible Positive (the event possibly happened)

PS-: Possible Negative (the event possibly did not happen)

Uu: Underspecified/Unknown (factuality cannot be determined or is conditional)

CoT: Chain-of-Thought, a prompting technique in which the model generates intermediate reasoning steps

supporting evidence: Words in the text that explicitly convey the modality or polarity of the event (e.g., 'may', 'did not')