Evaluation Setup
Task: QA over narrative documents requiring temporal and causal reasoning
Benchmarks:
- ChronoQA (Temporal/Causal Narrative QA) [New]
Metrics:
- Accuracy
- Temporal Consistency Score
- Causal Consistency Score
Statistical methodology: Not explicitly reported in the paper
Key Results
E2RAG outperforms baselines on the newly constructed ChronoQA benchmark, particularly on questions requiring temporal reasoning.

| Benchmark | Metric | Baseline | This Paper | Δ |
| --- | --- | --- | --- | --- |
| ChronoQA | Accuracy | Not reported in the paper | Not reported in the paper | Positive (qualitative) |
Main Takeaways
- Standard KG-RAG fails on narratives because its knowledge graph collapses the timeline; E2RAG's event graph preserves temporal order.
- Unstructured RAG lacks the mechanism to reason about 'before' and 'after' relationships in complex stories.
- Separating entities and events into dual graphs allows for more precise retrieval of character states at specific time points.
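
The last takeaway can be illustrated with a toy example. This is a minimal sketch and not E2RAG's actual implementation: the `Event` and `DualGraph` classes, the integer timeline, the `state_changes` field, and the sample story are all assumptions made for illustration. It shows how keeping events as time-ordered nodes, separate from the entities they affect, makes it possible to answer "what was this character's state at time t?" and "did A happen before B?".

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    """Hypothetical event node: what happened, when, and which entities changed."""
    event_id: str
    time: int  # position on the story timeline (assumed to be an integer)
    description: str
    # entity name -> attribute updates caused by this event
    state_changes: dict[str, dict[str, str]] = field(default_factory=dict)


@dataclass
class DualGraph:
    """Toy store with entities on one side and time-ordered events on the other."""
    entities: set[str] = field(default_factory=set)
    events: list[Event] = field(default_factory=list)

    def add_event(self, event: Event) -> None:
        # Register every entity touched by the event, then keep the timeline sorted.
        self.entities.update(event.state_changes)
        self.events.append(event)
        self.events.sort(key=lambda e: e.time)

    def state_at(self, entity: str, time: int) -> dict[str, str]:
        """Replay events up to and including `time`; return the accumulated state."""
        state: dict[str, str] = {}
        for event in self.events:
            if event.time > time:
                break
            state.update(event.state_changes.get(entity, {}))
        return state

    def happened_before(self, first_id: str, second_id: str) -> bool:
        """Answer a 'before/after' question by comparing event times."""
        times = {e.event_id: e.time for e in self.events}
        return times[first_id] < times[second_id]


# Illustrative story, invented for this sketch.
graph = DualGraph()
graph.add_event(Event("e1", 1, "Anna moves to Paris",
                      {"Anna": {"location": "Paris"}}))
graph.add_event(Event("e2", 5, "Anna is arrested",
                      {"Anna": {"status": "imprisoned"}}))
graph.add_event(Event("e3", 9, "Anna escapes to Rome",
                      {"Anna": {"location": "Rome", "status": "free"}}))

print(graph.state_at("Anna", 6))
print(graph.happened_before("e1", "e3"))
```

A flat knowledge graph that stored only the latest value of each attribute would report Anna as free in Rome regardless of the time asked about; replaying the event timeline recovers her earlier state instead.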