Evaluation Setup
Question answering on the QuALITY dataset (reading comprehension), evaluated closed-book, i.e., without access to the source documents at test time
Benchmarks:
- QuALITY (Multiple-choice Question Answering)
Metrics:
- QA Accuracy
- Statistical methodology: Not explicitly reported in the paper
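As a concrete illustration of the QA accuracy metric, a minimal sketch of scoring multiple-choice predictions against gold answers (the function and variable names are hypothetical, not from the paper):

```python
def qa_accuracy(predictions, gold_answers):
    """Fraction of questions where the predicted choice matches the gold choice.

    predictions, gold_answers: parallel lists of answer indices
    (e.g. 0-3 for QuALITY's four-option multiple-choice format).
    """
    if len(predictions) != len(gold_answers):
        raise ValueError("prediction/gold length mismatch")
    correct = sum(p == g for p, g in zip(predictions, gold_answers))
    return correct / len(gold_answers)

# Example: 3 of 4 predictions match the gold answers.
print(qa_accuracy([1, 0, 3, 2], [1, 0, 3, 1]))  # -> 0.75
```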
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| QuALITY processing | Token count | 1.3M | 455M | +453.7M |
| QuALITY | Relative recovery of RAG performance (%) | 100 | 80 | -20 |
Main Takeaways
- Simple paraphrasing saturates quickly; adding more paraphrased tokens yields diminishing returns compared to the log-linear scaling of EntiGraph.
- The knowledge acquired via Synthetic CPT is complementary to RAG; combining EntiGraph-trained models with RAG yields better performance than RAG with a base model.
- The method effectively converts compute (generation of synthetic data) into data efficiency (learning from small corpora).
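The log-linear scaling mentioned above (accuracy growing roughly linearly in the logarithm of the synthetic token count) can be sketched by an ordinary least-squares fit of accuracy against log10(tokens). The data points below are illustrative placeholders, not numbers from the paper:

```python
import math

# Hypothetical (synthetic-token-count, accuracy) points, illustrative only.
points = [(1e6, 0.40), (1e7, 0.45), (1e8, 0.50), (4.55e8, 0.53)]

# Ordinary least-squares fit of: accuracy = a + b * log10(tokens).
xs = [math.log10(tokens) for tokens, _ in points]
ys = [acc for _, acc in points]
n = len(points)
mean_x, mean_y = sum(xs) / n, sum(ys) / n
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
    / sum((x - mean_x) ** 2 for x in xs)
a = mean_y - b * mean_x

def predicted_accuracy(tokens):
    """Extrapolate accuracy under the fitted log-linear trend."""
    return a + b * math.log10(tokens)
```

Under such a fit, each additional order of magnitude of synthetic tokens buys a roughly constant accuracy increment (the slope `b`), which is what distinguishes EntiGraph's scaling from the quick saturation of simple paraphrasing.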