Evaluation Setup
Synthetic question generation from two corpora (Medical and General Knowledge)
Benchmarks:
- CORD-19: domain-specific Q&A generation (medical)
- Wikipedia (NQ subset): general-knowledge Q&A generation
Metrics:
- Lexical diversity
- Syntactic diversity
- Semantic diversity
Statistical methodology: Not explicitly reported in the paper
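To make the lexical diversity metric listed above concrete, here is a minimal sketch of one common proxy, distinct-n (the share of n-grams in a question set that are unique). This is an illustrative assumption, not necessarily the measure used in the paper; the function name `distinct_n` and the whitespace tokenization are hypothetical choices.

```python
def distinct_n(questions, n=2):
    """Return the ratio of unique n-grams to total n-grams over a question set.

    Illustrative proxy for lexical diversity (an assumption, not the paper's
    stated metric). Values near 1.0 mean little repeated phrasing.
    """
    total = 0
    unique = set()
    for question in questions:
        # Naive lowercase whitespace tokenization, chosen for simplicity.
        tokens = question.lower().split()
        ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(ngrams)
        unique.update(ngrams)
    # An empty set, or one with no n-grams, has no measurable diversity.
    return len(unique) / total if total else 0.0


repetitive = ["what is covid", "what is covid"]
varied = ["what is covid", "how do vaccines work"]
print(distinct_n(repetitive), distinct_n(varied))
```

A set of generated questions that reuses the same phrasings scores lower than one with varied wording, which is the kind of contrast a lexical diversity comparison between generators relies on.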
Main Takeaways
- DataMorgana produces questions with higher lexical, syntactic, and semantic diversity than the Vanilla, Know Your RAG, and DeepEval baselines.
- The tool successfully adapts to domain-specific requirements (e.g., defining 'Patients' vs 'Doctors' for CORD-19) via simple configuration changes.
- Manual annotation confirms high fidelity (relevance and correctness) of individual questions, indicating that the increased diversity does not come at the cost of quality.
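The domain adaptation described in the second takeaway can be pictured as a small configuration of user categories that is sampled once per generated question. The sketch below is hypothetical: the field names (`name`, `description`, `probability`), the category descriptions, the equal weights, and the helper `sample_category` are assumptions for illustration and do not reproduce DataMorgana's actual configuration schema.

```python
import random

# Hypothetical user categorization for a CORD-19 style medical corpus.
# Field names, descriptions, and weights are illustrative assumptions.
user_categorization = {
    "name": "user_role",
    "categories": [
        {
            "name": "patient",
            "description": "A layperson asking about symptoms or treatment.",
            "probability": 0.5,
        },
        {
            "name": "doctor",
            "description": "A clinician asking precise, technical questions.",
            "probability": 0.5,
        },
    ],
}


def sample_category(categorization, rng):
    """Pick one category according to its configured probability."""
    categories = categorization["categories"]
    weights = [c["probability"] for c in categories]
    return rng.choices(categories, weights=weights, k=1)[0]


rng = random.Random(0)  # fixed seed so repeated runs are reproducible
chosen = sample_category(user_categorization, rng)
print(chosen["name"], "-", chosen["description"])
```

Under this assumed design, switching to another domain only means editing the category list; the sampling logic stays unchanged.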