Evaluation Setup
Scholar recommendation across five task families (Top-k, Field, Epoch, Seniority, Twin), with recommended names verified against APS (American Physical Society) physics data
Benchmarks:
- LLMScholarBench (Expert Recommendation / Information Retrieval) [New]
Metrics:
- Factual Accuracy (proportion of real scholars)
- Refusal Rate (rate of declining to answer)
- Diversity (entropy over demographic categories)
- Parity (alignment with population demographics)
- Validity (production of parseable lists)
- Statistical methodology: Not explicitly reported in the paper
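The metric descriptions above are brief, so a minimal sketch of how four of them could be computed may help. This is not the benchmark's implementation. The function names, the use of None to mark a refusal, Shannon entropy in bits for Diversity, and "1 minus total variation distance" for Parity are all assumptions made for illustration; the paper's exact formulas may differ. Validity is left out because it depends on the expected output format.

```python
import math
from collections import Counter


def factual_accuracy(recommended, known_scholars):
    """Share of recommended names that appear in a reference set of real scholars."""
    if not recommended:
        return 0.0
    return sum(name in known_scholars for name in recommended) / len(recommended)


def refusal_rate(responses):
    """Share of responses that are refusals (assumption: None marks a refusal)."""
    if not responses:
        return 0.0
    return sum(r is None for r in responses) / len(responses)


def diversity(groups):
    """Shannon entropy in bits over the group labels of a recommended list."""
    counts = Counter(groups)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def parity(groups, population_shares):
    """Assumed parity score: 1 minus the total variation distance between
    the group shares in the list and the shares in the reference population."""
    counts = Counter(groups)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    labels = set(counts) | set(population_shares)
    tvd = 0.5 * sum(
        abs(counts.get(g, 0) / total - population_shares.get(g, 0.0))
        for g in labels
    )
    return 1.0 - tvd
```

Under these assumptions, a list drawn from a single group scores 0 on diversity, and a list whose group shares match the population exactly scores 1 on parity. The two can therefore disagree when the population itself is skewed.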
Main Takeaways
- Interventions redistribute error rather than reduce it: improving social representation often degrades technical quality (factuality, validity)
- Higher temperature increases diversity but substantially degrades validity, consistency, and factuality (hallucinations increase)
- Representation-constrained prompting (explicitly asking for diversity) succeeds in diversifying lists but at the expense of factual accuracy
- Retrieval-augmented generation (RAG) via web search primarily improves technical quality (factuality) but reduces diversity and parity, reinforcing the visibility of already prominent scholars
- Reasoning models and standard models react differently to constraints, but no single configuration optimizes all dimensions simultaneously