Whose Name Comes Up? Benchmarking and Intervention-Based Auditing of LLM-Based Scholar Recommendation

📝 Paper Summary

LLM Auditing, Recommender Systems, Fairness and Bias

LLMScholarBench benchmarks 22 LLMs to demonstrate that user interventions like RAG and constrained prompting merely redistribute errors between factual accuracy and social diversity rather than solving them.

Core Problem

Existing audits evaluate LLM scholar recommendations in isolation, ignoring how common user interventions (temperature, prompting constraints, RAG) radically alter model behavior and failure modes.

Why it matters:

Static audits fail to predict performance in deployed systems where users actively steer models
Biased recommendations reinforce the 'Matthew effect,' making qualified scholars from underrepresented groups less visible
Users need to know whether 'fixing' diversity via prompts inadvertently degrades factuality by inducing hallucinations

Concrete Example: A user prompts for 'top physics experts' and adds a constraint for 'diverse candidates.' The audit reveals this intervention often causes the model to hallucinate non-existent scholars to satisfy the diversity requirement, improving representation metrics at the cost of factuality.

Key Novelty

Intervention-Based Auditing Framework

Systematically evaluates not just base models, but the interaction between models and inference-time user interventions (temperature, RAG, prompting constraints)
Separates evaluation metrics into 'Technical Quality' (validity, factuality) and 'Social Representation' (diversity, parity) to explicitly measure trade-offs between them

Architecture

The LLMScholarBench auditing framework, illustrating the flow from infrastructure choices and user interventions to task execution and dual-axis evaluation.

Breakthrough Assessment

8/10

Significant shift from static model auditing to dynamic intervention auditing. Highlights critical trade-offs (diversity vs. factuality) often missed in standard benchmarks.

⚙️ Technical Details

Problem Definition

Setting: Generative scholar recommendation given a natural language prompt and optional inference constraints

Inputs: Prompt P specifying task (e.g., top-k, field), constraints C (e.g., diversity), and temperature t

Outputs: List of recommended scholar names L

Pipeline Flow

Configuration: Select Model + Task + Intervention (Temp/Prompt/RAG)
Inference: Generate scholar list via LLM
Parsing: Extract names and validate format
Evaluation: Verify against APS ground truth
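The four stages above can be sketched end to end as follows. This is a minimal illustration rather than the benchmark's actual code: `AuditConfig`, `parse_names`, `run_audit`, and the stubbed `generate` callable are all hypothetical names, and the ground truth is reduced to a set of lowercased names.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Set


@dataclass(frozen=True)
class AuditConfig:
    """One audited configuration: model + task + intervention (all hypothetical fields)."""
    model: str
    task: str
    temperature: float = 0.0
    constraint: Optional[str] = None  # e.g. "diverse candidates"
    use_rag: bool = False


def parse_names(raw: str) -> Optional[List[str]]:
    """Parsing stage: one name per non-empty line; None if nothing is parseable."""
    names = [line.strip(" -*\t") for line in raw.splitlines()]
    names = [n for n in names if n]
    return names or None


def run_audit(
    config: AuditConfig,
    generate: Callable[[str, AuditConfig], str],
    ground_truth: Set[str],
) -> Dict[str, object]:
    """Run configuration -> inference -> parsing -> evaluation for one config."""
    prompt = f"Task: {config.task}."
    if config.constraint:
        prompt += f" Constraint: {config.constraint}."
    raw = generate(prompt, config)          # inference (stubbed by the caller)
    names = parse_names(raw)                # parsing
    if names is None:
        return {"valid": False, "factual_accuracy": None}
    found = sum(1 for n in names if n.lower() in ground_truth)
    return {"valid": True, "factual_accuracy": found / len(names)}


# Usage with a stub in place of a real LLM call; the names are placeholders.
config = AuditConfig(model="demo-model", task="top-k", constraint="diverse candidates")
result = run_audit(
    config,
    generate=lambda prompt, cfg: "- Ada Example\n- Bob Placeholder",
    ground_truth={"ada example"},
)
print(result)  # {'valid': True, 'factual_accuracy': 0.5}
```

Passing the model call in as a callable keeps the sketch independent of any particular provider API.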

System Modules

Prompt Generator

Constructs zero-shot prompts with step-by-step instructions and specific task constraints

Model or implementation: N/A (Template-based)
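A template-based generator of this kind might look like the sketch below. The template wording and the `build_prompt` helper are assumptions made for illustration; the summary does not reproduce the paper's actual prompt text.

```python
from typing import Optional

# Hypothetical zero-shot template with step-by-step instructions.
TEMPLATE = (
    "You are asked to recommend scholars.\n"
    "Step 1: Identify scholars who match the task: {task}.\n"
    "Step 2: Keep only scholars who satisfy: {constraint}.\n"
    "Step 3: Return one full name per line, with no extra commentary."
)


def build_prompt(task: str, constraint: Optional[str] = None) -> str:
    """Fill the template; an absent constraint is stated explicitly."""
    return TEMPLATE.format(
        task=task,
        constraint=constraint or "no additional constraint",
    )


print(build_prompt("top 5 physics experts", "diverse candidates"))
```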

Model Inference

Generates recommendations based on the input prompt and temperature settings

Model or implementation: 22 LLMs (Gemini, Llama, GPT, etc.)

Evaluator

Parses output and computes metrics against ground truth data

Model or implementation: Rule-based matching
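One plausible form of rule-based matching is to normalize names before comparing them with the ground truth. The normalization rules below (accent stripping, punctuation removal, case folding) and the helper names are assumptions; the summary does not specify the paper's matching rules.

```python
import unicodedata
from typing import Iterable, List, Tuple


def normalize_name(name: str) -> str:
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in stripped)
    return " ".join(cleaned.lower().split())


def match_names(
    recommended: Iterable[str], ground_truth: Iterable[str]
) -> Tuple[List[str], List[str]]:
    """Split recommended names into (matched, unmatched) against the ground truth."""
    index = {normalize_name(n) for n in ground_truth}
    matched: List[str] = []
    unmatched: List[str] = []
    for name in recommended:
        (matched if normalize_name(name) in index else unmatched).append(name)
    return matched, unmatched


# Placeholder names, not real records.
matched, unmatched = match_names(
    ["José Álvarez-Ruiz", "Jane Q. Nonexistent"],
    ["JOSE ALVAREZ RUIZ"],
)
print(matched, unmatched)
```

Exact matching after normalization is deliberately strict: it cannot resolve initials or reordered name parts, which a real evaluator would likely need to handle.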

Novel Architectural Elements

Integration of end-user inference interventions (Temperature, RAG, Prompt Constraints) directly into the benchmarking loop as independent variables
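Treating interventions as independent variables amounts to crossing them into a grid of configurations, each of which is audited separately. The sketch below illustrates the idea with placeholder values; the model identifiers, temperature levels, and constraint text are assumptions, not the settings used in the paper.

```python
from itertools import product

# Placeholder levels for each independent variable.
models = ["model-a", "model-b"]
temperatures = [0.0, 1.0]
constraints = [None, "diverse candidates"]
rag_options = [False, True]

# Full factorial design: every combination becomes one audited configuration.
grid = [
    {"model": m, "temperature": t, "constraint": c, "rag": r}
    for m, t, c, r in product(models, temperatures, constraints, rag_options)
]

print(len(grid))  # 16
```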

Modeling

Base Model: 22 distinct LLMs evaluated (including GPT-4, Gemini, Claude, Llama 3, Mistral)

Compute: Not reported in the paper (Inference-only audit using API credits)

Limitations

Relies on inferred demographic attributes (name-based gender/ethnicity) rather than self-identification
Ground truth is limited to Physics (APS data), potentially limiting generalization to other fields
Evaluation period is limited to one month, potentially missing long-term temporal shifts in model behavior

Reproducibility

The paper reports that code and data have been released (Barolo and Espín-Noboa, 2026). Ground truth relies on APS data (proprietary but accessible for research) and OpenAlex (open).

📊 Experiments & Results

Evaluation Setup

Scholar recommendation across 5 task families (Top-k, Field, Epoch, Seniority, Twin) verified against APS physics data

Benchmarks:

LLMScholarBench (Expert Recommendation / Information Retrieval) [New]

Metrics:

Factual Accuracy (proportion of real scholars)
Refusal Rate (rate of declining to answer)
Diversity (entropy over demographic categories)
Parity (alignment with population demographics)
Validity (production of parseable lists)

Statistical methodology: Not explicitly reported in the paper
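Several of the metrics listed above can be made concrete with a short sketch. The summary gives only one-line descriptions, so the exact formulas here are assumptions: Shannon entropy for diversity, and total variation distance from population shares as a parity gap (0 means perfect parity). All function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Optional, Sequence, Set


def factual_accuracy(names: Sequence[str], known: Set[str]) -> float:
    """Proportion of recommended names found in the ground truth."""
    if not names:
        return 0.0
    return sum(n in known for n in names) / len(names)


def refusal_rate(responses: Sequence[Optional[List[str]]]) -> float:
    """Share of responses that declined to answer (represented here as None)."""
    if not responses:
        return 0.0
    return sum(r is None for r in responses) / len(responses)


def diversity_entropy(groups: Sequence[str]) -> float:
    """Shannon entropy (bits) over the demographic categories in one list."""
    if not groups:
        return 0.0
    counts = Counter(groups)
    total = len(groups)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def parity_gap(groups: Sequence[str], population: Dict[str, float]) -> float:
    """Total variation distance between list shares and population shares."""
    if not groups:
        return 0.0
    counts = Counter(groups)
    total = len(groups)
    categories = set(counts) | set(population)
    return 0.5 * sum(
        abs(counts.get(c, 0) / total - population.get(c, 0.0)) for c in categories
    )


# Placeholder data: two categories "a" and "b" with equal population shares.
print(diversity_entropy(["a", "a", "b", "b"]))                  # 1.0
print(parity_gap(["a", "a", "a", "a"], {"a": 0.5, "b": 0.5}))   # 0.5
```

Keeping the technical metrics (factual accuracy, refusal rate) separate from the social ones (diversity, parity) mirrors the dual-axis evaluation, so a gain on one axis can be read against a loss on the other.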

Main Takeaways

Interventions redistribute rather than reduce error: Improving social representation often degrades technical quality (factuality/validity)
Higher temperature increases diversity but significantly degrades validity, consistency, and factuality (hallucinations increase)
Representation-constrained prompting (explicitly asking for diversity) succeeds in diversifying lists but at the expense of factual accuracy
RAG (Web Search) primarily improves technical quality (factuality) but reduces diversity and parity, reinforcing the visibility of already prominent scholars
Reasoning models and standard models react differently to constraints, but no single configuration optimizes all dimensions simultaneously

📚 Prerequisite Knowledge

Prerequisites

Understanding of Large Language Models (LLMs) and inference parameters
Familiarity with Retrieval-Augmented Generation (RAG)
Basic knowledge of bibliometrics (citations, h-index) and academic publishing

Key Terms

RAG: Retrieval-Augmented Generation—enhancing LLM outputs by retrieving relevant data from external sources (here, web search) during inference

Hallucination: The generation of factually incorrect or non-existent information (e.g., inventing fake scholars)

Temperature: A sampling parameter controlling the randomness of LLM token selection; higher values increase output variability but risk incoherence

APS: American Physical Society—source of the ground-truth publication data used to verify scholar existence and expertise

Representation-constrained prompting: Modifying the input prompt to explicitly request recommendations satisfying specific demographic or attribute-based criteria (e.g., 'include more women')

Matthew effect: The phenomenon where established scientists get disproportionately more credit than unknown ones ('the rich get richer')

OpenAlex: A catalog of scholarly works used here to augment APS data with global bibliometric indicators and name resolution
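The Temperature entry among the key terms can be illustrated with the standard temperature-scaled softmax, in which logits are divided by the temperature before normalization. This is a generic sketch of the textbook mechanism, not something taken from the paper, and hosted model APIs may handle edge cases such as a temperature of zero differently.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Convert logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive in this sketch")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # illustrative values only
sharp = softmax_with_temperature(logits, 0.5)  # low temperature: mass concentrates
flat = softmax_with_temperature(logits, 2.0)   # high temperature: mass spreads out
print(sharp[0] > flat[0])  # True
```

A flatter distribution makes less likely tokens more probable, which is consistent with the finding that higher temperature raises variability while increasing the risk of invalid or hallucinated names.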