Do LLM Evaluators Prefer Themselves for a Reason?

📝 Paper Summary

LLM-as-a-Judge Bias in automated evaluation

By testing on verifiable benchmarks like math and code, this study reveals that stronger models prefer their own outputs primarily because those outputs are objectively better, not just due to bias.

Core Problem

Prior research shows that LLMs favor their own outputs during evaluation (self-preference bias). Because that research relies on subjective tasks, it cannot tell whether this preference is a harmful error or a correct judgment of superior quality.

Why it matters:

If self-preference is purely bias, using LLMs for benchmarking or self-refinement is unreliable and unfair
Previous studies on subjective tasks (summarization, chat) lack ground truth, leaving the nature of this bias ambiguous

Concrete Example: Suppose an LLM evaluator judges two responses: a correct one it generated itself and an incorrect one from another model. Here it should prefer its own response. Conversely, if its own answer is wrong and it still prefers that answer over a correct peer answer, that is harmful bias. Subjective chat benchmarks cannot distinguish these two cases.

Key Novelty

Verifiable Ground-Truth Analysis of Self-Preference

Evaluates self-preference on objective tasks (Math, Code, Facts) where correctness is verifiable, allowing separation of 'legitimate' preference (preferring self because self is right) from 'harmful' preference (preferring self despite being wrong)
Systematically compares 11 evaluator models against a fixed set of 7 evaluatee models to standardize the measurement of bias across model scales
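The legitimate versus harmful split described above can be sketched as a small classifier over one pairwise judgment. This is a minimal illustration, not the paper's code: the names `Preference` and `classify_preference` are hypothetical, and treating pairs where both responses are correct or both are wrong as undecidable is an assumption of this sketch.

```python
from enum import Enum


class Preference(Enum):
    LEGITIMATE = "legitimate"  # judge prefers its own correct answer over a wrong peer answer
    HARMFUL = "harmful"        # judge prefers its own wrong answer over a correct peer answer
    OTHER = "other"            # no self-preference, or ground truth cannot separate the pair


def classify_preference(prefers_self: bool, self_correct: bool, peer_correct: bool) -> Preference:
    """Label one pairwise judgment using ground-truth correctness (hypothetical helper)."""
    if not prefers_self:
        return Preference.OTHER
    if self_correct and not peer_correct:
        return Preference.LEGITIMATE
    if not self_correct and peer_correct:
        return Preference.HARMFUL
    # Assumption: both-correct and both-wrong pairs are left unlabeled here.
    return Preference.OTHER
```

Without correctness labels, the first two branches collapse into a single "prefers self" outcome, which is the ambiguity the paper attributes to subjective benchmarks.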

Architecture

Conceptual comparison between previous subjective studies and this paper's objective framework.

Evaluation Highlights

Task accuracy and judge accuracy are strongly correlated (r > 0.70 across MATH500, MMLU, MBPP+), confirming better generators are better evaluators
Llama-3-70B achieves 95.16% Legitimate Self-Preference Ratio (LSPR) on MATH500, indicating its self-preference is almost entirely justified by objective quality
Stronger models exhibit more harmful self-preference when they do err: incorrect strong models prefer their own wrong answers more often than weaker models do

Breakthrough Assessment

7/10

Adds a crucial nuance to the 'self-preference is bad' narrative by showing that it is largely legitimate in strong models, though the contribution is an analysis framework rather than a new model architecture.

⚙️ Technical Details

Problem Definition

Setting: Pairwise evaluation of LLM responses with ground-truth correctness labels

Inputs: User query x, response A from model A, response B from model B

Outputs: Verdict (A is better, B is better, or Tie)

Pipeline Flow

Response Generation (Evaluator & Evaluatee generate answers)
Ground Truth Verification (Check correctness against labels)
Pairwise Judgment (Evaluator judges anonymized pairs)
Metric Calculation (SPR, LSPR)
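The four steps above can be sketched end to end as follows. Everything here is illustrative: `run_pipeline` and its callable parameters are hypothetical stand-ins for real model calls and checkers, the "A"/"B"/"Tie" verdict strings are assumed, and the random ordering of the two responses is an assumption of this sketch (the summary only states that pairs are anonymized).

```python
import random
from typing import Callable, Sequence


def run_pipeline(
    questions: Sequence[str],
    evaluator_generate: Callable[[str], str],
    evaluatee_generate: Callable[[str], str],
    is_correct: Callable[[str, str], bool],
    judge: Callable[[str, str, str], str],
    seed: int = 0,
) -> dict:
    """Run generation, verification, judgment, and metric calculation (sketch)."""
    rng = random.Random(seed)
    records = []
    for question in questions:
        # 1. Response generation by the evaluator (self) and the evaluatee (peer).
        own = evaluator_generate(question)
        peer = evaluatee_generate(question)
        # 2. Ground-truth verification of both responses.
        own_ok = is_correct(question, own)
        peer_ok = is_correct(question, peer)
        # 3. Pairwise judgment: the judge only sees unlabeled responses A and B.
        own_is_a = rng.random() < 0.5
        response_a, response_b = (own, peer) if own_is_a else (peer, own)
        verdict = judge(question, response_a, response_b)  # "A", "B", or "Tie"
        prefers_self = verdict == ("A" if own_is_a else "B")
        records.append(
            {"self_correct": own_ok, "peer_correct": peer_ok, "prefers_self": prefers_self}
        )
    # 4. Metric calculation: SPR over all pairs, LSPR over self-preferred pairs.
    self_preferred = [r for r in records if r["prefers_self"]]
    spr = len(self_preferred) / len(records) if records else None
    lspr = (
        sum(r["self_correct"] for r in self_preferred) / len(self_preferred)
        if self_preferred
        else None
    )
    return {"records": records, "spr": spr, "lspr": lspr}
```

Passing the models and the checker in as callables keeps the sketch independent of any particular inference API.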

System Modules

Response Generator

Generate answers to benchmark questions (MATH500, MMLU, MBPP+)

Model or implementation: Evaluator models (Llama, Qwen, Gemma) and Evaluatee models (Mistral, Phi, GPT)

Oracle Verifier

Determine objective correctness of responses

Model or implementation: Script-based checking (exact match, unit test execution)
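A rough sketch of what script-based checking could look like for the two modes named above. The helper names `exact_match` and `passes_unit_tests` are hypothetical, the whitespace and case normalization is an assumption, and the paper's actual checkers may differ (for example, math answers often need symbolic equivalence rather than string equality).

```python
def exact_match(prediction: str, reference: str) -> bool:
    """Compare answers after trimming, lowercasing, and collapsing whitespace."""
    def normalize(text: str) -> str:
        return " ".join(text.strip().lower().split())
    return normalize(prediction) == normalize(reference)


def passes_unit_tests(candidate_code: str, test_code: str) -> bool:
    """Return True if the candidate runs and all assertions in test_code pass.

    Warning: exec() runs arbitrary code. A real harness should execute model
    output in a sandboxed subprocess with a timeout; this is only a sketch.
    """
    namespace: dict = {}
    try:
        exec(candidate_code, namespace)  # define the candidate's functions
        exec(test_code, namespace)       # run assert-style unit tests against them
    except Exception:
        return False
    return True
```

For instance, `passes_unit_tests("def add(a, b):\n    return a + b", "assert add(1, 2) == 3")` evaluates to `True`, while a candidate that returns `a - b` fails the same test.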

LLM Judge

Evaluate pairwise preference between self and peer responses

Model or implementation: Same model as Evaluator Generator (e.g., Llama-3-70B judging itself)

Modeling

Base Model: Various: Qwen2.5 (3B-72B), Llama-3 (8B-70B), Gemma-2 (9B-27B)

Compute: Not reported in the paper

Comparison to Prior Work

vs. Zheng et al.: Uses objective ground-truth benchmarks (Math/Code) instead of subjective chat to distinguish legitimate vs. harmful bias
vs. Panickssery et al.: Fixes the evaluatee set across all judges for consistent cross-model comparison
vs. Dubois et al. [not cited in paper]: Dubois et al. (AlpacaFarm) focus on cost/speed of evaluation; this paper focuses on validity of self-preference

Limitations

Evaluation is limited to verifiable domains (Math, Code, Facts); findings may not fully extrapolate to creative writing or other nuanced, open-ended tasks
Focuses on zero-shot generation with greedy decoding; sampling and few-shot settings were not explored extensively
Relies on specific prompt templates for judging; sensitivity to prompt variations was not the primary focus

Reproducibility

Code: https://github.com/wlchen0206/llm-sp

Code and artifacts publicly available at https://github.com/wlchen0206/llm-sp. Detailed prompts provided in Appendix B. Uses standard public benchmarks (MATH500, MMLU, MBPP+) and open weights models.

📊 Experiments & Results

Evaluation Setup

Pairwise comparison of LLM outputs on verifiable tasks

Benchmarks:

MATH500 (Mathematical reasoning)
MMLU (Factual knowledge, multiple choice)
MBPP+ (Code generation)

Metrics:

Self-Preference Ratio (SPR)
Legitimate Self-Preference Ratio (LSPR)
Judge Accuracy
Task Accuracy
Statistical methodology: Pearson correlation coefficient (r) reported for relationship between task accuracy and judge accuracy
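For reference, the Pearson coefficient used to relate task accuracy to judge accuracy can be computed as below. This is the standard textbook formula, written out in plain Python; the function name `pearson_r` and the example inputs are illustrative and are not values from the paper.

```python
import math
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation: covariance divided by the product of standard deviations."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Placeholder per-model accuracies (made up for illustration only).
task_accuracy = [0.40, 0.55, 0.70, 0.85]
judge_accuracy = [0.50, 0.60, 0.75, 0.80]
r = pearson_r(task_accuracy, judge_accuracy)
```

Each point in such an analysis would be one evaluator model, so with 11 evaluators the coefficient is computed over 11 (task accuracy, judge accuracy) pairs per benchmark.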

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Correlation analysis shows that better generators are generally better evaluators, and their self-preference is largely driven by quality.
MATH500	Pearson r (Task Acc vs Judge Acc)	0.0	0.795	+0.795
MBPP+	Pearson r (Task Acc vs Judge Acc)	0.0	0.899	+0.899
MATH500	LSPR (Llama-3-70B)	0.0	95.16	+95.16
MATH500	LSPR (Qwen2.5-72B)	0.0	96.57	+96.57
Analysis of harmful bias shows stronger models struggle more to admit error when they are incorrect.
MATH500	Harmful Self-Preference Rate (when Judge is wrong)	10.0	45.0	+35.0
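One way to read the harmful self-preference rate in the last row is as a conditional frequency: among pairs where the judge's own response is wrong and the peer's is correct, how often the judge still picks its own. The sketch below assumes that conditioning, which this summary does not spell out, and uses a hypothetical record format with the keys `self_correct`, `peer_correct`, and `prefers_self`.

```python
from typing import Optional, Sequence


def harmful_self_preference_rate(records: Sequence[dict]) -> Optional[float]:
    """Share of (self wrong, peer correct) pairs where the judge prefers itself."""
    eligible = [r for r in records if not r["self_correct"] and r["peer_correct"]]
    if not eligible:
        return None  # undefined when the judge is never wrong against a correct peer
    return sum(1 for r in eligible if r["prefers_self"]) / len(eligible)
```

Note that a stronger model contributes fewer eligible pairs, so a high rate for a strong model can coexist with a small absolute number of harmful judgments.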

Main Takeaways

Self-preference in strong models is predominantly legitimate: they prefer themselves because they are right.
Harmful self-preference persists specifically when models fail as generators; stronger models are more stubborn (higher harmful bias) when they are wrong compared to weaker models.
Inference-time scaling (e.g., using Chain-of-Thought for evaluation) effectively reduces harmful self-preference bias.
Findings extend to subjective domains (LMArena), suggesting these patterns are fundamental to current LLM evaluation dynamics.

📚 Prerequisite Knowledge

Prerequisites

LLM-as-a-Judge methodology
Understanding of bias in machine learning evaluation
Familiarity with standard benchmarks (MATH, MMLU, MBPP)

Key Terms

Self-Preference Ratio (SPR): The proportion of cases where a judge model favors its own response over a peer's response

Legitimate Self-Preference Ratio (LSPR): Among the cases where the judge prefers its own response, the proportion in which that response is objectively correct

LLM-as-a-Judge: Using a Large Language Model to evaluate the quality of text generated by other models

CoT: Chain-of-Thought—a prompting technique where models generate intermediate reasoning steps before the final answer

Pass@1: A metric for code generation measuring the percentage of problems where the first generated solution is correct
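The SPR and LSPR definitions above can be written compactly as follows. The notation is introduced here for illustration and is not taken from the paper: N is the number of judged pairs, S is the set of pairs on which the judge prefers its own response, and C is the set of pairs on which the judge's own response is correct. The paper's formal definition may restrict the pairs further (for example, to pairs where exactly one response is correct).

```latex
\[
\mathrm{SPR} = \frac{|S|}{N},
\qquad
\mathrm{LSPR} = \frac{|S \cap C|}{|S|}
\]
```

Under this reading, an LSPR of 95.16% means that in roughly 95 of every 100 pairs where the judge chose its own response, that response was verified as correct.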