LLMs Capture Emotion Labels, Not Emotion Uncertainty: Distributional Analysis and Calibration of Human--LLM Judgment Gaps

📝 Paper Summary

LLM Evaluation, Emotion Analysis, Model Calibration

Large language models struggle to capture human annotation disagreement in emotion analysis without supervised fine-tuning, but their distributional alignment can be improved through lightweight post-hoc calibration.

Core Problem

Human annotators frequently disagree on emotion labels due to genuine interpretive differences, but standard Large Language Model (LLM) evaluations collapse these distributions into a single majority label, obscuring whether models actually capture human uncertainty.

Why it matters:

Discarding annotator disagreement masks critical differences in emotional perception, cultural background, and text ambiguity
Prior work evaluating LLMs as annotators primarily tests majority-vote accuracy, ignoring the structural nuances of Human Label Variation (HLV)

Concrete Example: Different human annotators may judge the same sarcastic Reddit comment as amusement, annoyance, or neutral. Zero-shot LLMs systematically fail to reproduce this spread of uncertainty; their predictions instead collapse toward a few favored categories, such as negative emotions.

Key Novelty

Distributional evaluation and lexical transparency profiling for LLM emotion annotation

Treats both human judgments (from multiple annotators) and LLM outputs (via temperature sampling) as full probability distributions for direct structural comparison
Introduces a lexical transparency score that predicts which emotion categories an LLM can reliably annotate based on explicit lexical markers

Architecture

The overall evaluation pipeline comparing human annotations and LLM outputs as probability distributions over emotion categories.

Evaluation Highlights

Zero-shot LLMs show substantial divergence from human distributions, with Jensen-Shannon Divergence (JSD) greater than 0.45 on the GoEmotions benchmark
In-domain fine-tuned RoBERTa achieves roughly half the distributional gap of the best zero-shot LLMs
Post-hoc calibration via isotonic regression reduces the human-LLM distributional gap by up to 14% across models

Breakthrough Assessment

7/10

Provides a strong framework for evaluating LLM uncertainty against human disagreement, showing that current zero-shot models do not reproduce human annotation uncertainty while offering practical calibration fixes.

⚙️ Technical Details

Problem Definition

Setting: Distributional comparison of emotion annotation between human annotator variance and LLM output variance across categorical and continuous frameworks

Inputs: Natural language text sequences (e.g., Reddit comments or sentences)

Outputs: Probability distributions over 28 emotion categories, or continuous Valence-Arousal-Dominance (VAD) ratings

Pipeline Flow

LLM Sampler (generates multiple responses via temperature sampling) → Distribution Aggregator (constructs empirical distribution) → Post-hoc Calibrator (aligns with human data)

System Modules

LLM Sampler

Generates multiple emotion annotations for a text using different temperatures to build an output distribution

Model or implementation: GPT-5.4-mini, Claude Haiku 4.5, Llama 3.1 8B, or Qwen3-8B

Distribution Aggregator

Aggregates independent categorical choices or continuous ratings into a normalized probability distribution over emotion labels

Model or implementation: Deterministic counting function
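The counting step can be illustrated with a minimal sketch. It assumes each annotator or LLM sample contributes exactly one label, which may differ from the paper's handling of multi-label annotations; the three-label set and the function name `to_distribution` are hypothetical, chosen only for illustration.

```python
from collections import Counter

# Hypothetical three-label subset; the summary describes 28 categories.
LABELS = ["amusement", "annoyance", "neutral"]


def to_distribution(choices, labels=LABELS):
    """Turn a list of categorical choices into a normalized distribution."""
    if not choices:
        raise ValueError("need at least one annotation")
    counts = Counter(choices)
    unknown = set(counts) - set(labels)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    total = len(choices)
    # One probability per label, in the fixed order given by `labels`.
    return [counts[label] / total for label in labels]


# Three human annotators who disagree, and five LLM samples that mostly agree.
human = to_distribution(["amusement", "annoyance", "neutral"])
llm = to_distribution(["annoyance"] * 4 + ["neutral"])
```

Keeping a fixed label order means the human and LLM vectors line up index by index, which is what any later divergence computation needs.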

Post-hoc Calibrator

Reduces the distributional gap between human and LLM outputs using methods like isotonic regression, temperature scaling, or bias correction

Model or implementation: Statistical calibration fit via 5-fold cross-validation
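As a rough illustration of the isotonic regression option, the sketch below fits a monotone map from LLM probabilities to human probabilities for a single emotion category using the pool-adjacent-violators algorithm. The summary does not say whether the paper fits one map per category or a pooled map, and the cross-validation folds are omitted here; the function names and toy numbers are assumptions.

```python
from bisect import bisect_left


def pava(values):
    """Pool-adjacent-violators: least-squares non-decreasing fit to `values`."""
    blocks = []  # each block is [sum_of_values, count]
    for v in values:
        blocks.append([v, 1])
        # Merge backwards while the previous block's mean exceeds the last one's.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            s, n = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += n
    fitted = []
    for s, n in blocks:
        fitted.extend([s / n] * n)
    return fitted


def fit_isotonic(llm_probs, human_probs):
    """Fit a monotone map for one category (ties in x are not pooled here)."""
    if not llm_probs or len(llm_probs) != len(human_probs):
        raise ValueError("need equal-length, non-empty inputs")
    pairs = sorted(zip(llm_probs, human_probs))
    knots = [x for x, _ in pairs]
    fitted = pava([y for _, y in pairs])
    return knots, fitted


def calibrate(model, x):
    """Apply the fitted map with linear interpolation, clamped at the ends."""
    knots, fitted = model
    if x <= knots[0]:
        return fitted[0]
    if x >= knots[-1]:
        return fitted[-1]
    i = bisect_left(knots, x)  # knots[i - 1] < x <= knots[i]
    x0, x1 = knots[i - 1], knots[i]
    t = (x - x0) / (x1 - x0)
    return fitted[i - 1] + t * (fitted[i] - fitted[i - 1])


# Toy data: LLM probability vs. human probability for one category.
model = fit_isotonic([0.1, 0.2, 0.3, 0.4], [0.1, 0.3, 0.2, 0.4])
```

If a map like this is applied to each category separately, the calibrated values generally no longer sum to one, so a renormalization step would be needed before comparing against the human distribution.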

Modeling

Base Model: GPT-5.4-mini, Claude Haiku 4.5, Llama 3.1 8B, Qwen3-8B, and RoBERTa-base

Comparison to Prior Work

vs. Ni et al. (2026): Analyzes the full 28-category emotion distribution instead of reducing it to binary classification subtasks, and actively proposes calibration methods
vs. Falk and Lapesa (2025): Focuses strictly on zero-shot LLMs rather than fine-tuned models, and formalizes a predictive lexical transparency score
vs. Jury Learning [not cited in paper]: Focuses on zero-shot distributional evaluation and post-hoc calibration rather than supervised training of dissenting voices

Limitations

Approximates LLM uncertainty using temperature sampling, which modulates output entropy rather than reflecting genuine interpretive differences like human disagreement
Categorical emotion analysis relies on categories with varying sample sizes; low-frequency emotions (e.g., grief, relief) yield less reliable statistical correlations
Proposed calibration methods require a labeled validation set of human distributions, limiting pure zero-shot applicability

Reproducibility

No replication artifacts are explicitly mentioned as publicly available. Datasets (GoEmotions, EmoBank) are standard and public. Exact prompts are in appendices A and B (not fully detailed in the excerpt). Uses closed-source commercial APIs (GPT-5.4-mini, Claude Haiku 4.5) for half the evaluation.

📊 Experiments & Results

Evaluation Setup

Distributional comparison of emotion annotations treating both human judgments and Large Language Model (LLM) outputs as probability distributions

Benchmarks:

GoEmotions (Core Set): categorical emotion distribution over 28 categories
EmoBank (Core Set): continuous Valence-Arousal-Dominance (VAD) ratings

Metrics:

Jensen-Shannon Divergence (JSD)
KL divergence
Wasserstein distance
Spearman's rho (Entropy correlation)
Mean Absolute Error (MAE)
Pearson correlation
Statistical methodology: Kruskal-Wallis H-tests with Dunn's post-hoc correction, bootstrap resampling (1,000 iterations) for 95% confidence intervals, paired Wilcoxon signed-rank tests, Cohen's d, and Cliff's delta.
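For readers unfamiliar with the headline metric, here is a minimal sketch of Jensen-Shannon Divergence built from KL divergence. It uses base-2 logarithms so the result lies in [0, 1]; the summary does not state which logarithm base the paper uses, so that choice is an assumption.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd(p, q):
    """Jensen-Shannon Divergence between two distributions of equal length."""
    if len(p) != len(q):
        raise ValueError("distributions must cover the same categories")
    # The mixture m is non-zero wherever p or q is, so the KL terms stay finite.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Identical distributions diverge by 0; disjoint ones reach the maximum of 1 bit.
same = jsd([0.5, 0.5], [0.5, 0.5])
disjoint = jsd([1.0, 0.0], [0.0, 1.0])
```

Unlike plain KL divergence, this quantity is symmetric and stays finite when one distribution assigns zero probability to a category the other uses, which matters when an LLM never samples a label that some annotators chose.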

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
GoEmotions	Jensen-Shannon Divergence (JSD)	0.573	0.519	-0.054

Experiment Figures

Qualitative bias profiles showing the mean rate difference per emotion category between human distributions and various zero-shot LLMs.

Jensen-Shannon Divergence (JSD) scores stratified by human agreement level (full agreement, partial agreement, full disagreement).
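The stratification behind this figure might look like the sketch below. The summary does not define the three agreement levels, so the rule used here (all annotators give the same label, all give different labels, or anything in between) is an assumption, as are the function names and the toy scores.

```python
from collections import defaultdict
from statistics import mean


def agreement_level(annotations):
    """Classify one text by how many distinct labels its annotators chose."""
    distinct = len(set(annotations))
    if distinct == 1:
        return "full agreement"
    if distinct == len(annotations):
        return "full disagreement"
    return "partial agreement"


def mean_jsd_by_agreement(items):
    """items: iterable of (human_annotations, precomputed_jsd) pairs."""
    buckets = defaultdict(list)
    for annotations, score in items:
        buckets[agreement_level(annotations)].append(score)
    return {level: mean(scores) for level, scores in buckets.items()}


# Toy per-text scores, not values from the paper.
summary = mean_jsd_by_agreement([
    (["joy", "joy", "joy"], 0.2),
    (["joy", "joy", "joy"], 0.4),
    (["joy", "anger", "joy"], 0.5),
    (["joy", "anger", "neutral"], 0.6),
])
```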

Main Takeaways

Zero-shot LLMs systematically diverge from human emotion distributions (JSD >= 0.45), highlighting that current models capture majority labels but fail to model human annotation uncertainty
In-domain fine-tuning (RoBERTa) dramatically reduces distributional divergence (e.g., JSD 0.10 on full-agreement texts compared to 0.31-0.56 for zero-shot models), suggesting that supervised training primarily improves unambiguous cases
A novel lexical transparency score correlates with human-LLM agreement (Spearman's r_s = 0.51); LLMs perform well on explicitly marked emotions (love, gratitude) but struggle systematically with pragmatically complex ones (approval, realization)
Open-source models exhibit 2-3x greater temperature sensitivity than API models, meaning higher sampling diversity is required for OSS models to recover uncertainty signals
Post-hoc calibration via isotonic regression significantly reduces the per-text distributional gap for GPT, Claude, and Llama models, improving 62-64% of texts evaluated

📚 Prerequisite Knowledge

Prerequisites

Probability distributions
Zero-shot prompting of Large Language Models
Emotion classification frameworks (categorical vs continuous)

Key Terms

Human Label Variation (HLV): Genuine disagreement among human annotators reflecting differences in interpretation rather than random error

Large Language Model (LLM): A neural language model trained on vast amounts of text, evaluated here for its ability to annotate emotions

Jensen-Shannon Divergence (JSD): A symmetric, bounded measure of the dissimilarity between two probability distributions; lower values mean the distributions are closer

Lexical transparency score: A metric proposed in this paper combining embedding similarity and emotion lexicon coverage to quantify how explicitly an emotion is signaled in text

Temperature sampling: A method to control the randomness of an LLM's outputs, used here to approximate the model's soft probability distribution over emotion labels
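A toy sketch can show how temperature changes the entropy of a sampling distribution. It applies a temperature-scaled softmax directly to made-up label logits; real LLMs sample tokens rather than labels, so this illustrates only the mechanism, and the logits and function names are hypothetical.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Softmax over logits divided by a positive temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy_bits(p):
    """Shannon entropy in bits, skipping zero-probability entries."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)


logits = [2.0, 1.0, 0.0]  # hypothetical scores for three emotion labels
entropies = {
    t: entropy_bits(softmax_with_temperature(logits, t)) for t in (0.3, 1.0, 2.0)
}
```

Raising the temperature flattens the distribution and raises its entropy, while lowering it concentrates mass on the top label. The relative ranking of labels stays the same throughout.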

Isotonic regression: A non-parametric calibration method that fits a monotone mapping to align predicted probabilities with actual outcomes

Valence-Arousal-Dominance (VAD): A continuous framework for representing emotions along three psychological dimensions

API models: Application Programming Interface models; commercial, closed-source models accessed via the cloud (e.g., GPT-5.4-mini, Claude Haiku 4.5)

OSS models: Open-Source Software models; publicly available models that can be run locally (e.g., Llama 3.1 8B, Qwen3-8B)