Human Label Variation (HLV): Genuine disagreement among human annotators reflecting differences in interpretation rather than random error
Large Language Model (LLM): A neural language model trained on large text corpora, evaluated here for its ability to annotate emotions
Jensen-Shannon Divergence (JSD): A symmetric, bounded measure of how much two probability distributions differ; it is zero when the distributions are identical
Lexical transparency score: A metric proposed in this paper combining embedding similarity and emotion lexicon coverage to quantify how explicitly an emotion is signaled in text
Temperature sampling: Sampling from an LLM's output distribution with a temperature parameter that controls randomness, used here to approximate the model's soft probability distribution over emotion labels
Isotonic regression: A non-parametric calibration method that fits a monotone mapping to align predicted probabilities with actual outcomes
Valence-Arousal-Dominance (VAD): A continuous framework for representing emotions along three psychological dimensions
API models: Commercial, closed-source models accessed through a cloud-hosted Application Programming Interface (e.g., GPT-5.4-mini, Claude Haiku 4.5)
OSS models: Open-Source Software models; publicly available models that can be run locally (e.g., Llama 3.1 8B, Qwen3-8B)
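
To make the JSD and temperature sampling entries concrete, here is a minimal sketch of how an empirical label distribution could be built from repeated samples and compared with a human label distribution. The function names, the emotion label set, and the example labels are illustrative assumptions, not the paper's implementation or data. Logarithms are base 2, so the result lies between 0 and 1.

```python
import math
from collections import Counter


def label_distribution(labels, label_set):
    """Turn a list of labels (e.g. repeated samples) into relative frequencies."""
    counts = Counter(labels)
    total = len(labels)
    return [counts[label] / total for label in label_set]


def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon_divergence(p, q):
    """JSD(p, q) = 0.5 * KL(p || m) + 0.5 * KL(q || m), with m the midpoint."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical label set and annotations, for illustration only.
label_set = ["joy", "sadness", "anger"]
human_labels = ["joy", "joy", "sadness", "joy"]   # four human annotators
model_samples = ["joy", "joy", "joy", "joy"]      # four sampled model outputs

human_dist = label_distribution(human_labels, label_set)
model_dist = label_distribution(model_samples, label_set)
print(jensen_shannon_divergence(human_dist, model_dist))
```

Because the midpoint distribution m is nonzero wherever p or q is nonzero, the computation stays finite even when one distribution assigns zero probability to a label, which is one reason JSD is convenient for comparing sparse annotation distributions.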