ALM: Audio-Language Model—a multimodal model taking audio and text as input and producing text output
ASR: Automatic Speech Recognition—the task of transcribing spoken audio into text
WER: Word Error Rate—a common metric for ASR performance, computed as the number of word-level errors (substitutions, deletions, insertions) in a transcript divided by the number of words in the reference; lower is better, and values can exceed 1
BLEU: Bilingual Evaluation Understudy—a metric for evaluating machine translation quality by measuring n-gram overlap between the model's output and reference translations
CoRe-Bench: Conversational Reasoning Benchmark—a new synthetic dataset introduced in this paper that tests reasoning over multi-turn audio dialogues
PARADE: A new synthetic dataset introduced in this paper, designed to probe stereotyping in ALMs by associating voices with occupations or social status
AHELM: Audio Holistic Evaluation of Language Models—the benchmark framework introduced in this paper
counterfactual fairness: Evaluating whether a model's output remains consistent when non-essential attributes (like speaker gender) are altered
MFCCs: Mel-frequency Cepstral Coefficients—features commonly extracted from audio signals for speech processing
Mean Win Rate: For a given metric, the probability that a model outperforms another model, selected uniformly at random, in a head-to-head comparison
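
The WER entry above can be made concrete with a short sketch. The Python below computes a word-level edit distance between a reference and a hypothesis transcript and divides it by the reference length. It is an illustration only, not the scoring code used by the paper: the function name `word_error_rate` is hypothetical, and the sketch assumes whitespace tokenization with no text normalization (casing, punctuation), which real evaluations usually apply first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: tokens are obtained by splitting on whitespace, with no
    normalization of case or punctuation.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # substitution, or a match when cost is 0
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution (sat -> sit) and one deletion (the) over six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because insertions count as errors but do not add to the denominator, a hypothesis much longer than the reference yields a WER above 1.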
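
The Mean Win Rate entry above can also be sketched in a few lines. The Python below takes, for one metric, each model's score on a list of scenarios, computes the fraction of other models it beats on each scenario, and averages those fractions. This is a minimal sketch under stated assumptions, not the paper's implementation: the function name `mean_win_rates`, the input layout, the rule that a tie counts as half a win, and the plain averaging across scenarios are all illustrative choices.

```python
from statistics import mean


def mean_win_rates(scores: dict[str, list[float]],
                   higher_is_better: bool = True) -> dict[str, float]:
    """Mean win rate per model for a single metric.

    scores maps a model name to its per-scenario scores (same scenario order
    for every model). Assumptions: a tie counts as half a win, and scenarios
    are weighted equally.
    """
    models = list(scores)
    if len(models) < 2:
        raise ValueError("need at least two models to compare")
    n_scenarios = len(scores[models[0]])
    if n_scenarios == 0 or any(len(scores[m]) != n_scenarios for m in models):
        raise ValueError("every model needs the same non-empty list of scenarios")

    sign = 1.0 if higher_is_better else -1.0  # flip so larger always means better
    result = {}
    for model in models:
        per_scenario = []
        for s in range(n_scenarios):
            wins = 0.0
            for other in models:
                if other == model:
                    continue
                a = sign * scores[model][s]
                b = sign * scores[other][s]
                if a > b:
                    wins += 1.0
                elif a == b:
                    wins += 0.5  # assumption: ties split the win
            # Fraction of the other models beaten on this scenario.
            per_scenario.append(wins / (len(models) - 1))
        result[model] = mean(per_scenario)
    return result


if __name__ == "__main__":
    example = {"A": [0.9, 0.8], "B": [0.5, 0.8], "C": [0.1, 0.2]}
    print(mean_win_rates(example))
```

For an error metric such as WER, pass `higher_is_better=False` so that the lower score wins each comparison.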