AHELM: A Holistic Evaluation of Audio-Language Models

📝 Paper Summary

Multimodal Evaluation, Audio-Language Models

AHELM is a holistic benchmark for audio-language models that standardizes evaluation across 10 distinct aspects, revealing that while top models excel in reasoning, they still exhibit significant fairness and instruction-following issues.

Core Problem

Current evaluations of Audio-Language Models (ALMs) lack standardization, typically focusing on only one or two capabilities (like ASR) while neglecting critical societal aspects like fairness, safety, and bias.

Why it matters:

Comparisons across models are difficult because separate evaluations use different prompting methods and inference parameters
Existing benchmarks omit evaluative aspects such as fairness or safety, which are critical for widespread deployment
Raw predictions are often not released, making detailed error analysis and reproducibility impossible

Concrete Example: When prompted to 'respond with only the transcript text', Qwen2-Audio Instruct fails to follow instructions and outputs conversational filler like 'The speech is in English, saying [transcript]', complicating automated evaluation.

Key Novelty

Holistic Evaluation of Audio-Language Models (AHELM)

Aggregates 14 datasets into a unified framework covering 10 diverse aspects (e.g., audio perception, reasoning, fairness, toxicity) to move beyond simple ASR metrics
Introduces two novel synthetic datasets: PARADE (for measuring stereotype bias in audio) and CoRe-Bench (for multi-turn conversational reasoning)
Standardizes inference parameters (temperature=0) and prompts across 14 ALMs and 3 baseline systems to ensure equitable comparison

Architecture

The AHELM evaluation framework components and flow

Evaluation Highlights

Gemini 2.5 Pro (05-06 Preview) ranks top in 5 out of 10 aspects with a mean win rate of 0.803, but exhibits statistically significant group unfairness on ASR tasks
Baseline systems (ASR + LLM) perform surprisingly well, with GPT-4o-mini Transcribe + GPT-4o ranking 6th overall, outperforming 9 integrated ALMs
In the MuToX scenario, toxicity detection scores are highest for French and Indonesian (Exact Match ~0.956) relative to the other languages tested

Breakthrough Assessment

9/10

Establishes the first comprehensive standard for ALM evaluation, introducing critical new datasets for reasoning and bias. The inclusion of strong ASR+LLM baselines provides a necessary reality check for the field.

⚙️ Technical Details

Problem Definition

Setting: Evaluation of multimodal models that process interleaved audio and text to generate text outputs

Inputs: Interleaved audio files and text prompts

Outputs: Text completions (transcripts, answers to questions, captions, etc.)

Pipeline Flow

Input (Audio + Text Prompt)
Model Inference (Zero-shot)
Metric Calculation (Automated or Model-based Judge)

System Modules

Input Processing

Prepare standardized zero-shot prompts and audio files for the model

Model or implementation: Various (14 ALMs + 3 Baselines)

Model Inference

Generate text response based on audio-text input

Model or implementation: Evaluated Model (e.g., Gemini 2.5 Pro, Qwen2-Audio)

Evaluation

Score the output using deterministic metrics or an LLM judge

Model or implementation: Script or GPT-4o (Judge)
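The three modules above can be sketched as a small evaluation loop. This is a minimal illustration, not the HELM implementation: the `Instance` fields, the `Model` and `Metric` callable types, and the toy model are all hypothetical names introduced here, and the exact-match normalization (strip and lowercase) is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass(frozen=True)
class Instance:
    audio_path: str  # path to the audio clip (hypothetical field name)
    prompt: str      # standardized zero-shot text prompt
    reference: str   # gold answer consumed by the metric

# A model is any callable mapping (audio_path, prompt) to generated text.
Model = Callable[[str, str], str]
# A metric maps (prediction, reference) to a numeric score.
Metric = Callable[[str, str], float]

def exact_match(prediction: str, reference: str) -> float:
    # Assumed normalization: trim whitespace and ignore case.
    return float(prediction.strip().lower() == reference.strip().lower())

def evaluate(model: Model, instances: List[Instance], metric: Metric) -> float:
    """Run input processing -> inference -> scoring and return the mean score."""
    if not instances:
        raise ValueError("need at least one instance")
    scores = []
    for inst in instances:
        prediction = model(inst.audio_path, inst.prompt)    # model inference
        scores.append(metric(prediction, inst.reference))   # evaluation
    return sum(scores) / len(scores)

# Toy stand-in for an ALM; a real model would actually process the audio.
def toy_model(audio_path: str, prompt: str) -> str:
    return "Happy" if "joy" in audio_path else "sad"

instances = [
    Instance("clips/joy_01.wav", "Classify the emotion. Answer with one word.", "happy"),
    Instance("clips/anger_02.wav", "Classify the emotion. Answer with one word.", "angry"),
]
score = evaluate(toy_model, instances, exact_match)
print(score)
```

Keeping the model and the metric behind plain callables means a deterministic script and an LLM judge can be swapped in without touching the loop.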

Novel Architectural Elements

Integration of baseline systems that chain a dedicated ASR system (Whisper or a GPT-4o transcription model) with a text-only LLM (GPT-4o), serving as a reference point for end-to-end ALMs
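A cascaded baseline of this kind can be sketched as function composition: transcribe first, then hand the transcript to a text-only LLM. The names `make_cascade`, `Transcriber`, and `TextLLM`, the prompt template, and the stub components below are assumptions for illustration; the paper's actual prompt wording is not reproduced here.

```python
from typing import Callable

Transcriber = Callable[[str], str]  # audio path -> transcript
TextLLM = Callable[[str], str]      # text prompt -> text completion

def make_cascade(transcribe: Transcriber, llm: TextLLM) -> Callable[[str, str], str]:
    """Build an (audio_path, prompt) -> text system from an ASR step and an LLM step."""
    def cascade(audio_path: str, prompt: str) -> str:
        transcript = transcribe(audio_path)
        # Hypothetical template: the LLM only ever sees text, never the audio.
        combined = f"{prompt}\n\nTranscript of the audio:\n{transcript}"
        return llm(combined)
    return cascade

# Stubs standing in for real ASR and LLM services.
def fake_transcribe(audio_path: str) -> str:
    return "hello world"

def fake_llm(text: str) -> str:
    # Echoes the last line of its input, which here is the transcript.
    return text.splitlines()[-1]

baseline = make_cascade(fake_transcribe, fake_llm)
answer = baseline("clip.wav", "Repeat the transcript.")
print(answer)
```

Because everything downstream of the ASR step is text, any information the transcript drops (music, tone, background sound) is unavailable to the LLM, which is consistent with the reported weakness of these baselines on non-speech tasks.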

Modeling

Base Model: Various (14 ALMs evaluated, including Gemini, GPT-4o Audio, Qwen2-Audio)

Training Method: Zero-shot evaluation only

Key Hyperparameters:

temperature: 0
max_output_tokens: 200
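One way to keep these decoding settings identical across systems is to define them once and merge them into every request. The sketch below is an assumption about how that could look, not the framework's code; `DecodingConfig`, `request_kwargs`, and the model name are hypothetical, and real provider APIs may use different parameter names for the same settings.

```python
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class DecodingConfig:
    temperature: float = 0.0      # temperature 0, as reported in the summary
    max_output_tokens: int = 200  # shared cap on generated tokens

SHARED = DecodingConfig()

def request_kwargs(model_name: str, config: DecodingConfig = SHARED) -> dict:
    """Merge the shared decoding settings into one model's request arguments."""
    return {"model": model_name, **asdict(config)}

kwargs = request_kwargs("example-alm")  # hypothetical model identifier
print(kwargs)
```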

Compute: Not reported in the paper

Comparison to Prior Work

vs. Standard ASR Benchmarks (LibriSpeech): AHELM covers 10 aspects, including fairness, bias, and reasoning, rather than transcription accuracy alone
vs. AIR-Bench: AHELM introduces CoRe-Bench for long-context conversational reasoning and PARADE for bias detection
vs. Dynamic-SUPERB [not cited in paper]: AHELM standardizes prompts and inference parameters across closed and open models to ensure fair comparison

Limitations

Reliance on GPT-4o as a judge for open-ended tasks may introduce its own biases
Evaluation is limited to zero-shot settings, potentially underestimating model capabilities with few-shot prompting
Models released after June 1, 2025 were not evaluated

Reproducibility

Code: https://github.com/stanford-crfm/helm

All raw prompts, model generations, and outputs are available on the project website. Code for the framework is publicly available on GitHub.

📊 Experiments & Results

Evaluation Setup

Zero-shot evaluation across 10 aspects using 14 datasets (12 existing + 2 new)

Benchmarks:

CoRe-Bench (Multi-turn conversational audio reasoning) [New]
PARADE (Stereotype and bias detection in audio) [New]
LibriSpeech (Automatic Speech Recognition (ASR))
MELD (Emotion detection)
FLEURS (Multilingual ASR and Fairness evaluation)
MuToX (Toxicity detection)

Metrics:

Word Error Rate (WER)
Accuracy
Mean Win Rate (MWR)
BLEU score
Exact Match

Statistical methodology: Paired t-test for fairness performance disparity (p-values reported)
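As a rough illustration of the paired t-test mentioned above, the sketch below computes the t statistic and degrees of freedom from per-item score pairs using only the standard library. How the paper forms its pairs is not stated in this summary, so the pairing (the same items scored for two speaker groups) and the sample values are assumptions. Turning the statistic into a p-value needs the Student t distribution, for example via `scipy.stats.ttest_rel`.

```python
import math
from statistics import mean, stdev
from typing import Sequence, Tuple

def paired_t_statistic(a: Sequence[float], b: Sequence[float]) -> Tuple[float, int]:
    """Return (t, degrees_of_freedom) for paired samples a and b."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally long samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    t = mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1

# Made-up per-item WER values for the same utterances in two speaker groups.
wer_group_a = [0.10, 0.12, 0.08, 0.15, 0.11]
wer_group_b = [0.13, 0.14, 0.11, 0.16, 0.15]
t_stat, dof = paired_t_statistic(wer_group_a, wer_group_b)
print(t_stat, dof)
```

A negative statistic here means group A has lower WER on average; whether the gap is significant depends on the p-value for that statistic at the given degrees of freedom.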

Key Results

Overall performance rankings show Gemini 2.5 Pro leading, but baseline systems (ASR+LLM) remain competitive.

Benchmark	Metric	Baseline	This Paper	Δ
All Scenarios	Mean Win Rate	Rank 6 overall	Rank 1 (Gemini 2.5 Pro)	Top Rank
ASR Tasks	Group Unfairness (p-value)	Not significant	0.02	Significant
Robust Speech Bench	WER	0.039	0.039	0.0
Emotion Scenarios	Mean Win Rate	Rank 2 (tied)	0.781	Rank 1
MuToX	Mean Accuracy (French/Indonesian)	Lower	0.956	High

Main Takeaways

No single model excels across all 10 aspects; high performance in reasoning does not guarantee fairness or robustness
Baseline systems (ASR + LLM) are highly competitive, particularly in robustness and tasks where text is a good abstraction (like simple emotion detection), but fail in non-speech tasks like music identification
Open-weight models (like Qwen2-Audio) struggle significantly with instruction following compared to closed models, often outputting conversational filler instead of raw labels
Emotion detection results suggest that for some datasets (MELD), text content is sufficient, while for others (MUStARD sarcasm), audio features are critical

📚 Prerequisite Knowledge

Prerequisites

Understanding of Automatic Speech Recognition (ASR) metrics
Familiarity with multimodal Large Language Models
Basic knowledge of evaluation metrics like WER, BLEU, and Exact Match
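As a refresher on WER: it is the word-level edit distance between hypothesis and reference, divided by the reference length, WER = (S + D + I) / N. The sketch below is a minimal version; the normalization (lowercasing and whitespace tokenization) is an assumption, and benchmark implementations typically normalize text more carefully.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion of a reference word
                            curr[j - 1] + 1,      # insertion of a hypothesis word
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)

reference = "the cat sat on the mat"
one_deletion = word_error_rate(reference, "the cat sat on mat")
with_filler = word_error_rate(
    reference, "the speech is in english saying the cat sat on the mat")
print(one_deletion, with_filler)
```

The second call shows why instruction-following failures matter for automated scoring: a perfect transcript wrapped in six words of filler counts as six insertions against a six-word reference, so its WER is 1.0.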

Key Terms

ALM: Audio-Language Model—a multimodal model taking audio and text as input and producing text output

ASR: Automatic Speech Recognition—the task of transcribing spoken audio into text

WER: Word Error Rate, a common metric for ASR performance: the number of word-level errors (substitutions, deletions, insertions) in a transcript divided by the number of words in the reference; lower is better

BLEU: Bilingual Evaluation Understudy—a metric for evaluating machine translation quality by comparing n-grams to reference translations

CoRe-Bench: Conversational Reasoning Benchmark—a new synthetic dataset in this paper testing reasoning over multi-turn audio dialogues

PARADE: A new synthetic dataset in this paper designed to probe stereotyping in ALMs by associating voices with occupations or social status

AHELM: A Holistic Evaluation of Audio-Language Models, the benchmark framework introduced in this paper

counterfactual fairness: Evaluating if a model's output remains consistent when non-essential attributes (like speaker gender) are altered

MFCCs: Mel-frequency Cepstral Coefficients—features commonly extracted from audio signals for speech processing

Mean Win Rate: The probability that a model outperforms another model selected uniformly at random for a given metric in a head-to-head comparison
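The Mean Win Rate definition above can be sketched as follows: within each scenario, a model's win rate is the fraction of other models it beats, and the mean is taken over scenarios. The function name, the strict-inequality tie handling, and the example scores are assumptions for illustration; the framework's exact aggregation may differ.

```python
from typing import Dict

def mean_win_rate(scores: Dict[str, Dict[str, float]],
                  higher_is_better: bool = True) -> Dict[str, float]:
    """scores[scenario][model] -> metric value; returns model -> mean win rate."""
    per_model: Dict[str, list] = {}
    for by_model in scores.values():
        if len(by_model) < 2:
            raise ValueError("each scenario needs at least two models")
        for model, value in by_model.items():
            others = [v for m, v in by_model.items() if m != model]
            if higher_is_better:
                wins = sum(value > v for v in others)  # ties count as no win
            else:
                wins = sum(value < v for v in others)
            per_model.setdefault(model, []).append(wins / len(others))
    return {model: sum(rates) / len(rates) for model, rates in per_model.items()}

# Made-up accuracy values for three hypothetical models on two scenarios.
example_scores = {
    "scenario_a": {"m1": 0.9, "m2": 0.7, "m3": 0.5},
    "scenario_b": {"m1": 0.6, "m2": 0.8, "m3": 0.4},
}
mwr = mean_win_rate(example_scores)
print(mwr)
```

For error-style metrics such as WER, pass `higher_is_better=False` so that a lower value counts as a win.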