MM-AU: Towards Multimodal Understanding of Advertisement Videos

📝 Paper Summary

Video Understanding, Computational Media Understanding, Affective Computing

MM-AU introduces a large-scale multimodal dataset for advertisement understanding and demonstrates that fusing audio, visual, and text modalities improves performance on topic, tone, and social message detection.

Core Problem

Current media understanding datasets largely focus on movies or generic videos, failing to capture the condensed narrative structures, rapid tone transitions, and persuasive social messaging unique to advertisements.

Why it matters:

Advertisements are a primary medium for social influence and product promotion, requiring distinct analysis from feature-length content
Understanding ads requires modeling fine-grained transitions (e.g., negative-to-positive narrative arcs), which static classification misses
Existing ad datasets are often limited to images or lack multimodal annotations for complex reasoning tasks

Concrete Example: An ad might start with a negative tone (sad music, suffering characters) to highlight a problem like pollution, then shift to a positive tone (upbeat music, solution) to promote a brand. A standard sentiment classifier averaging the whole video would miss this crucial narrative arc.

Key Novelty

MM-AU Benchmark & Tone Transition Task

Introduction of 'Tone Transition' as a formal task: tracking the affective shift (Start vs. Middle vs. End) within a condensed video narrative
Curating a multilingual dataset of 8.4K videos with expert annotations for social message presence and topic categorization across 18 classes
A two-stage multimodal fusion approach using PerceiverIO to combine Audio-Visual and Text-Visual signals for high-level semantic reasoning

Architecture

The two-stage multimodal fusion pipeline used for the benchmark tasks

Evaluation Highlights

Proposed A-Max multimodal fusion achieves 65.92% accuracy on Topic Categorization, roughly doubling the performance of zero-shot GPT-4 (33.29%)
Text-Visual fusion (TxTV) proves most effective for Social Message detection (74.03% F1), outperforming Audio-Visual methods (70.05% F1) by about 4 points
Audio-Visual fusion is superior for Tone Transition detection (63.72% F1) compared to Text-Visual methods, highlighting the role of soundscapes in affect

Breakthrough Assessment

7/10

Strong contribution in dataset curation and defining the novel task of tone transition in ads. The modeling approach is a standard application of transformers, but the benchmark enables new research directions.

⚙️ Technical Details

Problem Definition

Setting: Multimodal classification of video advertisements into topics, tone transitions, and social message presence

Inputs: Video sequence v (shots), Audio waveform a, and Text transcript t

Outputs: Topic label (1 of 18), Tone Transition (Binary: 0/1), Social Message Presence (Binary: 0/1)
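The binary Tone Transition output can be illustrated with a small sketch. It assumes the label is derived from per-segment tone labels for the start, middle, and end of the video, and that any difference between segments counts as a transition. The function name `has_tone_transition` and the string labels are hypothetical; the summary does not state the exact derivation rule.

```python
def has_tone_transition(start: str, middle: str, end: str) -> int:
    """Return 1 if the perceived tone differs across segments, else 0.

    Assumption: a transition is any change between the three segment
    labels. This is an illustrative rule, not the paper's stated one.
    """
    return int(len({start, middle, end}) > 1)


# A problem-to-solution ad: negative opening, positive close.
print(has_tone_transition("negative", "negative", "positive"))  # 1
# A uniformly upbeat ad.
print(has_tone_transition("positive", "positive", "positive"))  # 0
```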

Pipeline Flow

Feature Extraction (Audio, Visual, Text)
Stage 1: Modality Pair Encoding (TxTV and TxAV Transformers)
Stage 2: Logit Fusion (A-Max/D-Max)

System Modules

Feature Extractors

Convert raw modalities into dense vector representations

Model or implementation: Visual: CLIP (ViT-B/32); Audio: AST; Text: BERT

TxTV Encoder (Encoding)

Fuse Text and Visual features using cross-attention

Model or implementation: PerceiverIO (4 layers, 16 latents, 8 heads)

TxAV Encoder (Encoding)

Fuse Audio and Visual features using cross-attention

Model or implementation: PerceiverIO (4 layers, 16 latents, 8 heads)
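The latent cross-attention idea behind both encoders can be sketched in plain Python. This is a deliberately reduced illustration, not the paper's model: it uses a single head, no learned projections, one layer, and a toy width of 8 instead of 256, and it assumes the two modalities are simply concatenated along the token axis. The point it shows is that 16 latents produce a fixed-size output regardless of how many input tokens arrive.

```python
import math
import random
from typing import List

Matrix = List[List[float]]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def latent_cross_attention(latents: Matrix, tokens: Matrix) -> Matrix:
    """Each latent (query) attends over all input tokens (keys and values).

    Simplification: queries, keys, and values are used as-is, with no
    learned projection matrices and a single attention head.
    """
    dim = len(latents[0])
    scale = 1.0 / math.sqrt(dim)
    fused = []
    for query in latents:
        scores = [scale * sum(q * k for q, k in zip(query, tok)) for tok in tokens]
        weights = softmax(scores)
        fused.append(
            [sum(w * tok[d] for w, tok in zip(weights, tokens)) for d in range(dim)]
        )
    return fused


def random_tokens(rng: random.Random, count: int, dim: int) -> Matrix:
    """Stand-in for extracted features (hypothetical random values)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(count)]


rng = random.Random(0)
DIM = 8  # toy width; the summary lists a 256-dim hidden size
latents = random_tokens(rng, 16, DIM)  # 16 latents, as in the listed config
visual = random_tokens(rng, 35, DIM)   # 35 visual tokens (max_seq_len_visual)
audio = random_tokens(rng, 14, DIM)    # 14 audio tokens (max_seq_len_audio)

fused = latent_cross_attention(latents, visual + audio)
print(len(fused), len(fused[0]))  # 16 8
```

Because the output shape depends only on the number of latents, the same sketch would return 16 rows if the 14 audio tokens were swapped for several hundred text tokens.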

Fusion Layer

Combine predictions from the two encoders

Model or implementation: A-Max (Average then Argmax) or D-Max (Dual-Max)
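A minimal sketch of the two fusion rules follows, assuming each encoder emits one logit per class and that A-Max averages the two logit vectors before the argmax while D-Max takes the per-class maximum across the two vectors before the argmax. The function names `a_max` and `d_max` and the example logits are illustrative, not taken from the paper's code.

```python
from typing import List


def _argmax(values: List[float]) -> int:
    """Index of the largest value (first one wins on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def a_max(logits_txtv: List[float], logits_txav: List[float]) -> int:
    """Average the two logit vectors, then pick the highest-scoring class."""
    if len(logits_txtv) != len(logits_txav):
        raise ValueError("both encoders must score the same classes")
    averaged = [(x + y) / 2.0 for x, y in zip(logits_txtv, logits_txav)]
    return _argmax(averaged)


def d_max(logits_txtv: List[float], logits_txav: List[float]) -> int:
    """Keep the larger logit per class across encoders, then pick the class."""
    if len(logits_txtv) != len(logits_txav):
        raise ValueError("both encoders must score the same classes")
    strongest = [max(x, y) for x, y in zip(logits_txtv, logits_txav)]
    return _argmax(strongest)


# Made-up binary logits where the two rules disagree.
txtv = [1.0, 0.0]
txav = [-3.0, 0.9]
print(a_max(txtv, txav))  # 1  (averages: [-1.0, 0.45])
print(d_max(txtv, txav))  # 0  (per-class maxima: [1.0, 0.9])
```

The example shows the practical difference: A-Max lets a strongly negative logit from one encoder veto a class, whereas D-Max follows whichever encoder is most confident.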

Modeling

Base Model: PerceiverIO (custom lightweight config: 4 layers, 256 hidden dim)

Training Method: Supervised learning with Binary/Multi-class Cross Entropy

Objective Functions:

Purpose: Minimize classification error for social message/tone/topic.

Formally: Standard Cross-Entropy Loss.
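For reference, the standard cross-entropy losses can be written out as below. The notation is an assumption introduced here, not the paper's: N is the number of training videos, C the number of classes, y the ground-truth label (one-hot in the multi-class case), and p-hat the predicted probability.

```latex
% Multi-class cross-entropy (topic categorization, C = 18 classes)
\mathcal{L}_{\text{CE}} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} y_{i,c} \log \hat{p}_{i,c}

% Binary cross-entropy (social message presence, tone transition)
\mathcal{L}_{\text{BCE}} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log \hat{p}_i + (1 - y_i) \log (1 - \hat{p}_i) \right]
```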

Adaptation: Two-stage training. Stage 1 fully trains the TxTV and TxAV PerceiverIO encoders; Stage 2 keeps both encoders frozen and fuses their logits (A-Max/D-Max)

Trainable Parameters: PerceiverIO weights (Audio/Visual/Text encoders are pretrained/frozen feature extractors)

Training Data:

5877 Train, 830 Val, 1692 Test (70/10/20 split)
8399 total videos from Ads of the World, Cannes Lion, and the Video-Ads dataset
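The split sizes can be checked against the stated total and the approximate 70/10/20 ratio with a few lines of arithmetic. Only the three counts come from the summary; the variable names are illustrative.

```python
# Split sizes as listed in the summary.
splits = {"train": 5877, "val": 830, "test": 1692}

total = sum(splits.values())
shares = {name: round(100 * count / total, 1) for name, count in splits.items()}

print(total)   # 8399
print(shares)  # {'train': 70.0, 'val': 9.9, 'test': 20.1}
```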

Key Hyperparameters:

learning_rate: 1e-4 or 1e-5
batch_size: 16
optimizer: Adam or AdamW
max_seq_len_visual: 35
max_seq_len_audio: 14
max_seq_len_text: 256 or 512

Compute: 4x NVIDIA RTX 2080 Ti GPUs

Comparison to Prior Work

vs. Video-Ads: MM-AU includes 'Tone Transition' and 'Social Message' tasks, not just topic/sentiment
vs. Zero-shot LLMs (GPT-4): MM-AU demonstrates that supervised multimodal models significantly outperform large text-only models on domain-specific ad understanding
vs. Unimodal Baselines: MM-AU introduces a specific two-stage PerceiverIO fusion to leverage complementary modalities

Limitations

Reliance on English translations (via GPT-4) for multilingual transcripts may introduce errors
Tone transition is simplified to binary (Transition/No Transition) rather than predicting the exact sequence (e.g., Neg->Pos)
Zero-shot baselines are text-only; no multimodal zero-shot (e.g., GPT-4V) comparisons included

Reproducibility

Not provided: The paper does not provide a URL for the code or the dataset download, though it describes the dataset curation process in detail. Pretrained models (CLIP, BERT, AST, Whisper) are public.

📊 Experiments & Results

Evaluation Setup

Supervised classification on the MM-AU test set

Benchmarks:

MM-AU Social Message (Binary Classification) [New]
MM-AU Tone Transition (Binary Classification) [New]
MM-AU Topic Categorization (Multi-class Classification, 18 classes) [New]

Metrics:

Accuracy
Macro-F1
Statistical methodology: Standard deviation reported over 5 runs with random seeds
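Since two of the three tasks are scored with Macro-F1, a small from-scratch sketch may help show why it differs from accuracy on imbalanced labels. This is a generic implementation of the standard definition, not the paper's evaluation script, and the toy labels are made up.

```python
from typing import Hashable, Sequence


def macro_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class is never hit.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy imbalanced case: the classifier always predicts class 1.
y_true = [1, 1, 1, 1, 0]
y_pred = [1, 1, 1, 1, 1]
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)                            # 0.8
print(round(macro_f1(y_true, y_pred), 4))  # 0.4444
```

The always-positive classifier scores 80% accuracy but only about 0.44 Macro-F1, because the missed minority class contributes an F1 of zero to the average.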

Key Results

Comparison of zero-shot LLM reasoning vs supervised multimodal baselines. GPT-4 performs best among zero-shot models but lags behind supervised approaches. Note that the "This Paper" value for Topic Categorization in this group (33.29) is the zero-shot GPT-4 accuracy cited under Evaluation Highlights, not the supervised model's result.
Benchmark	Metric	Baseline	This Paper	Δ
MM-AU Social Message	Macro-F1	39.61	65.66	+26.05
MM-AU Topic Categorization	Accuracy	5.71	33.29	+27.58

Multimodal fusion results showing the benefit of combining Text, Audio, and Video over unimodal baselines.
Benchmark	Metric	Baseline	This Paper	Δ
MM-AU Social Message	Macro-F1	72.28	74.03	+1.75
MM-AU Tone Transition	Macro-F1	61.65	64.67	+3.02
MM-AU Topic Categorization	Accuracy	61.30	65.92	+4.62

Experiment Figures

Distribution of perceived tone labels (Positive, Neutral, Negative) across the Start, Middle, and End segments of the ad videos

Main Takeaways

Supervised multimodal models consistently outperform zero-shot LLMs (GPT-4), particularly in topic categorization where visual context is essential
Text transcripts are the most discriminative modality for Social Message detection, while Audio is crucial for detecting Tone Transitions
The proposed two-stage fusion strategies (A-Max, D-Max) effectively combine complementary signals from Audio-Visual and Text-Visual pairs, yielding the best overall performance

📚 Prerequisite Knowledge

Prerequisites

Multimodal learning (fusion strategies)
Transformer architectures (PerceiverIO, BERT, ViT)
Affective computing concepts (Tone, Sentiment)

Key Terms

PerceiverIO: A transformer architecture that maps diverse inputs to a fixed-size latent space, allowing efficient handling of high-dimensional multimodal data

CLIP: Contrastive Language-Image Pre-training—a model that learns to associate images with text descriptions, used here for visual feature extraction

AST: Audio Spectrogram Transformer—a model adapted from vision transformers to process audio spectrograms for classification

Macro-F1: A metric that calculates F1 score for each class independently and then averages them, treating all classes equally regardless of size

Zero-shot reasoning: Using a pre-trained model (like GPT-4) to perform a task without any specific training examples

Tone Transition: A binary classification task determining if the perceived affective tone changes between the start, middle, and end of a video

A-Max: A fusion strategy where the logits (predictions) from two separate models are averaged, and the maximum value determines the class

D-Max: A fusion strategy where the maximum logit value across two models is selected directly to determine the class