ThinkSound: Chain-of-Thought Reasoning in Multimodal Large Language Models for Audio Generation and Editing

📝 Paper Summary

Video-to-Audio (V2A) Generation
Multimodal Large Language Models (MLLMs)
Audio Editing

ThinkSound decomposes video-to-audio generation into a three-stage pipeline (Foley generation, object refinement, editing) in which structured Chain-of-Thought reasoning from a multimodal LLM guides a unified flow-matching audio model.

Core Problem

End-to-end video-to-audio systems act as black boxes, failing to capture complex compositional nuances like synchronizing multiple events or reasoning about acoustic environments.

Why it matters:

Current models produce generic sounds that lack precise synchronization with subtle visual cues (e.g., specific object motions or environmental interactions)
Users lack fine-grained control over the generation process, unlike professional sound designers who work in iterative stages
Existing multimodal audio models often fragment tasks (generation, editing) into separate specialized models rather than using a unified reasoning-driven framework

Concrete Example: Current systems struggle to tell, from visual dynamics alone, when an owl is chirping versus flapping its wings; they often merge the two into generic bird noises or fail to sync the specific sound event with the visual action.

Key Novelty

Chain-of-Thought (CoT) Driven Interactive Audio Synthesis

Decomposes the audio creation process into three explicit stages: foundational soundscape generation, click-based object refinement, and instruction-based editing
Uses a multimodal LLM to generate structured reasoning text (CoT) that describes temporal and acoustic properties before synthesis, acting as a bridge between visual inputs and audio generation
Unifies all generation and editing tasks into a single flow-matching model that can accept arbitrary combinations of video, text, and audio context

Architecture

The overall framework of ThinkSound, illustrating the MLLM fine-tuning process and the unified audio foundation model architecture.

Evaluation Highlights

Achieves state-of-the-art results on V2A benchmarks, surpassing Diff-Foley and MMAudio on objective metrics like Frechet Audio Distance (FAD)
Excels in the out-of-distribution Movie Gen Audio benchmark, demonstrating robust generalization
User studies show preference for ThinkSound's generated audio over baselines in terms of overall quality and audio-visual alignment

Breakthrough Assessment

8/10

Significant step forward in controllable V2A by successfully integrating CoT reasoning with a unified generative model. The three-stage interactive workflow closely mimics professional sound design.

⚙️ Technical Details

Problem Definition

Setting: Multimodal conditional audio generation and editing

Inputs: Video frames, optional text instructions, optional user clicks (regions of interest), optional existing audio context

Outputs: High-fidelity audio waveform synchronized with the video

Pipeline Flow

Stage 1: Foundational Foley Generation (Video → MLLM CoT → Audio Model)
Stage 2: Interactive Object-Centric Refinement (User Click + Video → MLLM CoT + Masking → Audio Model)
Stage 3: Instruction-Based Editing (Text Instruction + Audio → MLLM CoT → Audio Model)
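
The three stages above can be read as one control flow in which the same audio model is called repeatedly with different conditioning. The sketch below is a minimal illustration under stated assumptions: `Request`, `run_pipeline`, `fake_reason`, and `fake_synthesize` are hypothetical names, not ThinkSound's actual API, and passing the earlier audio as context into Stage 2 is an assumption based on the "optional existing audio context" input.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Request:
    """Hypothetical container for the inputs listed in the problem definition."""
    video: str                                # handle to the video frames
    click: Optional[Tuple[int, int]] = None   # (x, y) region-of-interest click
    instruction: Optional[str] = None         # free-text editing instruction


def run_pipeline(req: Request, reason: Callable, synthesize: Callable):
    """Run Stage 1 always, then Stage 2 and Stage 3 only if their inputs exist.

    reason(stage, **inputs) returns CoT text (stands in for the MLLM);
    synthesize(cot, **conditions) returns audio (stands in for the audio model).
    """
    trace: List[Tuple[str, str]] = []

    # Stage 1: foundational Foley generation from the video alone.
    cot = reason("foley", video=req.video)
    audio = synthesize(cot, video=req.video)
    trace.append(("foley", cot))

    # Stage 2: object-centric refinement driven by a user click.
    if req.click is not None:
        cot = reason("refine", video=req.video, click=req.click)
        # Assumption: the Stage 1 audio is supplied as context.
        audio = synthesize(cot, video=req.video, click=req.click, audio=audio)
        trace.append(("refine", cot))

    # Stage 3: instruction-based editing of the current audio.
    if req.instruction is not None:
        cot = reason("edit", instruction=req.instruction, audio=audio)
        audio = synthesize(cot, audio=audio)
        trace.append(("edit", cot))

    return audio, trace


# Stand-in components so the sketch runs without any model.
def fake_reason(stage, **inputs):
    return f"[{stage}] reasoning over {sorted(inputs)}"


def fake_synthesize(cot, **conditions):
    return [cot]  # pretend waveform


audio, trace = run_pipeline(
    Request("owl.mp4", click=(120, 80), instruction="add wind"),
    fake_reason,
    fake_synthesize,
)
```

Keeping the reasoning engine and the synthesizer as injected callables mirrors the summary's claim that a single audio model serves all stages; only the conditioning changes between calls.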

System Modules

Reasoning Engine

Generate structured text describing audio events, timing, and acoustic properties based on inputs

Model or implementation: VideoLLaMA2 (fine-tuned)

Text Encoders

Encode text inputs for the generative model

Model or implementation: Dual pathway: MetaCLIP (for visual captions) + T5-v1-xl (for CoT reasoning)

Unified Audio Foundation Model

Generate audio latents using flow matching conditioned on multimodal inputs

Model or implementation: MM-DiT (Multimodal Diffusion Transformer) with Flow Matching
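
To make "generate audio latents using flow matching" concrete, the sketch below integrates a velocity field from noise (t = 0) to a latent (t = 1) with fixed-step Euler updates and a classifier-free guidance blend. It is an illustrative sketch, not the paper's sampler: the solver, the step count, the guidance scale, and the toy velocity field are all assumptions (the number of flow-matching steps is not reported).

```python
from typing import Callable, List, Optional

Vec = List[float]


def guided_velocity(v: Callable, x: Vec, t: float, cond, scale: float) -> Vec:
    """Classifier-free guidance: v_uncond + scale * (v_cond - v_uncond)."""
    v_cond = v(x, t, cond)
    v_uncond = v(x, t, None)
    return [u + scale * (c - u) for c, u in zip(v_cond, v_uncond)]


def euler_sample(v: Callable, x0: Vec, cond, steps: int = 20, scale: float = 1.0) -> Vec:
    """Integrate dx/dt = v(x, t, cond) from t = 0 to t = 1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        vel = guided_velocity(v, x, t, cond, scale)
        x = [xi + dt * vi for xi, vi in zip(x, vel)]
    return x


def toy_velocity(x: Vec, t: float, cond: Optional[Vec]) -> Vec:
    """Toy field standing in for the trained model.

    With a condition, it points at the conditioned target along a straight
    path; without one, it returns zero velocity.
    """
    if cond is None:
        return [0.0 for _ in x]
    return [(c - xi) / (1.0 - t) for c, xi in zip(cond, x)]


target = [0.5, -1.0, 2.0]
sample = euler_sample(toy_velocity, x0=[0.0, 0.0, 0.0], cond=target, steps=20, scale=1.0)
```

In a real system the velocity function would be the MM-DiT evaluated on the latent together with video, text, and audio conditioning, and the resulting latent would then be passed to the audio decoder.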

Audio Decoder

Convert generated latents back to waveform

Model or implementation: Pre-trained VAE Decoder (DAC/HiFi-GAN based)

Novel Architectural Elements

Unified Flow Matching model trained with random modality dropout to handle arbitrary combinations of Video/Text/Audio inputs
Adaptive Fusion Module that gates video features into the audio latent space to capture subtle visual dynamics
Integration of explicit CoT text embeddings (via T5) as a primary conditioning signal alongside standard captions
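
Random modality dropout, the first item above, can be sketched as follows. This is a minimal illustration assuming each modality is dropped independently with a single probability `p_drop`; the per-modality probabilities are not reported, and `drop_modalities` is a hypothetical helper name.

```python
import random
from typing import Dict, Optional


def drop_modalities(
    cond: Dict[str, object], p_drop: float, rng: random.Random
) -> Dict[str, Optional[object]]:
    """Independently replace each conditioning input with None with probability p_drop.

    In a trained model, None would typically be mapped to a learned "empty"
    embedding; that mapping is outside this sketch.
    """
    return {
        name: (None if rng.random() < p_drop else value)
        for name, value in cond.items()
    }


rng = random.Random(0)
example = {"video": "frames", "text": "CoT reasoning", "audio": "context latents"}
kept = drop_modalities(example, p_drop=0.0, rng=rng)     # nothing dropped
dropped = drop_modalities(example, p_drop=1.0, rng=rng)  # everything dropped
```

Training on randomly thinned conditioning is what lets one model accept any subset of video, text, and audio at inference time, and it also provides the unconditional branch that classifier-free guidance needs.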

Modeling

Base Model: VideoLLaMA2 (for reasoning) + MM-DiT (for audio synthesis)

Training Method: Two-stage training: (1) SFT for MLLM on AudioCoT, (2) Conditional Flow Matching training for Audio Model

Objective Functions:

Purpose: Optimize MLLM to generate correct reasoning chains.

Formally: Standard Cross-Entropy Loss for next-token prediction on AudioCoT dataset.

Purpose: Train audio model to predict velocity field.

Formally: Flow Matching objective $\mathcal{L}_{\mathrm{FM}} = \mathbb{E}\left[\lVert v_t(x_t) - u_t(x_t \mid x_1) \rVert^2\right]$, where $v_t$ is the model prediction, $u_t$ is the target vector field, $x_t$ is the noisy sample at time $t$, and $x_1$ is the data sample.
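
The flow matching objective can be illustrated with a small Monte Carlo estimate. The sketch assumes a straight-line probability path, x_t = (1 - t) x_0 + t x_1 with Gaussian noise x_0, so that the target velocity is x_1 - x_0; the summary does not state which path the paper uses, and `cfm_loss` is a hypothetical helper.

```python
import random
from typing import Callable, List

Vec = List[float]


def cfm_loss(model: Callable[[Vec, float], Vec], x1_batch: List[Vec], rng: random.Random) -> float:
    """Estimate E[ ||v_t(x_t) - u_t(x_t | x_1)||^2 ] under a linear path (assumption)."""
    total = 0.0
    for x1 in x1_batch:
        x0 = [rng.gauss(0.0, 1.0) for _ in x1]                # noise sample
        t = rng.random()                                      # time in [0, 1)
        x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]  # point on the path
        u = [b - a for a, b in zip(x0, x1)]                   # target velocity
        v = model(x_t, t)                                     # model prediction
        total += sum((vi - ui) ** 2 for vi, ui in zip(v, u))
    return total / len(x1_batch)


# Toy check: if the data is a single point, the exact field is (x1 - x_t) / (1 - t).
data_point = [1.0, -2.0]


def exact_field(x_t: Vec, t: float) -> Vec:
    return [(d - x) / (1.0 - t) for d, x in zip(data_point, x_t)]


def zero_field(x_t: Vec, t: float) -> Vec:
    return [0.0 for _ in x_t]


loss_exact = cfm_loss(exact_field, [data_point] * 8, random.Random(0))
loss_zero = cfm_loss(zero_field, [data_point] * 8, random.Random(0))
```

The exact field drives the loss to (numerically) zero while the zero field does not, which is the behavior the objective is meant to reward.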

Training Data:

AudioCoT dataset: Video-audio pairs from VGGSound and AudioSet
Audio-text pairs from AudioSet-SL, Freesound, AudioCaps, BBC Sound Effects
Structured CoT annotations generated via pipeline using Qwen2-Audio, Grounded SAM2, and GPT-4.1-nano

Key Hyperparameters:

dropout_probability_p_drop: Not reported per modality; dropout is mentioned only as part of the classifier-free guidance strategy
flow_matching_steps: Not explicitly reported

Compute: Not reported in the paper

Comparison to Prior Work

vs. DeepSound-V1: ThinkSound uses a single unified foundation model for all stages vs. separate models for generation/removal
vs. MMAudio: ThinkSound integrates explicit CoT reasoning embeddings vs. direct video/text conditioning
vs. Diff-Foley: ThinkSound supports interactive object-centric refinement and instruction editing vs. end-to-end only
vs. SonicVisionLM: ThinkSound preserves visual dynamics via video features vs. converting video to text captions only

Limitations

Heavy reliance on the quality of the upstream MLLM (VideoLLaMA2) and the generated CoT data
Inference latency likely higher due to the multi-stage process (MLLM inference + Audio generation)
Performance depends on the accuracy of the automated CoT annotation pipeline (Grounded SAM2, GPT-4.1-nano) used to create the training data

Reproducibility

Code: https://ThinkSound-Project.github.io

Project page available at https://ThinkSound-Project.github.io. Code availability mentioned. AudioCoT dataset construction details provided in paper. Pre-trained weights for specific components (VideoLLaMA2, Qwen2-Audio) are public, but the fine-tuned ThinkSound weights' release status is not explicitly confirmed beyond the project page.

📊 Experiments & Results

Evaluation Setup

Video-to-Audio generation and editing tasks

Benchmarks:

VGGSound (In-distribution Video-to-Audio Generation)
Movie Gen Audio (Out-of-distribution / Cinematic Audio Generation)

Metrics:

Frechet Audio Distance (FAD)
KL Divergence (KLD)
Inception Score (IS)
CLAP Score (Audio-Text and Audio-Video alignment)
Statistical methodology: Not explicitly reported in the paper
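
Of the metrics above, FAD is the Fréchet distance between two Gaussians fitted to embeddings of real and generated audio. The sketch below computes that distance with NumPy (an assumed dependency). Which embedding model the paper uses is not stated in this summary, so the function takes precomputed embedding matrices, and `frechet_distance` is a hypothetical helper name.

```python
import numpy as np


def frechet_distance(emb_real: np.ndarray, emb_gen: np.ndarray) -> float:
    """Fréchet distance between Gaussians fitted to two (n_samples, dim) arrays.

    d^2 = ||mu_r - mu_g||^2 + Tr(C_r + C_g - 2 (C_r C_g)^(1/2))
    """
    mu_r, mu_g = emb_real.mean(axis=0), emb_gen.mean(axis=0)
    cov_r = np.cov(emb_real, rowvar=False)
    cov_g = np.cov(emb_gen, rowvar=False)

    # Tr((C_r C_g)^(1/2)) equals the sum of square roots of the eigenvalues
    # of C_r C_g, which are real and non-negative up to numerical noise.
    eigvals = np.linalg.eigvals(cov_r @ cov_g)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r) + np.trace(cov_g) - 2.0 * trace_sqrt)


rng = np.random.default_rng(0)
real = rng.normal(size=(200, 4))            # stand-in for real-audio embeddings
shifted = real + np.array([1.0, 2.0, 0.0, 0.0])  # same covariance, shifted mean

same_score = frechet_distance(real, real)
shift_score = frechet_distance(real, shifted)
```

When only the mean differs, the distance reduces to the squared mean shift (here 1² + 2² = 5), which is a convenient sanity check for any FAD implementation.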

Key Results

ThinkSound demonstrates superior audio quality and alignment on the VGGSound benchmark compared to current state-of-the-art baselines.

Benchmark	Metric	Baseline	This Paper	Δ
VGGSound	FAD	1.47	0.78	-0.69
VGGSound	CLAP-Score	0.29	0.36	+0.07

Generalization performance on the out-of-distribution Movie Gen Audio benchmark highlights the robustness of the CoT-guided approach.

Benchmark	Metric	Baseline	This Paper	Δ
Movie Gen Audio	FAD	3.28	2.15	-1.13
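
The Δ column is the reported score minus the baseline score; lower is better for FAD and higher is better for CLAP-Score. The short script below recomputes Δ from the numbers above and adds the relative change, which is a derived quantity and not a figure reported in the summary.

```python
# (benchmark, metric, baseline, this paper), copied from the results above.
rows = [
    ("VGGSound", "FAD", 1.47, 0.78),
    ("VGGSound", "CLAP-Score", 0.29, 0.36),
    ("Movie Gen Audio", "FAD", 3.28, 2.15),
]


def summarize(baseline: float, ours: float):
    """Return (absolute delta, relative change in percent), both rounded."""
    delta = round(ours - baseline, 2)
    relative = round(100.0 * (ours - baseline) / baseline, 1)
    return delta, relative


for bench, metric, base, ours in rows:
    delta, relative = summarize(base, ours)
    print(f"{bench:16s} {metric:10s} delta={delta:+.2f} relative={relative:+.1f}%")
```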

Experiment Figures

Comparison of ThinkSound's three-stage workflow vs. traditional end-to-end V2A.

Main Takeaways

ThinkSound consistently outperforms baselines (Diff-Foley, MMAudio) across objective metrics (FAD, CLAP) on both in-domain and out-of-domain datasets
The integration of CoT reasoning improves both the acoustic fidelity and the semantic alignment of the generated audio
The unified model architecture successfully handles diverse tasks (generation, editing) without needing separate specialized models

📚 Prerequisite Knowledge

Prerequisites

Understanding of Multimodal Large Language Models (MLLMs)
Familiarity with Flow Matching for generative modeling
Basic knowledge of audio signal processing (spectrograms, VAEs)

Key Terms

CoT: Chain-of-Thought—a prompting technique where models generate intermediate reasoning steps before producing the final output

V2A: Video-to-Audio—the task of generating sound tracks that correspond to silent video inputs

Flow Matching: A generative modeling technique that learns a velocity field to transform noise into data, offering an alternative to diffusion models

MLLM: Multimodal Large Language Model—an LLM capable of processing and reasoning about non-text inputs like images and audio

Foley: The reproduction of everyday sound effects that are added to film, video, and other media in post-production

VAE: Variational Autoencoder—a neural network used here to compress audio into latent representations for efficient generation

ROI: Region of Interest—a specific area within a video frame selected (e.g., by user click) for targeted processing

CFG: Classifier-Free Guidance—a technique in generative models to control the strength of conditioning signals (like text or video) during sampling

AdaLN: Adaptive Layer Normalization—a mechanism to inject conditioning information (like time embeddings or global context) into network layers

DiT: Diffusion Transformer—a transformer-based architecture used for diffusion (or flow matching) models, replacing the traditional U-Net

FAD: Frechet Audio Distance—a metric for evaluating audio quality by comparing statistics of generated audio embeddings against real audio

CLAP: Contrastive Language-Audio Pretraining—a model used to compute similarity scores between audio and text/video for evaluation