E3RG: Building Explicit Emotion-driven Empathetic Response Generation System with Multimodal Large Language Model

📝 Paper Summary

Multimodal Dialogue Systems, Affective Computing

E3RG is a training-free modular system that explicitly conditions speech and video generation on MLLM-predicted emotions to ensure multimodal alignment and identity consistency.

Core Problem

Existing empathetic response systems rely on expensive fine-tuning and often fail to synchronize emotional cues across text, audio, and video, leading to inconsistent or disjointed avatars.

Why it matters:

Inconsistent emotional cues (e.g., sad text with a happy face) break user immersion and trust in human-computer interaction
Current end-to-end training approaches are computationally expensive and struggle to generalize to new identities without retraining
Maintaining identity consistency (voice timbre and facial appearance) across long conversations is difficult for standard generative models

Concrete Example: If a user shares a traumatic car accident story, a standard system might generate the text 'That sounds scary' but deliver it in a flat, robotic voice with a smiling face, creating a jarring, unempathetic interaction. E3RG instead explicitly conditions both the TTS and the video generator on a 'Fear/Sad' style.

Key Novelty

Explicit Emotion-Driven Modular Generation

Decomposes the complex generation task into three discrete, training-free stages: understanding (predicting emotion label), retrieval (fetching specific voice/face references), and generation (conditioning separate models on the label)
Uses the predicted emotion label as an explicit control signal for independent state-of-the-art generative models (OpenVoice, DICE-Talk) rather than relying on implicit latent space alignment

Architecture

The complete workflow of the E3RG system, illustrating the decomposition into three sub-tasks: Empathy Understanding, Memory Retrieval, and Empathy Generation.

Evaluation Highlights

Secured Top-1 position in the Avatar-based Multimodal Empathy Challenge at ACM MM'25
Achieves 76.3% HIT Rate (emotion prediction accuracy) with Ola-Omni-7B in a 3-shot setting, outperforming text-only baselines
Scored an average of 4.03 in human evaluation across expressiveness, consistency, and naturalness, surpassing the second-place team (3.83)

Breakthrough Assessment

7/10

Effective system integration that won a top-tier challenge. While it relies on off-the-shelf components (OpenVoice, DICE-Talk), the explicit emotion-driven control flow solves key consistency issues without training.

⚙️ Technical Details

Problem Definition

Setting: Multimodal Empathetic Response Generation (MERG) involves generating a multimodal response R_i = {L_i, A_i, V_i} (linguistic, audio, and video components) given the dialogue history D_{<i} and the current query Q_i

Inputs: Multimodal dialogue history containing text, audio, and video of the speaker and listener

Outputs: A synchronized talking-head video response conveying appropriate empathy

Pipeline Flow

Group: Multimodal Empathy Understanding (MEU): MLLM -> Emotion Prediction -> Text Response
Group: Empathy Memory Retrieval (EMR): Retrieve Profile/History -> Retrieve Cache
Group: Multimodal Response Generation (MRG): Map Emotion -> TTS -> Video Gen
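
The three groups above can be sketched as a plain sequential function. This is a minimal illustration, not the authors' implementation: every name here (`run_pipeline`, `Understanding`, `Reference`, `avatar_id`, and the callable signatures) is hypothetical, and the MLLM, TTS, and video models are passed in as opaque callables rather than real OpenVoice or DICE-Talk calls.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Understanding:
    emotion: str  # discrete emotion label predicted by the MLLM
    text: str     # empathetic text response


@dataclass
class Reference:
    voice_path: str  # reference audio that fixes the voice timbre
    face_path: str   # reference image that fixes the facial identity


@dataclass
class Response:
    text: str
    audio_path: str
    video_path: str


def run_pipeline(
    history: List[str],
    query: str,
    avatar_id: str,
    understand: Callable[[List[str], str], Understanding],
    memory: Dict[str, Reference],
    map_emotion: Callable[[str], str],
    synthesize_speech: Callable[[str, str, str], str],
    render_video: Callable[[str, str, str], str],
) -> Response:
    # Stage 1 (MEU): predict the emotion label and the text response.
    understood = understand(history, query)
    # Stage 2 (EMR): fetch the identity references for this avatar.
    ref = memory[avatar_id]
    # Stage 3 (MRG): one mapped label conditions both generators.
    style = map_emotion(understood.emotion)
    audio = synthesize_speech(understood.text, ref.voice_path, style)
    video = render_video(audio, ref.face_path, style)
    return Response(understood.text, audio, video)
```

Because the stages only exchange a label, a text string, and file paths, any one model could in principle be swapped without retraining the others, which is the property the summary attributes to the modular design.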

System Modules

Multimodal Context Encoder (Multimodal Empathy Understanding (MEU))

Process and encode text, audio, and video inputs into a unified representation

Model or implementation: Ola-Omni-7B (best-performing variant)

Emotion Predictor (Multimodal Empathy Understanding (MEU))

Predict the most likely emotion of the speaker to guide the response

Model or implementation: Ola-Omni-7B (Same LLM as encoder)

Text Response Generator (Multimodal Empathy Understanding (MEU))

Generate the linguistic content of the empathetic response

Model or implementation: Ola-Omni-7B (Same LLM)

Emotion Mapper (Multimodal Response Generation (MRG))

Map fine-grained LLM predicted emotions to the coarse emotion banks of the generative models

Model or implementation: Rule-based mapping (Emotion Wheel)
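
A rule-based mapper of this kind can be as small as a lookup table with a fallback. The sketch below is only illustrative: the coarse bank and the fine-to-coarse rules are hypothetical placeholders, not the actual emotion banks of OpenVoice or DICE-Talk and not the mapping used in the paper.

```python
# Hypothetical coarse bank; the real banks are fixed by the TTS and video models.
COARSE_BANK = ("neutral", "happy", "sad", "angry", "fear", "surprise")

# Hypothetical fine-to-coarse rules in the spirit of an emotion wheel,
# where each fine-grained emotion belongs to one coarse family.
FINE_TO_COARSE = {
    "joyful": "happy",
    "grateful": "happy",
    "devastated": "sad",
    "lonely": "sad",
    "furious": "angry",
    "annoyed": "angry",
    "terrified": "fear",
    "anxious": "fear",
    "astonished": "surprise",
}


def map_emotion(fine_label: str, default: str = "neutral") -> str:
    """Map an MLLM-predicted label to a label the generators support."""
    label = fine_label.strip().lower()
    if label in COARSE_BANK:
        return label  # already a supported coarse label
    # Unknown labels fall back to the default instead of raising.
    return FINE_TO_COARSE.get(label, default)
```

Falling back to a neutral style for unmapped labels is a design assumption made here; it keeps generation running when the MLLM emits an unexpected label, at the cost of a less expressive output.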

Expressive TTS (Multimodal Response Generation (MRG))

Synthesize speech with the reference timbre and predicted emotional style

Model or implementation: OpenVoice

Talking Head Generator (Multimodal Response Generation (MRG))

Generate video synchronized with audio and facial emotion

Model or implementation: DICE-Talk

Novel Architectural Elements

Explicit Emotion-Driven Control Loop: The pipeline forces generative models (TTS/Video) to strictly follow the discrete emotion label predicted by the MLLM, preventing 'emotion drift' common in end-to-end models.
Modular Interleaved Execution: Decomposes MERG into Understanding, Retrieval, and Generation phases that can run sequentially without joint training.

Modeling

Base Model: Ola-Omni-7B (Primary MLLM for understanding), OpenVoice (TTS), DICE-Talk (Video)

Compute: Not reported in the paper (Inference-only system)

Comparison to Prior Work

vs. E-CORE: E3RG is training-free and multimodal (video/audio), whereas E-CORE is fine-tuned and text-focused.
vs. EmpathyEar: E3RG uses explicit emotion labels to condition separate generative models, ensuring better alignment than implicit latent embeddings.
vs. LLaMA-Omni [not cited in paper]: E3RG focuses specifically on empathy and expressive generation using specialized sub-modules rather than a single end-to-end speech-text model.

Limitations

Dependence on base MLLM accuracy; if emotion prediction fails, the entire downstream generation (voice/face) will be emotionally incongruent.
Inference latency is likely high due to sequential execution of three large models (MLLM, TTS, Video Generator).
Limited to the pre-defined emotion banks of the generative models (OpenVoice/DICE-Talk), restricting fine-grained emotional nuance.

Reproducibility

Code: https://github.com/RH-Lin/E3RG

The system relies on pre-trained models (Ola-Omni, OpenVoice, DICE-Talk) and does not require training. The prompt templates for emotion prediction and response generation are provided in the paper.

📊 Experiments & Results

Evaluation Setup

Zero-shot and Few-shot evaluation on multimodal dialogue data
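
The difference between the zero-shot and few-shot settings comes down to how many labeled examples are placed in the prompt. The sketch below shows this for a text-only prompt; the template wording, the function name `build_prompt`, and its parameters are assumptions for illustration and do not reproduce the paper's prompt templates, which also involve audio and video inputs.

```python
from typing import List, Sequence, Tuple


def build_prompt(
    examples: Sequence[Tuple[str, str]],
    query: str,
    labels: Sequence[str],
    k: int = 3,
) -> str:
    """Build a k-shot emotion-prediction prompt; k=0 gives the zero-shot case."""
    lines: List[str] = [
        "Choose the speaker's emotion from: " + ", ".join(labels) + ".",
        "",
    ]
    # Each in-context example pairs a dialogue with its gold emotion label.
    for utterance, emotion in examples[:k]:
        lines += [f"Dialogue: {utterance}", f"Emotion: {emotion}", ""]
    # The query is left open so the model completes the label.
    lines += [f"Dialogue: {query}", "Emotion:"]
    return "\n".join(lines)
```

Calling `build_prompt(examples, query, labels, k=0)` yields only the instruction and the query, while `k=3` prepends three labeled demonstrations.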

Benchmarks:

AvaMERG (Multimodal Empathetic Response Generation)

Metrics:

HIT Rate (Emotion Prediction Accuracy)
Dist-1 (Text Diversity)
Human Evaluation (Expressiveness, Consistency, Naturalness)
Statistical methodology: Not explicitly reported in the paper
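
The two automatic metrics can be sketched in a few lines. This is a rough illustration under stated assumptions: HIT Rate is treated as exact-match accuracy in percent, and Dist-1 as the ratio of distinct to total unigrams pooled over all responses with whitespace tokenization. The paper's exact tokenization and aggregation are not given in this summary, so the function names and these choices are assumptions.

```python
from typing import Sequence


def hit_rate(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """Percentage of samples whose predicted emotion equals the gold label."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    hits = sum(p == g for p, g in zip(predicted, gold))
    return 100.0 * hits / len(gold)


def dist_1(responses: Sequence[str]) -> float:
    """Distinct unigrams divided by total unigrams over all responses."""
    tokens = [tok for response in responses for tok in response.lower().split()]
    return len(set(tokens)) / len(tokens) if tokens else 0.0
```

For example, three correct predictions out of four give a HIT Rate of 75.0, and two responses sharing most of their words give a low Dist-1 because repeated unigrams are counted only once in the numerator.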

Key Results

Zero-shot and Few-shot experiments on emotion prediction (HIT Rate) demonstrate the superiority of Omni-Modal LLMs over text-only LLMs.

Benchmark	Metric	Baseline	This Paper	Δ
AvaMERG	HIT Rate (%)	73.2	76.3	+3.1
AvaMERG	Dist-1	0.978	0.990	+0.012

Human evaluation compares E3RG against other competition teams, showing dominance in perceived quality.

Benchmark	Metric	Baseline	This Paper	Δ
AvaMERG (Test Set)	Average Score	3.83	4.03	+0.20
AvaMERG (Test Set)	Emotional Expressiveness	3.6	4.3	+0.7

Experiment Figures

Qualitative zero-shot results visualizing the same avatar generated with different emotion conditions (Neutral, Happy, Fear, Angry, etc.).

Main Takeaways

Omni-modal LLMs (like Ola-Omni) consistently outperform text-only LLMs in emotion prediction (HIT Rate) by effectively leveraging audio-visual context.
Few-shot prompting provides consistent gains over zero-shot settings across all models, validating the in-context learning capability for empathy tasks.
Explicitly driving generative models with emotion labels results in higher human ratings for 'Emotional Expressiveness' than the baseline (4.3 vs. 3.6).
The system maintains high text diversity (Dist-1 of 0.990) while ensuring emotional alignment.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Multimodal Large Language Models (MLLMs)
Familiarity with Text-to-Speech (TTS) synthesis
Basic knowledge of Talking Head Generation (audio-driven facial animation)

Key Terms

MERG: Multimodal Empathetic Response Generation—creating dialogue responses that include text, voice, and video which are emotionally aligned with the user

MLLM: Multimodal Large Language Model—an AI model capable of processing and generating both text and other modalities like images or audio

TTS: Text-to-Speech—technology that converts written text into spoken audio

Zero-shot: The ability of a model to perform a task without having been explicitly trained on examples of that specific task

Few-shot: Providing a model with a small number of examples (e.g., 1 or 3) in the prompt to guide its performance

Talking Head Generation: Synthesizing a video of a face moving and speaking in synchronization with an input audio track

CoT: Chain-of-Thought—a prompting technique where the model explains its reasoning steps before giving a final answer

OpenVoice: A state-of-the-art voice cloning model that can control tone color and style independently

DICE-Talk: A generative model for creating talking head videos that disentangles identity from emotion to allow expressive control