MERG: Multimodal Empathetic Response Generation, the task of creating dialogue responses in text, voice, and video that are emotionally aligned with the user's emotional state
MLLM: Multimodal Large Language Model—an AI model capable of processing and generating both text and other modalities like images or audio
TTS: Text-to-Speech—technology that converts written text into spoken audio
Zero-shot: The ability of a model to perform a task without having been explicitly trained on examples of that specific task
Few-shot: Providing a model with a small number of examples (e.g., 1 or 3) in the prompt to guide its performance
Talking Head Generation: Synthesizing a video of a face moving and speaking in synchronization with an input audio track
CoT: Chain-of-Thought—a prompting technique where the model explains its reasoning steps before giving a final answer
OpenVoice: A voice cloning model that controls tone color (timbre) and speaking style independently of each other
DICE-Talk: A generative model for creating talking head videos that disentangles identity from emotion to allow expressive control
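
Example: the zero-shot, few-shot, and CoT entries above differ only in how the prompt is assembled, which a short Python sketch can make concrete. This is an illustration under assumptions, not the prompt format of any particular system: the function name build_prompt, the "User:" / "Response:" layout, and the wording of the CoT instruction are all hypothetical.

```python
# Hypothetical wording for a Chain-of-Thought instruction (an assumption).
COT_INSTRUCTION = "Explain your reasoning step by step before the final response."


def build_prompt(task, examples, query, use_cot=False):
    """Assemble a plain-text prompt.

    task:     instruction describing what the model should do
    examples: list of (user_turn, response) pairs; empty list = zero-shot,
              one or more pairs = few-shot
    query:    the new user turn the model should respond to
    use_cot:  if True, append the CoT instruction to the task description
    """
    header = task + (" " + COT_INSTRUCTION if use_cot else "")
    blocks = [header]
    # Each demonstration is shown in the same layout as the final query.
    for user_turn, response in examples:
        blocks.append(f"User: {user_turn}\nResponse: {response}")
    # The final block leaves the response open for the model to complete.
    blocks.append(f"User: {query}\nResponse:")
    return "\n\n".join(blocks)


task = "Reply empathetically to the user."
demos = [("I failed my exam.", "I'm sorry, that sounds really discouraging.")]

zero_shot = build_prompt(task, [], "My dog is sick.")
one_shot = build_prompt(task, demos, "My dog is sick.")
one_shot_cot = build_prompt(task, demos, "My dog is sick.", use_cot=True)
```

Passing an empty example list gives a zero-shot prompt, one or three pairs give a 1-shot or 3-shot prompt, and use_cot=True only changes the instruction, not the demonstrations.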