DPO: Direct Preference Optimization, a method that aligns a model to preference data by training directly on preferred/rejected response pairs, without a separate reward model.
Omni-LLM: A large language model that can process and reason over text, audio, image, and video inputs simultaneously.
Cross-modal hallucination: When a model reports entities in one modality (e.g., audio) solely because they appear in another (e.g., video), with no supporting evidence in the first modality.
Language Prior Debiasing (LPD): A penalty term introduced to suppress model responses that are driven purely by the language model's internal statistical priors rather than sensory input.
Invariance: The property where the model's prediction remains stable even if the irrelevant modality (e.g., audio for a visual question) is corrupted.
Sensitivity: The property where the model's prediction changes drastically if the relevant modality (e.g., video for a visual question) is corrupted.
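
To make the DPO, Invariance, and Sensitivity entries concrete, here is a minimal Python sketch. The `dpo_loss` function follows the standard published DPO objective for a single preference pair. The invariance and sensitivity measurements are only an illustration: the use of total variation distance, the function names, and the example probability vectors are all assumptions made for this sketch, not something the definitions above specify.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(pol_chosen: float, pol_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one preference pair.

    Inputs are sequence log-probabilities of the chosen and rejected
    responses under the policy and under a frozen reference model.
    """
    margin = (pol_chosen - ref_chosen) - (pol_rejected - ref_rejected)
    return -log_sigmoid(beta * margin)


def total_variation(p: list[float], q: list[float]) -> float:
    """Total variation distance between two answer distributions.

    Assumption: this metric is chosen for illustration only.
    """
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


# Hypothetical answer distributions for a *visual* question.
clean = [0.70, 0.20, 0.10]            # all modalities intact
audio_corrupted = [0.68, 0.22, 0.10]  # irrelevant modality corrupted
video_corrupted = [0.34, 0.33, 0.33]  # relevant modality corrupted

# Invariance: the shift should be small when the irrelevant modality changes.
invariance_shift = total_variation(clean, audio_corrupted)
# Sensitivity: the shift should be large when the relevant modality changes.
sensitivity_shift = total_variation(clean, video_corrupted)

print(f"invariance shift:  {invariance_shift:.2f}")
print(f"sensitivity shift: {sensitivity_shift:.2f}")
print(f"DPO loss at zero margin: {dpo_loss(-1.0, -1.0, -1.0, -1.0):.4f}")
```

Under these assumptions, a well-behaved model shows a small invariance shift and a large sensitivity shift. The DPO loss equals log 2 when the policy and reference agree, and it falls below that as the policy favors the chosen response more strongly than the reference does.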