MLLM: Multimodal Large Language Model—an AI model capable of processing and reasoning over both text and image inputs
MSA: Multimodal Sentiment Analysis—detecting sentiment (positive/negative/neutral) from text-image pairs
MABSA: Multimodal Aspect-Based Sentiment Analysis—identifying sentiment toward specific aspects/entities within multimodal content
MSR: Multimodal Sarcasm Recognition—detecting sarcasm that often arises from the contradiction between text and image
MHMR: Multimodal Hateful Memes Recognition—identifying hate speech in memes where meaning depends on text-image context
VQA: Visual Question Answering—answering questions based on visual content
MRE: Multimodal Relation Extraction—identifying relationships between entities in a text-image pair
Encoder-Decoder: A neural architecture (like T5) that encodes the input into an intermediate representation and then decodes that representation into an output sequence, often used for sequence-to-sequence tasks
Decoder-only: A neural architecture (like GPT or LLaMA) that predicts each next token from the tokens that precede it, common in generative LLMs