Negative-space prompting: Prompting the model to explicitly identify information it does NOT have (labeling it 'Uncertain') rather than hallucinating an answer.
Duty-Distinct: Assigning a distinct role to each model, with the LLM handling reasoning and logic and a VQA model handling visual perception.
Deep-Layer Prompting (DLP): Inserting learnable prompts into multiple layers of the transformer encoder to facilitate better cross-modal alignment.
Rational-Compressed Visual Embedding (RCVE): Using the generated text rationale to attend to and filter visual features before feeding them into the language model.
Hallucination: Generation of factually incorrect information, or of details not present in the source input (e.g., describing objects that are not in an image).
VQA: Visual Question Answering, a task in which a system answers a natural-language question about an image.
ScienceQA: A multimodal benchmark dataset consisting of science questions with images and explanations.
UnifiedQA: A T5-based language model fine-tuned on multiple QA datasets, used here as the base model for fine-tuning experiments.
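
As a rough illustration of how the negative-space prompting and Duty-Distinct entries fit together, the sketch below builds a prompt that tells the LLM to answer 'Uncertain' when a sub-question needs visual information, then routes only those sub-questions to a VQA model. All names here (`build_negative_space_prompt`, `route_sub_answers`, `stub_vqa`) and the prompt wording are hypothetical, chosen for illustration; they are not taken from the source, and the VQA model is replaced by a stub.

```python
from typing import Callable, Dict

UNCERTAIN = "Uncertain"


def build_negative_space_prompt(question: str) -> str:
    """Ask the LLM to decompose a question and flag what it cannot know from text alone.

    The wording is an assumption for illustration, not the prompt used in the source.
    """
    return (
        "Break the question into sub-questions and answer each one.\n"
        "You cannot see the image. If a sub-question needs visual information, "
        f"answer exactly '{UNCERTAIN}' instead of guessing.\n"
        f"Question: {question}\n"
    )


def route_sub_answers(
    sub_answers: Dict[str, str], vqa: Callable[[str], str]
) -> Dict[str, str]:
    """Keep the LLM's own answers; send 'Uncertain' sub-questions to the VQA model."""
    resolved: Dict[str, str] = {}
    for sub_question, answer in sub_answers.items():
        if answer.strip().lower() == UNCERTAIN.lower():
            # Visual perception is the VQA model's duty, not the LLM's.
            resolved[sub_question] = vqa(sub_question)
        else:
            # Reasoning and general knowledge stay with the LLM.
            resolved[sub_question] = answer
    return resolved


def stub_vqa(sub_question: str) -> str:
    """Hypothetical stand-in for a real VQA model call."""
    return f"[VQA answer to: {sub_question}]"


if __name__ == "__main__":
    print(build_negative_space_prompt("Which object in the image is magnetic?"))
    llm_output = {
        "What objects are shown in the image?": "Uncertain",
        "Is iron attracted to magnets?": "Yes",
    }
    print(route_sub_answers(llm_output, stub_vqa))
```

The point of the split is that the LLM never has to guess at image content: anything it marks 'Uncertain' is filled in by the perception model, which is how this division of labor is meant to reduce hallucination.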