T-SciQ: Teaching Multimodal Chain-of-Thought Reasoning via Large Language Model Signals for Science Question Answering

📝 Paper Summary

Multimodal Reasoning, Chain-of-Thought (CoT) Prompting, Knowledge Distillation

T-SciQ improves multimodal science question answering by training a small student model on a mix of simple Chain-of-Thought and complex Plan-based Chain-of-Thought rationales generated by a Large Language Model.

Core Problem

Existing Multimodal-CoT methods rely on human-annotated rationales, which are costly to collect and are often inaccurate or missing essential external knowledge because annotators have limited expertise.

Why it matters:

Human annotation for complex scientific reasoning is expensive and time-consuming
Annotators often miss external knowledge required for correct reasoning
Captioning-based approaches lose visual information in complex images

Concrete Example: In a science question about animal classification (Figure 1), a human annotator might provide a simple rationale missing the specific biological definition, whereas an LLM can generate a detailed explanation involving external knowledge. Furthermore, simple CoT fails on complex multi-step problems where planning is needed.

Key Novelty

T-SciQ (Teaching Science Question Answering)

Generates two types of teaching data from an LLM: standard CoT for simple problems and Plan-based CoT (PCoT) for complex problems requiring decomposition
Uses a data mixing strategy governed by validation set performance to assign the optimal teaching rationale type (CoT vs. PCoT) to each skill category
Trains a smaller student model using a two-stage framework (rationale generation then answer inference) with these mixed synthetic signals

Architecture

The T-SciQ framework pipeline, including Teaching Data Generation, Data Mixing, and Two-stage Fine-tuning.

Evaluation Highlights

Achieves 96.18% accuracy on ScienceQA, setting a new state-of-the-art
Outperforms the best GPT-4-based few-shot baseline (86.54%) by 9.64 percentage points
Surpasses human performance (88.40%) by 7.78 percentage points

Breakthrough Assessment

9/10

Significant leap in performance on a major benchmark (ScienceQA), surpassing GPT-4 and human baselines by large margins using a smaller distilled model.

⚙️ Technical Details

Problem Definition

Setting: Multimodal Science Question Answering

Inputs: A problem P_i consisting of language input X_{i,la} (question, context, options) and visual input X_{i,v} (image)

Outputs: A rationale T_i and a final answer A_i
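
To make the setting concrete, the inputs and outputs above could be represented as plain data records. This is only an illustrative sketch: the class names, field names, and the text layout produced by `language_input` are assumptions made here, not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ScienceProblem:
    """One problem P_i. Field names are hypothetical, chosen for illustration."""
    question: str
    context: str
    options: List[str]
    image_path: Optional[str] = None  # visual input X_{i,v}; None if no image

    def language_input(self) -> str:
        """Flatten question, context, and options into one string (X_{i,la})."""
        letters = "ABCDEFGHIJ"
        opts = " ".join(f"({letters[k]}) {o}" for k, o in enumerate(self.options))
        return f"Question: {self.question}\nContext: {self.context}\nOptions: {opts}"


@dataclass
class Prediction:
    rationale: str  # T_i, the generated reasoning text
    answer: str     # A_i, the chosen option


problem = ScienceProblem("Which is a mammal?", "N/A", ["frog", "dog"])
print(problem.language_input())
```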

Pipeline Flow

Data Generation (Teacher): Generate QA-CoT and QA-PCoT samples using LLM
Data Mixing: Select best rationale type per skill via validation set
Stage 1 Training: Fine-tune student to generate rationale from question+image
Stage 2 Training: Fine-tune student to infer answer from question+image+rationale
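
The two training stages differ mainly in what the student sees as input and what it is asked to produce. The sketch below shows one plausible way to build the text side of each example; the function names, the `source`/`target` keys, and the `Rationale:` separator are assumptions for illustration, and the image features that accompany each example are omitted.

```python
from typing import Dict


def build_stage1_example(language_input: str, rationale: str) -> Dict[str, str]:
    """Stage 1: the input is the question text, the target is the rationale."""
    return {"source": language_input, "target": rationale}


def build_stage2_example(language_input: str, rationale: str, answer: str) -> Dict[str, str]:
    """Stage 2: the rationale is appended to the input, the target is the answer."""
    return {"source": f"{language_input}\nRationale: {rationale}", "target": answer}


# Toy example (invented for illustration, not from ScienceQA).
question = "Question: Which animal is a mammal?\nOptions: (A) frog (B) dog"
rationale = "Mammals have hair and feed their young milk. A dog is a mammal."

stage1 = build_stage1_example(question, rationale)
stage2 = build_stage2_example(question, rationale, "B")
```

At test time the Stage 2 input would carry the rationale generated by the Stage 1 model rather than a teacher-written one.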

System Modules

SciTeacher (Data Generator)

Generate synthetic rationales (CoT and PCoT) for training data

Model or implementation: Large Language Model (implied GPT-3.5/4 class, referred to as SciTeacher)

Rationale Generator (Student Model Inference)

Generate the reasoning explanation for a test question

Model or implementation: Multimodal-CoT architecture (Transformer + ViT + Gated Fusion)

Answer Inferer (Student Model Inference)

Predict the final answer option based on the question and generated rationale

Model or implementation: Multimodal-CoT architecture (Transformer + ViT + Gated Fusion)

Novel Architectural Elements

Data mixing strategy: For each skill category, select either CoT or PCoT training signals for the student model based on validation set error rates for that skill
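
A minimal sketch of the mixing rule, assuming per-skill validation accuracies are already available for a CoT-trained and a PCoT-trained student. The function names, the dictionary layout, the tie-breaking rule, and the toy numbers are all assumptions for illustration, not the authors' implementation.

```python
from typing import Dict, List


def choose_rationale_type(val_acc: Dict[str, Dict[str, float]]) -> Dict[str, str]:
    """Pick 'cot' or 'pcot' per skill by validation accuracy (ties go to 'cot')."""
    return {
        skill: ("pcot" if acc["pcot"] > acc["cot"] else "cot")
        for skill, acc in val_acc.items()
    }


def mix_datasets(cot_data: List[dict], pcot_data: List[dict],
                 choice: Dict[str, str]) -> List[dict]:
    """For each problem, keep the rationale type chosen for its skill."""
    pcot_by_id = {s["id"]: s for s in pcot_data}
    mixed = []
    for sample in cot_data:
        use_pcot = choice.get(sample["skill"], "cot") == "pcot"
        if use_pcot and sample["id"] in pcot_by_id:
            mixed.append(pcot_by_id[sample["id"]])
        else:
            mixed.append(sample)  # fall back to the CoT rationale
    return mixed


# Toy numbers, invented for illustration only.
val_acc = {
    "classification": {"cot": 0.90, "pcot": 0.85},
    "multi_step": {"cot": 0.70, "pcot": 0.80},
}
cot_data = [
    {"id": 1, "skill": "classification", "rationale": "c1"},
    {"id": 2, "skill": "multi_step", "rationale": "c2"},
]
pcot_data = [
    {"id": 1, "skill": "classification", "rationale": "p1"},
    {"id": 2, "skill": "multi_step", "rationale": "p2"},
]
choice = choose_rationale_type(val_acc)
mixed = mix_datasets(cot_data, pcot_data, choice)
```

Each problem appears exactly once in the mixed set, carrying whichever rationale type scored better on validation for its skill.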

Modeling

Base Model: Multimodal-CoT (based on UnifiedQA-Base and DETR-ResNet50)

Training Method: Supervised Fine-Tuning (Two-stage)

Objective Functions:

Purpose: Maximize likelihood of generating the target rationale (Stage 1).

Formally: L = -\sum_j \log P(t_j \mid X_{i,la}, X_{i,v}, t_{<j})

Purpose: Maximize likelihood of generating the target answer (Stage 2).

Formally: L = -\sum_j \log P(a_j \mid X'_{i}, a_{<j}), where X'_{i} is the input including the rationale
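
Both objectives are the usual token-level negative log-likelihood of a target sequence. As a small numeric illustration, the loss can be computed from the probabilities the model assigns to each correct target token; the function name and the example probabilities below are assumptions for illustration.

```python
import math
from typing import Sequence


def sequence_nll(token_probs: Sequence[float]) -> float:
    """Return -sum_j log p_j, where p_j is the model's conditional
    probability of the j-th correct target token."""
    if any(p <= 0.0 or p > 1.0 for p in token_probs):
        raise ValueError("each probability must lie in (0, 1]")
    return -sum(math.log(p) for p in token_probs)


# A three-token target where the model is fairly confident on each token.
loss = sequence_nll([0.9, 0.8, 0.95])
print(round(loss, 4))
```

A perfectly confident model (all probabilities equal to 1) gives a loss of 0, and the loss grows as any target token becomes less likely.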

Adaptation: Full fine-tuning of the student model

Trainable Parameters: Not explicitly reported in the paper (implied standard fine-tuning of UnifiedQA-Base ~220M)

Training Data:

ScienceQA training set (12,726 examples)
Generated QA-CoT data: Prompted LLM with simple instruction
Generated QA-PCoT data: 3-step prompting (Lecture -> Plan -> Rationale)
Mixed T-SciQ dataset: Union of QA-CoT and QA-PCoT based on validation accuracy per skill
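
The 3-step PCoT generation (Lecture -> Plan -> Rationale) can be sketched as three chained prompts, each conditioned on the previous output. Everything specific here is an assumption: `llm` is a hypothetical callable that maps a prompt string to a completion, and the prompt wording is placeholder text rather than the templates given in the paper.

```python
from typing import Callable, Dict


def generate_pcot(problem_text: str, llm: Callable[[str], str]) -> Dict[str, str]:
    """Chain three prompts: lecture, then plan, then rationale.

    `llm` is a hypothetical prompt -> text function supplied by the caller.
    """
    # Step 1: background knowledge relevant to the problem.
    lecture = llm(f"{problem_text}\nWrite a short lecture on the background knowledge needed.")
    # Step 2: a plan that decomposes the problem, conditioned on the lecture.
    plan = llm(f"{problem_text}\nLecture: {lecture}\nWrite a step-by-step plan to solve the problem.")
    # Step 3: the rationale, conditioned on both the lecture and the plan.
    rationale = llm(
        f"{problem_text}\nLecture: {lecture}\nPlan: {plan}\n"
        "Follow the plan and explain the reasoning."
    )
    return {"lecture": lecture, "plan": plan, "rationale": rationale}


def echo_llm(prompt: str) -> str:
    """Stand-in model for a dry run: reports the prompt length only."""
    return f"<{len(prompt)} chars>"


sample = generate_pcot("Question: Which animal is a mammal?", echo_llm)
```

Passing the model as a callable keeps the sketch independent of any particular LLM client.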

Key Hyperparameters:

learning_rate: Not reported in the paper
batch_size: Not reported in the paper

Compute: Not reported in the paper

Comparison to Prior Work

vs. Multimodal-CoT: T-SciQ uses LLM-generated mixed signals (simple CoT + Plan-based CoT) instead of human annotations
vs. Reason-Teacher: T-SciQ incorporates a data mixing strategy to handle varying problem complexity (simple vs. complex) in a multimodal setting
vs. GPT-4: T-SciQ is a fine-tuned smaller model that outperforms the larger teacher on this specific domain via distillation

Limitations

Relies on the availability and quality of a strong LLM teacher (SciTeacher) to generate training data
The student model architecture is relatively small (UnifiedQA-Base); scaling to larger student backbones is not explored
No statistical significance tests reported for the improvements

Reproducibility

Code: https://github.com/T-SciQ/T-SciQ

Prompt templates for data generation are provided in the paper. Specific student model hyperparameters (learning rate, batch size) are not detailed in the text.

📊 Experiments & Results

Evaluation Setup

Science Question Answering on the ScienceQA benchmark

Benchmarks:

ScienceQA (Multimodal Science Question Answering)

Metrics:

Accuracy (%)
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline (%)	This Paper (%)	Δ (percentage points)
ScienceQA	Accuracy	91.68	96.18	+4.50
ScienceQA	Accuracy	88.40	96.18	+7.78
ScienceQA	Accuracy	86.54	96.18	+9.64
ScienceQA	Accuracy	94.24	96.18	+1.94
ScienceQA	Accuracy	94.94	96.18	+1.24
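
As a quick arithmetic check, the Δ column can be recomputed from the two accuracy columns. The snippet only uses the numbers from the table; the variable names are illustrative.

```python
ours = 96.18
baselines = [91.68, 88.40, 86.54, 94.24, 94.94]

# Differences are in percentage points; round to two decimals to match the table.
deltas = [round(ours - b, 2) for b in baselines]
print(deltas)
```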

Main Takeaways

Mixing simple CoT and complex Plan-based CoT (PCoT) signals yields better performance than using either alone.
LLM-generated rationales can be more effective for teaching than human-annotated rationales, likely due to better coverage of external knowledge.
The student model successfully learns to generalize reasoning capabilities from the teacher, outperforming the teacher on the specific benchmark.
Improvements are consistent across question classes (Natural Science, Social Science, Language Science) and context modalities (Text, Image, No Context).

📚 Prerequisite Knowledge

Prerequisites

Chain-of-Thought (CoT) prompting
Knowledge Distillation / Student-Teacher training
Multimodal Transformers (Vision + Language)
Zero-shot prompting

Key Terms

CoT: Chain-of-Thought—a prompting method that encourages models to generate intermediate reasoning steps before the final answer

PCoT: Plan-based Chain-of-Thought—a reasoning approach where the model first generates a lecture and a plan to decompose a complex problem before solving it

ScienceQA: A large-scale multimodal dataset for science question answering containing questions with images, contexts, and lectures

Knowledge Distillation: A process where a large, capable 'teacher' model generates data to train a smaller, more efficient 'student' model

Zero-shot prompting: Asking a model to perform a task without providing any worked examples (demonstrations) in the prompt