T-SciQ improves multimodal science question answering by training a small student model on a mix of simple Chain-of-Thought and complex Plan-based Chain-of-Thought rationales generated by a Large Language Model.
Core Problem
Existing Multimodal-CoT methods rely on human-annotated rationales, which are costly to collect and often lack essential external information or accuracy due to limited annotator expertise.
Why it matters:
Human annotation for complex scientific reasoning is expensive and time-consuming
Annotators often miss external knowledge required for correct reasoning
Captioning-based approaches lose visual information in complex images
Concrete Example: In a science question about animal classification (Figure 1), a human annotator might provide a simple rationale that omits the specific biological definition, whereas an LLM can generate a detailed explanation that draws on external knowledge. Furthermore, simple CoT fails on complex multi-step problems where planning is needed.
Key Novelty
T-SciQ (Teaching Science Question Answering)
Generates two types of teaching data from an LLM: standard CoT for simple problems and Plan-based CoT (PCoT) for complex problems requiring decomposition
Uses a data mixing strategy governed by validation set performance to assign the optimal teaching rationale type (CoT vs. PCoT) to each skill category
Trains a smaller student model using a two-stage framework (rationale generation then answer inference) with these mixed synthetic signals
Architecture
The T-SciQ framework pipeline, including Teaching Data Generation, Data Mixing, and Two-stage Fine-tuning.
Evaluation Highlights
Achieves 96.18% accuracy on ScienceQA, setting a new state-of-the-art
Outperforms the best GPT-4 based few-shot baseline (86.54%) by 9.64 percentage points
Surpasses human performance (88.40%) by 7.78 percentage points
Breakthrough Assessment
9/10
Significant leap in performance on a major benchmark (ScienceQA), surpassing GPT-4 and human baselines by large margins using a smaller distilled model.
⚙️ Technical Details
Problem Definition
Setting: Multimodal Science Question Answering
Inputs: A problem P_i consisting of language input X_{i,la} (question, context, options) and visual input X_{i,v} (image)
Outputs: A rationale T_i and a final answer A_i
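The problem definition above can be written down as a small data structure. This is only an illustrative sketch: the class names, field names, and the flattened prompt layout are assumptions made for this example and are not taken from the paper or its code.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ScienceProblem:
    """One problem P_i: language input X_{i,la} plus visual input X_{i,v}."""
    question: str
    context: str
    options: List[str]
    # X_{i,v}; allowing None for text-only problems is an assumption of this sketch.
    image_path: Optional[str] = None

    def language_input(self) -> str:
        """Flatten question, context and options into one string (hypothetical layout)."""
        opts = " ".join(f"({chr(ord('A') + k)}) {o}" for k, o in enumerate(self.options))
        return f"Question: {self.question}\nContext: {self.context}\nOptions: {opts}"


@dataclass
class Prediction:
    """Model output: rationale T_i and final answer A_i."""
    rationale: str
    answer: str


problem = ScienceProblem("Which is a mammal?", "N/A", ["frog", "whale"])
prompt = problem.language_input()
```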
Pipeline Flow
Data Generation (Teacher): Generate QA-CoT and QA-PCoT samples using LLM
Data Mixing: Select best rationale type per skill via validation set
Stage 1 Training: Fine-tune student to generate rationale from question+image
Stage 2 Training: Fine-tune student to infer answer from question+image+rationale
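At inference time, the two fine-tuned stages listed above chain together: the first produces a rationale, and the second reads the original input plus that rationale. The sketch below shows only the control flow. The function names, the `Solution:` separator, and the toy stand-in models are all hypothetical; a real student would be a fine-tuned multimodal model rather than a Python lambda.

```python
from typing import Callable, Optional, Tuple

# Hypothetical signature: (language input, raw image bytes or None) -> generated text.
StageModel = Callable[[str, Optional[bytes]], str]


def two_stage_predict(language_input: str,
                      image: Optional[bytes],
                      rationale_model: StageModel,
                      answer_model: StageModel) -> Tuple[str, str]:
    # Stage 1: generate the rationale from question + image.
    rationale = rationale_model(language_input, image)
    # Stage 2: append the rationale to the language input, then infer the answer.
    # The "Solution:" separator is an assumption of this sketch.
    augmented = f"{language_input}\nSolution: {rationale}"
    answer = answer_model(augmented, image)
    return rationale, answer


# Toy stand-ins so the sketch runs without any model weights.
def toy_rationale_model(text: str, image: Optional[bytes]) -> str:
    return "Whales nurse their young, so they are mammals."


def toy_answer_model(text: str, image: Optional[bytes]) -> str:
    return "(B)" if "mammals" in text else "(A)"


rationale, answer = two_stage_predict(
    "Question: Which is a mammal? Options: (A) frog (B) whale",
    None,
    toy_rationale_model,
    toy_answer_model,
)
```

Keeping the stages as separate callables mirrors the two separately fine-tuned passes; the answer stage never sees anything from the rationale stage except its text.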
System Modules
SciTeacher (Data Generator)
Generate synthetic rationales (CoT and PCoT) for training data
Model or implementation: Large Language Model (implied GPT-3.5/4 class, referred to as SciTeacher)
Rationale Generator (Student Model Inference)
Generate the reasoning explanation for a test question
Model or implementation: Multimodal-CoT architecture (Transformer + ViT + Gated Fusion)
Answer Inferer (Student Model Inference)
Predict the final answer option based on the question and generated rationale
Model or implementation: Multimodal-CoT architecture (Transformer + ViT + Gated Fusion)
Novel Architectural Elements
Data mixing strategy: For each skill, selects either CoT or PCoT rationales as the student's training signal, based on validation-set error rates for that skill
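A minimal sketch of the per-skill selection described above follows. It assumes that a validation accuracy is already available for each skill under each rationale type; how those accuracies are obtained is not spelled out in this summary. The function names, dictionary keys, skill names, and numbers are invented for illustration.

```python
from typing import Dict, List


def choose_rationale_type(val_acc_cot: Dict[str, float],
                          val_acc_pcot: Dict[str, float]) -> Dict[str, str]:
    """For each skill, pick the rationale type with the higher validation accuracy."""
    choice = {}
    for skill, acc_cot in val_acc_cot.items():
        # Ties fall back to plain CoT (an arbitrary choice for this sketch).
        choice[skill] = "pcot" if val_acc_pcot.get(skill, 0.0) > acc_cot else "cot"
    return choice


def mix_dataset(examples: List[dict], choice: Dict[str, str]) -> List[dict]:
    """Keep, for every training example, only the rationale type chosen for its skill."""
    mixed = []
    for ex in examples:
        kind = choice.get(ex["skill"], "cot")
        mixed.append({"id": ex["id"], "skill": ex["skill"], "rationale": ex[kind]})
    return mixed


# Toy validation accuracies, invented purely for illustration.
choice = choose_rationale_type(
    {"classification": 0.91, "multi_step_physics": 0.78},
    {"classification": 0.89, "multi_step_physics": 0.85},
)
examples = [
    {"id": 1, "skill": "classification", "cot": "short rationale", "pcot": "planned rationale"},
    {"id": 2, "skill": "multi_step_physics", "cot": "short rationale", "pcot": "planned rationale"},
]
mixed = mix_dataset(examples, choice)
```

In this toy run the simpler skill keeps its CoT rationale and the multi-step skill switches to PCoT, which is the behaviour the mixing strategy is meant to produce.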
Modeling
Base Model: Multimodal-CoT (based on UnifiedQA-Base and DETR-ResNet50)
Training Method: Supervised Fine-Tuning (Two-stage)
Objective Functions:
Purpose: Maximize likelihood of generating the target rationale (Stage 1).
Formally: $L = -\sum_{j} \log P(t_j \mid X_{i,la}, X_{i,v}, t_{<j})$
Purpose: Maximize likelihood of generating the target answer (Stage 2).
Formally: $L = -\sum_{j} \log P(a_j \mid X'_{i}, a_{<j})$, where $X'_{i}$ includes the rationale
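Both objectives above are token-level negative log-likelihoods, so a tiny numeric example may help. The sketch assumes the per-token probabilities of the correct target tokens are already given; the probabilities are invented, and a real implementation would compute them with a sequence-to-sequence model under teacher forcing.

```python
import math
from typing import List


def sequence_nll(token_probs: List[float]) -> float:
    """Negative log-likelihood of a target sequence.

    token_probs[j] is the model's probability for the correct j-th target token,
    i.e. P(t_j | inputs, t_<j) for Stage 1 or P(a_j | inputs, a_<j) for Stage 2.
    """
    return -sum(math.log(p) for p in token_probs)


# Toy probabilities for a three-token target (invented for illustration).
loss = sequence_nll([0.5, 0.25, 0.5])
# The product of the probabilities is 1/16, so the loss equals log(16).
```

Minimizing this quantity is the same as maximizing the likelihood of the target sequence, which is why both stages can share one loss and differ only in input and target.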
Adaptation: Full fine-tuning of the student model
Trainable Parameters: Not explicitly reported in the paper (implied standard fine-tuning of UnifiedQA-Base ~220M)
Training Data:
ScienceQA training set (12,726 examples)
Generated QA-CoT data: Prompted LLM with simple instruction
Generated QA-PCoT data: 3-step prompting (Lecture -> Plan -> Rationale)
Mixed T-SciQ dataset: Union of QA-CoT and QA-PCoT based on validation accuracy per skill
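The 3-step prompting for QA-PCoT listed above (Lecture -> Plan -> Rationale) can be pictured as three chained LLM calls, each feeding its output into the next prompt. The sketch below illustrates only that chaining: the prompt wording is hypothetical and is not the paper's template, and `fake_llm` is a stand-in that lets the example run offline.

```python
from typing import Callable, Dict, List


def generate_pcot(question: str, llm: Callable[[str], str]) -> Dict[str, str]:
    """Chain three calls: lecture -> plan -> rationale (prompt text is hypothetical)."""
    lecture = llm(f"Write a short lecture with the background needed for:\n{question}")
    plan = llm(f"Lecture:\n{lecture}\n\nOutline a step-by-step plan to solve:\n{question}")
    rationale = llm(
        f"Lecture:\n{lecture}\n\nPlan:\n{plan}\n\n"
        f"Follow the plan and explain the solution to:\n{question}"
    )
    return {"lecture": lecture, "plan": plan, "rationale": rationale}


# Stand-in for a real LLM client: records each prompt and returns a numbered tag.
calls: List[str] = []


def fake_llm(prompt: str) -> str:
    calls.append(prompt)
    return f"<output {len(calls)}>"


result = generate_pcot("Which of these animals is a mammal?", fake_llm)
```

Swapping `fake_llm` for a function that calls an actual model would leave `generate_pcot` unchanged, since it depends only on a string-to-string callable.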
Key Hyperparameters:
learning_rate: Not reported in the paper
batch_size: Not reported in the paper
Compute: Not reported in the paper
Comparison to Prior Work
vs. Multimodal-CoT: T-SciQ uses LLM-generated mixed signals (simple CoT + Plan-based CoT) instead of human annotations
vs. Reason-Teacher: T-SciQ incorporates a data mixing strategy to handle varying problem complexity (simple vs. complex) in a multimodal setting
vs. GPT-4: T-SciQ is a fine-tuned smaller model that outperforms the larger teacher on this specific domain via distillation
Limitations
Relies on the availability and quality of a strong LLM teacher (SciTeacher) to generate training data
The student model architecture is relatively small (UnifiedQA-Base); scaling to larger student backbones is not explored
No statistical significance tests reported for the improvements
Code is publicly available at https://github.com/T-SciQ/T-SciQ. Prompt templates for data generation are provided in the paper. Specific student model hyperparameters (LR, batch size) are not detailed in the text.
📊 Experiments & Results
Evaluation Setup
Science Question Answering on the ScienceQA benchmark
Benchmarks:
ScienceQA (Multimodal Science Question Answering)
Metrics:
Accuracy (%)
Statistical methodology: Not explicitly reported in the paper
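For reference, the accuracy metric listed above is the percentage of questions whose predicted option matches the gold option. The sketch below also shows a per-group breakdown of the kind used to report results by question class. Function names, group labels, and data are invented for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def accuracy(preds: List[str], golds: List[str]) -> float:
    """Percentage of predictions that exactly match the gold answers."""
    if len(preds) != len(golds) or not golds:
        raise ValueError("preds and golds must be non-empty and of equal length")
    return 100.0 * sum(p == g for p, g in zip(preds, golds)) / len(golds)


def accuracy_by_group(records: List[Tuple[str, str, str]]) -> Dict[str, float]:
    """records holds (group, prediction, gold) triples; returns accuracy per group."""
    buckets: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for group, pred, gold in records:
        buckets[group].append((pred, gold))
    return {
        group: accuracy([p for p, _ in pairs], [g for _, g in pairs])
        for group, pairs in buckets.items()
    }


overall = accuracy(["A", "B", "C", "A"], ["A", "B", "A", "A"])  # 3 of 4 correct -> 75.0
per_group = accuracy_by_group([
    ("natural science", "A", "A"),
    ("natural science", "B", "A"),
    ("social science", "C", "C"),
])
```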
Key Results
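| Benchmark | Metric | Baseline | This Paper | Δ |
|-----------|--------|----------|------------|---|
| ScienceQA | Accuracy (%) | 91.68 | 96.18 | +4.50 |
| ScienceQA | Accuracy (%) | 88.40 (human performance) | 96.18 | +7.78 |
| ScienceQA | Accuracy (%) | 86.54 (best GPT-4 based few-shot baseline) | 96.18 | +9.64 |
| ScienceQA | Accuracy (%) | 94.24 | 96.18 | +1.94 |
| ScienceQA | Accuracy (%) | 94.94 | 96.18 | +1.24 |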
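The Δ values in the results above are absolute differences in accuracy, measured in percentage points rather than as relative improvements. The short check below recomputes each gap from the reported numbers. The variable names are assumptions of this sketch, and the relative gain at the end is derived here for contrast; it is not a figure reported in the paper.

```python
# (baseline accuracy, reported delta) pairs copied from the results above.
T_SCIQ_ACC = 96.18
rows = [(91.68, 4.50), (88.40, 7.78), (86.54, 9.64), (94.24, 1.94), (94.94, 1.24)]

recomputed = []
for baseline, reported in rows:
    # Absolute gap in percentage points, rounded to the table's precision.
    delta = round(T_SCIQ_ACC - baseline, 2)
    recomputed.append(delta)
    assert delta == reported, (baseline, delta, reported)

# A relative improvement is a different quantity, e.g. against the 88.40 baseline.
relative_gain = round(100 * (T_SCIQ_ACC - 88.40) / 88.40, 2)
```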
Main Takeaways
Mixing simple CoT and complex Plan-based CoT (PCoT) signals yields better performance than using either alone.
LLM-generated rationales can be more effective for teaching than human-annotated rationales, likely due to better coverage of external knowledge.
The student model successfully learns to generalize reasoning capabilities from the teacher, outperforming the teacher on the specific benchmark.
Improvements are consistent across question classes (Natural Science, Social Science, Language Science) and modalities (Text, Image, No Context).
📚 Prerequisite Knowledge
Prerequisites
Chain-of-Thought (CoT) prompting
Knowledge Distillation / Student-Teacher training
Multimodal Transformers (Vision + Language)
Zero-shot prompting
Key Terms
CoT: Chain-of-Thought, a prompting method that encourages models to generate intermediate reasoning steps before the final answer
PCoT: Plan-based Chain-of-Thought, a reasoning approach where the model first generates a lecture and a plan to decompose a complex problem before solving it
ScienceQA: A large-scale multimodal dataset for science question answering containing questions with images, contexts, and lectures
Knowledge Distillation: A process where a large, capable 'teacher' model generates data to train a smaller, more efficient 'student' model
Zero-shot prompting: Asking a model to perform a task without providing any specific training examples in the prompt