Chart-based Reasoning: Transferring Capabilities from LLMs to VLMs

📝 Paper Summary

Vision-Language Models (VLMs) | Chart Understanding | Visual Question Answering (VQA)

ChartPaLI-5B transfers reasoning capabilities from large language models to smaller vision-language models via a chart-to-table pre-training mixture and fine-tuning on synthetic reasoning traces generated by LLMs.

Core Problem

Compared with their larger counterparts, small Vision-Language Models (VLMs) often lack the complex reasoning capabilities that chart understanding requires, such as performing arithmetic or extracting implicit information.

Why it matters:

Current models fail to contextually combine image and text representations effectively for complex queries involving multiple reasoning steps.
Transferring reasoning from large to small models reduces serving costs while maintaining or improving task performance.
Existing small models like PaLI-3 fall behind larger models (e.g., PaLI-X) on benchmarks like ChartQA due to limited reasoning skills.

Concrete Example: When asked 'What is the sum of values for 2020 and 2021?' on a bar chart, a standard small VLM might just retrieve one value or hallucinate, whereas the proposed method generates a reasoning trace (lookup 2020 value, lookup 2021 value, add them) to derive the correct answer.
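The lookup-lookup-add trace in this example can be sketched in a few lines of Python. This is only an illustration of what such a rationale encodes, not the model's actual output format; the table values and the helper name sum_with_trace are made up for the sketch.

```python
# Hypothetical table extracted from a bar chart; the values are invented for illustration.
chart_table = {"2019": 12.0, "2020": 15.5, "2021": 18.0}

def sum_with_trace(table, keys):
    """Return (answer, trace), where trace lists each lookup and the final addition."""
    trace = []
    values = []
    for key in keys:
        value = table[key]
        trace.append(f"Lookup {key}: {value}")
        values.append(value)
    total = sum(values)
    trace.append("Add: " + " + ".join(str(v) for v in values) + f" = {total}")
    return total, trace

answer, trace = sum_with_trace(chart_table, ["2020", "2021"])
print("\n".join(trace))
print("Answer:", answer)
```

Each trace line corresponds to one reasoning step, which is the kind of intermediate structure the synthetic rationales are meant to teach.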

Key Novelty

ChartPaLI-5B (Reasoning Transfer Recipe)

Continues pre-training the vision backbone on a diverse 'chart-to-table' mixture to learn better internal structural representations of charts.
Augments training data by 20x using an LLM (PaLM 2) to synthesize reasoning traces (rationales) and additional question-answer pairs derived from the tabular representation of charts.
Fine-tunes using a multi-task loss that treats rationale generation and answer prediction as separate tasks trained jointly, with a weight that balances their importance.

Architecture

The conceptual flow of the pre-training and fine-tuning recipe.

Evaluation Highlights

Obtains State-of-the-Art (SoTA) on ChartQA among models <10B parameters, outperforming the 55B parameter PaLI-X.
Achieves 81.3% accuracy on ChartQA, surpassing GPT-4V (78.5%) and Gemini Ultra (80.8%) when using Program-of-Thought (PoT) refinement.
Outperforms the PaLI-3 baseline by ~10% absolute accuracy on ChartQA through the proposed pre-training and fine-tuning recipe.

Breakthrough Assessment

8/10

Significant efficiency breakthrough: achieves SoTA on a complex reasoning benchmark using a 5B model, outperforming 10x larger models and proprietary giants like GPT-4V, largely due to data quality and multi-task transfer learning.

⚙️ Technical Details

Problem Definition

Setting: Visual Question Answering on Charts (Chart VQA)

Inputs: A chart image I and a natural language question Q

Outputs: A natural language answer A (and optionally a rationale R)

Pipeline Flow

Input Processing (Image + Text)
Encoder-Decoder Processing
Output Generation (Answer + Rationale)

System Modules

Vision Backbone

Encodes the chart image into visual embeddings

Model or implementation: ViT-G/14 (2B parameters) initialized from SigLIP

Language Backbone

Processes text input and visual embeddings to generate text output

Model or implementation: UL2-3B (Encoder-Decoder)

Multi-task Head (Logical)

Differentiates between tasks using text prefixes ('Rationale:', 'Question:') to generate either reasoning traces or answers

Model or implementation: Part of UL2 Decoder
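As a rough sketch of how prefix-based task selection could look on the data side, the snippet below turns one annotated sample into two text examples that share the same chart image. Only the two prefixes come from the description above; the TextExample type, the function name, and the exact prompt layout are assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass
class TextExample:
    prompt: str  # text input that accompanies the chart image
    target: str  # text the decoder is trained to produce

def build_multitask_examples(question, answer, rationale):
    """Create one answer example and one rationale example from a single sample.

    The prompt layout (prefix followed by the question) is an assumption.
    """
    answer_example = TextExample(prompt=f"Question: {question}", target=answer)
    rationale_example = TextExample(prompt=f"Rationale: {question}", target=rationale)
    return [answer_example, rationale_example]

examples = build_multitask_examples(
    question="What is the sum of values for 2020 and 2021?",
    answer="33.5",
    rationale="The 2020 value is 15.5 and the 2021 value is 18.0. 15.5 + 18.0 = 33.5.",
)
for example in examples:
    print(example.prompt, "->", example.target)
```

Because the task is selected by the prefix alone, inference can use only the answer prefix and skip rationale decoding entirely.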

Novel Architectural Elements

Integration of a 'chart derendering' pre-training mixture specifically to improve the vision backbone's structural understanding of charts before downstream fine-tuning
Multi-task fine-tuning architecture where rationale generation and QA are treated as independent tasks with a weighted loss balance (lambda)
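To make the chart derendering objective concrete: the model sees a chart image and must emit its underlying table as text. The sketch below shows one possible way to linearize a table into such a target string. The " | " cell separator and newline row separator are assumptions for illustration, not the serialization used in the paper.

```python
def table_to_target(header, rows):
    """Linearize a table into a text target for a chart-to-table task.

    Cells are joined with " | " and rows with newlines (an assumed format).
    """
    lines = [" | ".join(header)]
    for row in rows:
        lines.append(" | ".join(str(cell) for cell in row))
    return "\n".join(lines)

# Hypothetical table behind a two-bar chart.
target = table_to_target(["Year", "Value"], [[2020, 15.5], [2021, 18.0]])
print(target)
```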

Modeling

Base Model: PaLI-3 (ViT-2B vision + UL2-3B language)

Training Method: Two-stage training: (1) Continued Pre-training on Chart2Table, (2) Multi-task Fine-tuning

Objective Functions:

Purpose: Balance the learning of answering questions and generating rationales.

Formally: Loss = (1 - lambda) * Loss_ans + lambda * Loss_rat
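The weighted objective can be written as a small function. This is a minimal sketch in which the two per-task losses are plain floats; in real training they would be the token-level cross-entropy values for the answer and rationale examples. The function name is hypothetical.

```python
def multitask_loss(loss_ans, loss_rat, lam=0.5):
    """Combine answer and rationale losses as (1 - lam) * loss_ans + lam * loss_rat."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return (1.0 - lam) * loss_ans + lam * loss_rat

print(multitask_loss(2.0, 4.0, lam=0.5))  # equal weighting of both tasks
```

Setting lam to 0 recovers answer-only training, and setting it to 1 trains only on rationales.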

Training Data:

Continued Pre-training: Mixture of synthetic chart-to-table, Masry et al. mixture, DVQA, TaTA, Benetech (total mixture weights defined in Table 3)
Fine-tuning: ChartQA (Human & Augmented) + Synthetic Data (ChartQA-Rationale, ChartQA-ExtraQAR, ArithmeticQA)
Synthetic data generated via PaLM 2-S/L using 4-shot (rationales) and 1-shot (extra QA) prompts
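A few-shot prompt for rationale synthesis might be assembled roughly as follows. The template, field names, and function name are hypothetical; the paper's actual prompts are the ones it shows in its figures. The sketch only illustrates the idea of conditioning the LLM on the chart's table, the question, and the known answer, then asking it to complete a rationale.

```python
def build_few_shot_prompt(exemplars, table_text, question, answer):
    """Concatenate worked exemplars and end with an open 'Rationale:' slot.

    Each exemplar is a dict with 'table', 'question', 'answer', and 'rationale'
    keys (an assumed schema).
    """
    parts = []
    for ex in exemplars:
        parts.append(
            f"Table:\n{ex['table']}\n"
            f"Question: {ex['question']}\n"
            f"Answer: {ex['answer']}\n"
            f"Rationale: {ex['rationale']}\n"
        )
    # The final block leaves the rationale empty for the LLM to complete.
    parts.append(
        f"Table:\n{table_text}\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        f"Rationale:"
    )
    return "\n".join(parts)
```

With four exemplars this yields a 4-shot prompt; the LLM's completion would then be stored as the synthetic rationale for that question.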

Key Hyperparameters:

pre_training_steps: 6000
pre_training_batch_size: 256
pre_training_learning_rate: 5e-3
fine_tuning_steps: 10000
fine_tuning_batch_size: 128
fine_tuning_learning_rate: 1e-3
loss_lambda: 0.5 (default/best per ablation)
resolution: 812x812

Compute: Not reported in the paper

Comparison to Prior Work

vs. PaLI-3: Adds chart-specific pre-training and synthetic reasoning data fine-tuning
vs. PaLI-X: Achieves better performance with 10x fewer parameters through specialized data
vs. MatCha: Uses a more extensive chart-to-table mixture and multi-task reasoning fine-tuning
vs. UniChart: Uses a UL2 backbone instead of BART and incorporates explicit rationale generation tasks
vs. Gemini Ultra: Outperforms on ChartQA when combined with Program-of-Thought, despite being significantly smaller

Limitations

Relies on gold tables for synthetic data generation; errors in inferred tables for datasets like Pew could propagate noise, although the authors claim the method is resilient to it.
No verification step for hallucinations or fluency in the synthetic ExtraQA dataset.
The method requires training on specific synthetic datasets which must be generated by a larger, more capable LLM (distillation dependency).
The improvements are demonstrated primarily on ChartQA, with limited exploration of other multimodal domains.

Reproducibility

No code is released. Synthetic data generation prompts (Figures 4 and 5) and templates are described. The base model (PaLI-3) architecture is described in prior work (Chen et al. 2023c), but weights for ChartPaLI-5B are not explicitly linked.

📊 Experiments & Results

Evaluation Setup

Visual Question Answering on Chart images

Benchmarks:

ChartQA (complex reasoning on charts; Human and Augmented sets)
PlotQA (Chart QA)
FigureQA (Chart QA)

Metrics:

Accuracy (ChartQA is usually scored with relaxed accuracy)
Statistical methodology: Not explicitly reported in the paper
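Relaxed accuracy on ChartQA is commonly defined as exact match for textual answers and a 5% relative tolerance for numeric answers. The sketch below follows that common definition. The normalization details (stripping "%" and commas, case-insensitive text comparison) are assumptions, and this is not the official evaluation script.

```python
def _to_float(text):
    """Parse a numeric answer, tolerating a trailing '%' and thousands separators."""
    try:
        return float(text.strip().rstrip("%").replace(",", ""))
    except ValueError:
        return None

def relaxed_match(prediction, target, tolerance=0.05):
    """Numeric answers match within a relative tolerance; text must match exactly."""
    pred_num, target_num = _to_float(prediction), _to_float(target)
    if pred_num is not None and target_num is not None:
        if target_num == 0.0:
            return pred_num == 0.0
        return abs(pred_num - target_num) / abs(target_num) <= tolerance
    return prediction.strip().lower() == target.strip().lower()

def relaxed_accuracy(predictions, targets):
    """Fraction of predictions that satisfy relaxed_match against their targets."""
    matches = [relaxed_match(p, t) for p, t in zip(predictions, targets)]
    return sum(matches) / len(matches)

print(relaxed_accuracy(["104", "red"], ["100", "blue"]))
```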

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
ChartPaLI-5B achieves state-of-the-art results on ChartQA, significantly outperforming its base model and larger competitors.
ChartQA	Accuracy	66.0	81.3	+15.3
ChartQA	Accuracy	70.9	72.0	+1.1
ChartQA	Accuracy	80.8	81.3	+0.5
The model demonstrates strong transfer performance on other chart benchmarks.
PlotQA	Accuracy	75.7	79.0	+3.3
FigureQA	Accuracy	90.7	94.6	+3.9
Ablation studies confirm the value of the pre-training mixture and synthetic data strategies.
ChartQA	Accuracy	66.6	68.6	+2.0
ChartQA	Accuracy	68.6	72.0	+3.4

Experiment Figures

Performance on ChartQA validation set as a function of the multi-task loss weight lambda.

Main Takeaways

Specialized pre-training on chart-to-table translation effectively teaches the vision backbone to understand chart structure, yielding ~2% accuracy gain.
Synthesizing reasoning traces (rationales) and additional complex QA pairs using LLMs allows smaller VLMs to learn complex reasoning, providing the largest performance boost (+3.4%).
Multi-task fine-tuning (separating Answer and Rationale tasks) is superior to single-task or sequential prediction, maintaining inference speed while improving quality.
Program-of-Thought (PoT) refinement, where the model generates code to solve the query, further boosts performance to surpass top-tier proprietary models like GPT-4V.
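The Program-of-Thought step can be pictured as executing a short model-generated program and reading back its result. The sketch below assumes the generated code stores its result in a variable named answer. The program text, that variable name, and the whitelist of builtins are all illustrative assumptions, and the restricted namespace is not a real security sandbox.

```python
def run_program_of_thought(program, result_name="answer"):
    """Execute generated code with a small whitelist of builtins and return its result."""
    allowed_builtins = {
        "sum": sum, "min": min, "max": max,
        "abs": abs, "round": round, "len": len,
    }
    namespace = {"__builtins__": allowed_builtins}
    exec(program, namespace)  # illustration only; untrusted code needs real isolation
    return namespace[result_name]

# Hypothetical program a model might emit for the "sum of 2020 and 2021" question.
generated_program = """
value_2020 = 15.5
value_2021 = 18.0
answer = value_2020 + value_2021
"""

print(run_program_of_thought(generated_program))
```

Delegating the arithmetic to an interpreter removes calculation slips from the model's text output, which is the motivation usually given for this style of refinement.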

📚 Prerequisite Knowledge

Prerequisites

Transformer architectures (ViT, Encoder-Decoder)
Vision-Language Pre-training
Chain-of-Thought (CoT) prompting
Multi-task learning

Key Terms

PaLI-3: A smaller scale (5B parameter) Vision-Language Model consisting of a ViT vision backbone and a UL2 language backbone

SigLIP: Sigmoid Loss for Language Image Pre-training, a contrastive loss function used for training vision encoders

UL2: Unifying Language Learning Paradigms, a pre-training framework for language models that mixes different denoising objectives

ChartQA: A benchmark dataset for question answering on charts, containing both human-written and machine-generated questions

Rationale: A step-by-step explanation or reasoning trace generated by a model to justify an answer

Program-of-Thought (PoT): A prompting technique where the model generates executable code (like Python) to solve reasoning problems, rather than just text

Derendering: The task of translating a visual chart back into its underlying data table or code representation

Multi-task setup: Training a model to perform multiple distinct tasks (e.g., answering questions and generating rationales) simultaneously with specific prefixes

Vision Transformer (ViT): A model architecture that applies the Transformer mechanism directly to sequences of image patches

OCR: Optical Character Recognition, technology that converts images of text into machine-encoded text