Chart-RL: Generalized Chart Comprehension via Reinforcement Learning with Verifiable Rewards

📝 Paper Summary

Multimodal Reasoning Chart Understanding Reinforcement Learning from Verifiable Rewards (RLVR)

Chart-RL enhances vision-language models' chart comprehension by using reinforcement learning with mathematically verifiable rewards on complex reasoning tasks, achieving robust generalization without large-scale supervision.

Core Problem

Existing Vision-Language Models (VLMs) struggle with multi-step reasoning on charts because Supervised Fine-Tuning (SFT) often leads to overfitting on specific templates and fails to generalize to unseen chart types or complex queries.

Why it matters:

Charts compress dense information into diverse visual structures (bars, pies, plots), making them fundamentally harder than natural images for standard VLMs to interpret reliably.
SFT methods frequently suffer from catastrophic forgetting and poor transferability, meaning a model trained on bar charts might fail on scatter plots or complex math questions.
Current approaches rely on massive curated datasets which are costly to produce and may still miss real-world complexity.

Concrete Example: When asked a multi-step question like 'What is the ratio of the highest value in 2020 to the lowest in 2021?', an SFT-trained model might extract one number correctly but fail the arithmetic or the comparison, whereas Chart-RL learns the full reasoning path via reward feedback.

Key Novelty

Reinforcement Learning with Verifiable Rewards (RLVR) for Charts

Treats chart QA as a reasoning task with deterministic answers (e.g., numerical values), allowing the use of rule-based accuracy rewards instead of human preference labels.
Uses Group Relative Policy Optimization (GRPO) to encourage the model to explore reasoning paths that lead to the correct mathematical answer, rather than just imitating training text.
Demonstrates that training on a small set of complex, multi-step reasoning tasks (Hard Task) transfers better to general chart understanding than training on thousands of simple extraction tasks.

Architecture

The Chart-RL training framework using GRPO with verifiable rewards.

Evaluation Highlights

+16.7% relative improvement on MultiChartQA compared to Supervised Fine-Tuning (SFT) using the Qwen2.5-VL-3B-Instruct baseline.
+11.5% relative improvement on ChartInsights compared to SFT.
Achieves strong performance with only 10 complex training examples, significantly outperforming models trained on 6,000+ simple examples.

Breakthrough Assessment

8/10

Significant for demonstrating that RLVR (popular in LLM math) works effectively for VLM chart reasoning. The finding that task complexity outweighs data quantity (10 hard vs 6000 easy samples) is a strong efficiency result.

⚙️ Technical Details

Problem Definition

Setting: Chart Question Answering where the answer is mathematically deterministic.

Inputs: A chart image and a natural language query q.

Outputs: A structured response containing a reasoning trace (<thinking>...</thinking>) and a final answer (<answer>...</answer>).

Pipeline Flow

Input Processing (Image + Query)
VLM Inference (Policy Sampling)
Reward Calculation (Accuracy + Format)
Policy Update (GRPO)

System Modules

VLM Backbone

Generates reasoning traces and answers based on chart inputs

Model or implementation: Qwen2.5-VL-3B-Instruct (base model)

Reward Function: Accuracy (Training / Feedback)

Evaluates if the predicted numerical answer matches the ground truth within a tolerance threshold

Model or implementation: Rule-based function

Reward Function: Format (Training / Feedback)

Enforces strict output structure (thinking tags + JSON answer tags)

Model or implementation: Regex/Rule-based checker

Novel Architectural Elements

Integration of verifiable accuracy rewards specifically for visual chart reasoning within a GRPO framework, distinct from general VLM RL.

Modeling

Base Model: Qwen2.5-VL-3B-Instruct

Training Method: Group Relative Policy Optimization (GRPO)

Objective Functions:

Purpose: Maximize the likelihood of high-advantage responses.

Formally: Policy gradient objective utilizing advantage estimates from group sampling.
Purpose: Accuracy Reward.

Formally: S(v_p, v_g) based on relative error |v_p - v_g| / |v_g| compared to threshold tau.
Purpose: Format Reward.

Formally: Binary 1.0 if output matches template (<thinking>...</thinking><answer>...</answer>), 0 otherwise.

Adaptation: LoRA (rank=64, alpha=128)

Trainable Parameters: LoRA adapters + Vision modules (unfrozen)

Training Data:

Easy Task: 6,200 charts from PlotQA (simple extraction)
Hard Task: 448 charts from CharXiv (multi-step reasoning)

Key Hyperparameters:

lora_rank: 64
lora_alpha: 128
dropout: 0.05
+ 4 more
per_device_batch_size: 1
gradient_accumulation_steps: 4
num_generations_per_prompt: 8
gpus: 8 NVIDIA H100

Comparison to Prior Work

vs. SFT/CoT-SFT: Chart-RL uses RL signals (outcome verification) rather than just imitating traces, leading to better robustness and generalization.
vs. VLM-R1: Chart-RL adapts the R1 paradigm specifically for Chart QA with numerical accuracy rewards, whereas VLM-R1 focused on natural images (REC/OVD).
vs. Pix2Struct/UniChart [cited in paper]: Chart-RL uses a post-training RL stage on a VLM backbone, rather than pre-training/SFT specifically for chart-to-text.

Limitations

Minor performance degradation observed in specific transformations like log scales and scaling perturbations.
Requires problems with mathematically verifiable ground truths (not applicable to open-ended qualitative chart interpretation).
Dependence on a strong base model (Qwen2.5-VL) to generate initial reasonable outputs for RL exploration.

Reproducibility

Code availability is not explicitly provided in the text. Training data subsets (PlotQA, CharXiv) are from public datasets. Prompts for RL training are described. The base model Qwen2.5-VL-3B-Instruct is open weights.

📊 Experiments & Results

Evaluation Setup

Chart Question Answering across diverse benchmarks involving extraction, reasoning, and robustness checks.

Benchmarks:

MultiChartQA (Multi-hop reasoning across multiple charts)
ChartInsights (Fine-grained analytics across 7 chart types)
RobustCQA (Robustness to visual perturbations)
MathVerse (Visual mathematical problem solving (Out-of-Domain))

Metrics:

Accuracy (Exact Match or Relaxed Accuracy depending on dataset)
Statistical methodology: Reported statistical significance at alpha=0.05 for main results.

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Main comparison results demonstrating Chart-RL superiority over SFT baselines on chart comprehension benchmarks.
MultiChartQA	Accuracy	44.1	58.1	+14.0
ChartInsights	Accuracy	48.2	53.7	+5.5
Robustness analysis shows Chart-RL improves consistency across visual perturbations.
RobustCQA	Category Improvement Count	2	18	+16
Out-of-domain generalization to visual math problems.
MathVerse	Accuracy	28.8	44.8	+16.0

Experiment Figures

Training dynamics (reward curves) comparing different data scales (10, 100, 448 samples).

Main Takeaways

Task complexity is more critical than data quantity: training on 10 complex examples outperformed training on 6,000 simple examples.
Chart-RL improves robustness to visual variations (layout, style) better than SFT.
RL training on charts facilitates transfer to out-of-domain tasks like visual mathematics (MathVerse), suggesting learned reasoning skills are generalizable.
SFT often leads to regression compared to the baseline VLM on complex chart tasks, whereas RL consistently improves performance.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (RL) basics
Vision-Language Models (VLMs)
Supervised Fine-Tuning (SFT)

Key Terms

RLVR: Reinforcement Learning with Verifiable Rewards—using objective correctness (like math answers) as the reward signal rather than a learned reward model.

GRPO: Group Relative Policy Optimization—an RL algorithm that estimates advantages by comparing a group of outputs for the same input, removing the need for a separate critic model.

SFT: Supervised Fine-Tuning—training a model on input-output pairs to mimic the desired behavior.

CoT: Chain-of-Thought—a prompting technique where the model generates intermediate reasoning steps before the final answer.

LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning technique that freezes the main model weights and trains small adapter matrices.

PEFT: Parameter-Efficient Fine-Tuning—methods like LoRA that reduce the computational cost of fine-tuning large models.

VLM: Vision-Language Model—a model capable of processing and understanding both images and text.