TRivia: Self-supervised Fine-tuning of Vision-Language Models for Table Recognition

📝 Paper Summary

Table Recognition (TR), Document Parsing, Self-Supervised Learning

TRivia improves table recognition models by fine-tuning vision-language models on unlabeled table images using reinforcement learning, where rewards are derived from the model's ability to answer synthesized questions about the table.

Core Problem

State-of-the-art table recognition relies on massive labeled datasets or proprietary APIs (like Gemini 2.5 Pro) that are costly, privacy-invasive, and unavailable for open-source training.

Why it matters:

High-quality labeled table data (image-HTML pairs) is expensive and time-consuming to curate at scale
Open-source models lag significantly behind proprietary models due to data scarcity
Distilling from proprietary models is costly, violates service agreements, and caps performance at the teacher's level

Concrete Example: When an open-source model like UniTable processes a complex real-world table, it often fails because of its limited input resolution (448x448) and insufficient training data. TRivia-3B can handle such tables because it learns from unlabeled in-the-wild data via QA feedback.

Key Novelty

Self-Supervised TR via QA-based Reinforcement Learning

Instead of needing ground-truth HTML labels, the model is rewarded if its table recognition output allows an external LLM to correctly answer questions about the table
Uses a 'response-consistency' sampling strategy to select only the most informative unlabeled images (where the model is uncertain) for training
Generates synthetic questions (QA pairs) using an attention-guided mechanism to ensure questions cover diverse parts of the table and are visually grounded

Architecture

The TRivia framework overview, showing the data preparation stage (Question Generation) and the in-training stage (GRPO with QA rewards).

Evaluation Highlights

Surpasses Gemini 2.5 Pro and GPT-5 on the CC-OCR benchmark (84.15 vs 79.46 TEDS)
Outperforms MinerU2.5 (a 26B parameter model trained on millions of samples) using only a 3B parameter model
Achieves 86.85 TEDS on OmniDocBench, beating Qwen2.5-VL-72B (81.65) despite being significantly smaller

Breakthrough Assessment

9/10

Demonstrates that a small 3B model can beat massive proprietary models (Gemini, GPT-5) on specialized tasks using only unlabeled data and self-supervision, a significant shift from the supervised/distillation paradigm.

⚙️ Technical Details

Problem Definition

Setting: Self-supervised fine-tuning of a Vision-Language Model (VLM) for Table Recognition (TR)

Inputs: Unlabeled table image I

Outputs: Semi-structured text representation of the table (e.g., HTML, OTSL)

Pipeline Flow

Table Image → TR Model (Policy) → [Multiple Recognition Hypotheses]
Hypotheses + Generated Questions → LLM (Answerer) → Predicted Answers
Predicted Answers vs Ground Truth Answers → Reward Calculation (F1)
Reward → GRPO Update → TR Model

System Modules

TR Model (Policy)

Generate structured table representations (OTSL) from images

Model or implementation: Qwen2.5-VL-3B-Instruct

QA Generation (Data Engine)

Generate diverse, verifiable questions from unlabeled images to serve as supervision

Model or implementation: Qwen2.5-VL-72B-Instruct

LLM (Answerer)

Answer the generated questions using the TR model's output as context

Model or implementation: Qwen3-8B

Novel Architectural Elements

Closed-loop self-supervised pipeline where the reward signal comes purely from a downstream proxy task (QA) rather than ground truth labels
Attention-guided QA generation module that filters synthetic questions based on visual token overlap to ensure diverse table coverage
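
The attention-guided filtering idea can be illustrated with a minimal sketch. It assumes each candidate question comes with one attention weight per image token, that a question's visual source is the set of tokens whose weight exceeds tau_A, and that a question is dropped when its visual source overlaps an already kept one with IoU above tau_IOU. The function names, the greedy order, and the direction of the IoU test are illustrative assumptions, not the paper's implementation; only the two threshold values come from the summary.

```python
def visual_source(attention: list[float], tau_a: float = 0.01) -> set[int]:
    """Indices of image tokens whose attention weight exceeds tau_a."""
    return {i for i, w in enumerate(attention) if w > tau_a}


def iou(a: set[int], b: set[int]) -> float:
    """Intersection over union of two token-index sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def filter_questions(
    candidates: list[tuple[str, list[float]]],
    tau_a: float = 0.01,
    tau_iou: float = 0.3,
) -> list[str]:
    """Greedily keep questions that are grounded and cover new table regions."""
    kept: list[tuple[str, set[int]]] = []
    for question, attention in candidates:
        source = visual_source(attention, tau_a)
        if not source:
            continue  # no image token attended: treat as not visually grounded
        # Assumption: a high IoU with a kept question means redundant coverage.
        if all(iou(source, prev) <= tau_iou for _, prev in kept):
            kept.append((question, source))
    return [q for q, _ in kept]
```

With four image tokens, a question attending to tokens 0 and 1 and another attending to tokens 2 and 3 would both be kept, while a third question attending to tokens 0, 1, and 2 would be dropped as redundant with the first.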

Modeling

Base Model: Qwen2.5-VL-3B-Instruct

Training Method: Group Relative Policy Optimization (GRPO)
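
As a rough illustration of the group-relative part of GRPO, the sketch below normalizes the rewards of several recognition hypotheses sampled for the same image by the group's mean and standard deviation. The function name, the use of the population standard deviation, and the eps value are assumptions made for this example; the policy-gradient update itself is omitted.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Advantage of each hypothesis relative to its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations vary
    # eps avoids division by zero when all hypotheses get the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]
```

When every hypothesis in a group receives the same reward, all advantages are zero and the group contributes no learning signal, which is one reason to prefer images on which the model's outputs disagree.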

Objective Functions:

Purpose: Optimize the TR model to maximize the accuracy of answers derived from its outputs.

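Formally: Maximize the expected reward r = Avg_QA(F1(Answer_pred, Answer_gt)), i.e. the F1 score between predicted and ground-truth answers, averaged over the QA pairs of an image.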
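
A minimal sketch of such a reward is shown below. It assumes a token-level F1 with simple lowercase word tokenization; the summary only states that the reward is an F1 averaged over QA pairs, so the tokenization and the helper names answer_f1 and qa_reward are hypothetical.

```python
import re
from collections import Counter


def _tokens(text: str) -> list[str]:
    """Lowercase word tokens (an assumed normalization)."""
    return re.findall(r"\w+", text.lower())


def answer_f1(pred: str, gold: str) -> float:
    """Token-level F1 between a predicted and a reference answer."""
    p, g = _tokens(pred), _tokens(gold)
    if not p or not g:
        return float(p == g)  # both empty counts as a match
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(p)
    recall = overlap / len(g)
    return 2 * precision * recall / (precision + recall)


def qa_reward(predicted: list[str], gold: list[str]) -> float:
    """Average F1 over all QA pairs attached to one table image."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold answers must align one to one")
    if not gold:
        return 0.0
    return sum(answer_f1(p, g) for p, g in zip(predicted, gold)) / len(gold)
```

In the pipeline described above, predicted would hold the answerer LLM's responses given one recognition hypothesis as context, so a hypothesis that drops or misplaces cells tends to score lower.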

Adaptation: Full fine-tuning

Training Data:

Stage 3 (RL): ~50K informative unlabeled images selected from 100K web PDFs using response-consistency sampling (TEDS score range 0.4-1.0)
QA Data: ~30 diverse QA pairs per image generated by Qwen2.5-VL-72B
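
Response-consistency sampling can be sketched as follows, under several assumptions: consistency is the mean pairwise similarity between multiple recognition outputs sampled for the same image, a character-level ratio from difflib stands in for TEDS, and the upper bound of the 0.4 to 1.0 range is treated as exclusive so that fully consistent images are skipped. None of these details are confirmed by the summary, and the function names are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """Stand-in for TEDS: character-level similarity in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


def consistency(responses: list[str]) -> float:
    """Mean pairwise similarity among outputs sampled for one image."""
    pairs = list(combinations(responses, 2))
    if not pairs:
        raise ValueError("need at least two responses per image")
    return sum(similarity(a, b) for a, b in pairs) / len(pairs)


def select_informative(
    samples: dict[str, list[str]], low: float = 0.4, high: float = 1.0
) -> list[str]:
    """Keep image ids whose consistency falls in [low, high)."""
    # Assumption: very low scores suggest unusable images, and a score of
    # exactly `high` means the model already agrees with itself.
    return [
        image_id
        for image_id, responses in samples.items()
        if low <= consistency(responses) < high
    ]
```

The intent is to spend RL compute on images where sampled outputs partly disagree, since those are the cases where a reward can separate better hypotheses from worse ones.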

Key Hyperparameters:

attention_threshold_tau_A: 0.01
iou_threshold_tau_IOU: 0.3
sampling_consistency_range: 0.4–1.0
QA_generation_layer: 72

Compute: Not reported in the paper

Comparison to Prior Work

vs. MinerU2.5: TRivia uses roughly 9x fewer parameters (3B vs ~26B) and relies on self-supervision rather than distillation from proprietary models
vs. Gemini 2.5 Pro: TRivia is open-source and capable of running locally while outperforming Gemini on specialized benchmarks
vs. Synthetic Data methods (e.g., MonkeyOCR): TRivia learns from real-world unlabeled images directly, avoiding the domain gap of synthetic rendering

Limitations

Reliance on the quality of the QA generation model (teacher VLM)
Computational cost of generating QA pairs for large-scale unlabeled data
Potential instability if the reward model (LLM answerer) hallucinates answers
Performance depends on the diversity of the unlabeled image pool

Reproducibility

Code: https://github.com/opendatalab/TRivia

Model weights and code are publicly released. The unlabeled data curation process is described in detail.

📊 Experiments & Results

Evaluation Setup

Table Recognition (TR) evaluated by converting images to structured text and comparing to ground truth.

Benchmarks:

OmniDocBench v1.5 (Digital PDF table recognition)
CC-OCR (Scanned/photographed diverse tables)
OCRBench v2 (Table parsing subset)

Metrics:

TEDS (Tree Edit Distance-based Similarity)
S-TEDS (Structure-only TEDS)
Statistical methodology: Not explicitly reported in the paper

Key Results

TRivia-3B outperforms both open-source and proprietary models on challenging real-world benchmarks.
Benchmark	Metric	Baseline	This Paper	Δ
CC-OCR	TEDS	79.46	84.15	+4.69
CC-OCR	TEDS	77.53	84.15	+6.62
OmniDocBench	TEDS	81.65	86.85	+5.20
OCRBench v2	TEDS	57.24	80.27	+23.03

Ablation studies confirm the value of attention-guided QA generation and filtering.
Benchmark	Metric	Baseline	This Paper	Δ
CC-OCR	TEDS	81.63	84.15	+2.52

Experiment Figures

Radar chart comparing TRivia-3B against SOTA models (Gemini 2.5 Pro, GPT-5, MinerU2.5) across three benchmarks.

Main Takeaways

Self-supervised learning on unlabeled data is highly effective for Table Recognition, surpassing purely supervised methods.
The proxy task of 'answering questions about a table' provides a sufficient reward signal to learn complex structural parsing.
Selecting informative samples (where the model is uncertain) and ensuring diverse QA coverage (via attention) are critical for efficiency.
Small specialized models (3B) can outperform massive generalist models (72B, proprietary) when fine-tuned with high-quality, targeted self-supervision.

📚 Prerequisite Knowledge

Prerequisites

Vision-Language Models (VLMs)
Reinforcement Learning (specifically GRPO)
Table Recognition metrics (TEDS)
Visual Attention mechanisms

Key Terms

TR: Table Recognition—converting table images into structured text like HTML or Markdown

GRPO: Group Relative Policy Optimization—a reinforcement learning algorithm that optimizes a policy by comparing a group of outputs for the same input rather than using a critic model

TEDS: Tree Edit Distance-based Similarity, a metric for evaluating table recognition that measures the similarity between the predicted and ground-truth HTML trees, covering both table structure and cell content (S-TEDS is the structure-only variant)
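
For reference, TEDS is commonly defined as below, where T_a and T_b are the predicted and ground-truth HTML trees, EditDist is the tree edit distance between them, and |T| is the number of nodes in a tree. The symbol names are chosen here for illustration.

```latex
\mathrm{TEDS}(T_a, T_b) = 1 - \frac{\mathrm{EditDist}(T_a, T_b)}{\max\left(|T_a|, |T_b|\right)}
```

The score lies in [0, 1], with 1 meaning the two trees are identical. The benchmark figures in this summary are reported on a 0 to 100 scale.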

OTSL: Optimized Table Structure Language—a compact tag-based format for representing tables that encodes adjacency rather than verbose colspan/rowspan attributes
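
To make the adjacency idea concrete, here is a small sketch that expands an OTSL token stream into HTML with explicit spans. It assumes the five-token vocabulary from the original OTSL proposal (C for a cell, L for "merge with the cell to the left", U for "merge with the cell above", X for the interior of a two-dimensional span, NL for end of row) and omits cell text. The exact vocabulary and serialization used by TRivia may differ, and otsl_to_html is a hypothetical helper.

```python
def otsl_to_html(tokens: list[str]) -> str:
    """Convert a flat OTSL token list into structure-only HTML."""
    # Split the flat token stream into rows at each NL token.
    rows: list[list[str]] = []
    row: list[str] = []
    for tok in tokens:
        if tok == "NL":
            rows.append(row)
            row = []
        else:
            row.append(tok)
    if row:
        rows.append(row)

    html = ["<table>"]
    for r, cells in enumerate(rows):
        html.append("<tr>")
        for c, tok in enumerate(cells):
            if tok != "C":
                continue  # L, U, and X positions are covered by a spanning cell
            # Count L tokens to the right to recover colspan.
            colspan = 1
            while c + colspan < len(cells) and cells[c + colspan] == "L":
                colspan += 1
            # Count U tokens below in the same column to recover rowspan.
            rowspan = 1
            while (
                r + rowspan < len(rows)
                and c < len(rows[r + rowspan])
                and rows[r + rowspan][c] == "U"
            ):
                rowspan += 1
            attrs = ""
            if colspan > 1:
                attrs += f' colspan="{colspan}"'
            if rowspan > 1:
                attrs += f' rowspan="{rowspan}"'
            html.append(f"<td{attrs}></td>")
        html.append("</tr>")
    html.append("</table>")
    return "".join(html)
```

For example, the token list C, L, NL, C, C, NL describes a header cell spanning two columns above a row of two ordinary cells, with no colspan attribute appearing in the OTSL form itself.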

Visual Source (VS): The set of image tokens that an attention mechanism focuses on when generating a specific text token, used here to verify if a question is grounded in the image

QA: Question Answering—used here as a proxy task to verify if the recognized table structure is accurate enough to answer questions

VLM: Vision-Language Model—a model capable of processing both images and text inputs