TRINS: Towards Multimodal Language Models that Can Read

📝 Paper Summary

Visual Instruction Tuning, Text-Rich Image Understanding, Optical Character Recognition (OCR)

TRINS is a large-scale instruction-tuning dataset for text-rich images created via semi-automatic annotation, paired with a new model architecture (LaRA) that explicitly incorporates OCR tokens.

Core Problem

Existing large multimodal models struggle to comprehend text embedded in images (such as posters or book covers), both because high-quality text-rich training data is scarce and because their visual encoders operate at low resolution.

Why it matters:

Current datasets like COCO and Conceptual Captions focus on natural images, leaving models unable to read or reason about text in real-world scenarios
Low-resolution visual encoders (e.g., CLIP) often fail to resolve small text, creating a bottleneck for document understanding tasks
Existing OCR-based VQA datasets lack the detailed instruction-following annotations needed to train general-purpose multimodal assistants

Concrete Example: When asked to summarize a book cover, standard multimodal models might describe the visual layout but fail to read the title or author correctly, whereas models trained on TRINS can extract and reason about the text.

Key Novelty

Semi-automatic Text-Rich Dataset Construction & OCR-Augmented Architecture

Creates a dataset (TRINS) by filtering LAION for text-rich images, using human annotators for detailed captions, and prompting GPT-4 with OCR/caption data to generate complex QA pairs
Proposes LaRA (Language-vision Reading Assistant), which bypasses visual encoder resolution limits by explicitly feeding OCR-extracted text tokens directly into the LLM alongside visual features
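
The first stage of the dataset pipeline, filtering a large image pool for text-rich images, could look roughly like the sketch below. The summary does not say how TRINS decides that an image is text-rich, so the OCR word-count rule, the `Candidate` type, and the `min_words` threshold are all assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    """One candidate image plus the words an OCR tool found in it (hypothetical type)."""
    url: str
    ocr_words: List[str]


def is_text_rich(candidate: Candidate, min_words: int = 10) -> bool:
    """Keep an image only if OCR found at least `min_words` words.

    `min_words` is a hypothetical threshold, not a value from the paper.
    """
    return len(candidate.ocr_words) >= min_words


def filter_text_rich(candidates: List[Candidate], min_words: int = 10) -> List[Candidate]:
    """Return the subset of candidates that pass the word-count rule."""
    return [c for c in candidates if is_text_rich(c, min_words)]


pool = [
    Candidate("poster.jpg", ["GRAND", "OPENING"] * 6),  # 12 OCR words
    Candidate("landscape.jpg", ["SALE"]),               # 1 OCR word
]
kept = filter_text_rich(pool)
print([c.url for c in kept])
```

In the pipeline described above, images that survive a filter of this kind would then go to human annotators for captions and to GPT-4 for QA generation.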

Architecture

Overview of the LaRA model architecture.

Evaluation Highlights

+202 points on OCRBench (548 vs. 346 for the LLaVAR baseline)
Achieves 62.8% accuracy on TRINS-VQA Extract questions, a 24.0-point gain over LLaVA 1.5 (38.8%)
On the TRINS-VQA Abstract benchmark, reaches 186.6 CIDEr, compared with 23.5 for InstructBLIP and 79.4 for Qwen-VL

Breakthrough Assessment

8/10

Significant contribution to the specific sub-field of text-rich image understanding. The dataset construction pipeline is robust, and the simple OCR-injection architecture sets a strong baseline for reading-capable multimodal models.

⚙️ Technical Details

Problem Definition

Setting: Multimodal instruction tuning specifically for text-rich images (documents, posters, book covers)

Inputs: Image I containing text and a natural language instruction T_ins

Outputs: Text response T_res that answers the instruction based on visual and textual content in I

Pipeline Flow

Visual Encoding (CLIP-ViT)
OCR Text Extraction (Azure Read API / PaddleOCR)
Projection (Linear Layer)
Concatenation (Visual Tokens + OCR Tokens + Instruction)
Generation (LLM Decoder)
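
The concatenation step can be pictured with the minimal sketch below. It uses placeholder strings where a real system would use embedding vectors, and the function names are hypothetical. The patch count assumes one visual token per non-overlapping 14-pixel patch of a 336-pixel square image; whether the model also keeps a class token is not stated in this summary.

```python
def num_patch_tokens(image_size: int = 336, patch_size: int = 14) -> int:
    """Number of non-overlapping square patches a ViT sees for a square image."""
    side = image_size // patch_size
    return side * side


def build_input_sequence(visual, ocr, instruction):
    """Concatenate token groups in pipeline order: visual, then OCR, then instruction."""
    return [*visual, *ocr, *instruction]


# Placeholder tokens; real inputs would be projected embeddings and token ids.
visual = [f"<img_{i}>" for i in range(num_patch_tokens())]
ocr = "THE GREAT GATSBY F. Scott Fitzgerald".split()
instruction = "What is the title of this book?".split()

sequence = build_input_sequence(visual, ocr, instruction)
print(len(visual), len(ocr), len(instruction), len(sequence))
```

The point of the ordering is that the decoder sees the OCR text as ordinary tokens, so small text that the visual encoder cannot resolve is still available at generation time.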

System Modules

Visual Encoder (Input Processing)

Extract visual features from the input image

Model or implementation: CLIP-ViT-L/14-336

Projection Layer (Input Processing)

Map visual features into the word embedding space of the language model

Model or implementation: Linear Projection Matrix W
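
As a rough illustration of the projection, the sketch below applies a matrix W to each visual feature vector so that its output has the same width as the language model's word embeddings. The dimensions are toy values, the bias term is omitted, and the function name is hypothetical.

```python
def project(features, weight):
    """Compute H = Z W.

    `features` is a list of n rows, each of length d_vision.
    `weight` is a list of d_vision rows, each of length d_model.
    The result has n rows, each of length d_model.
    """
    columns = list(zip(*weight))  # d_model columns, each of length d_vision
    return [
        [sum(z * w for z, w in zip(row, col)) for col in columns]
        for row in features
    ]


# Two visual tokens of width 2, projected to an embedding width of 3.
Z = [[1.0, 2.0], [3.0, 4.0]]
W = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
H = project(Z, W)
print(H)
```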

OCR System (Input Processing)

Extract text strings directly from the image

Model or implementation: Azure Read API and PaddleOCR

Language Decoder

Generate the answer based on visual tokens, OCR tokens, and instruction

Model or implementation: Vicuna-1.5-13B

Novel Architectural Elements

Explicit injection of OCR-extracted text tokens into the LLM input stream alongside visual tokens to compensate for the visual encoder's inability to read small text

Modeling

Base Model: Vicuna-1.5-13B (Language Decoder), CLIP-ViT-L/14-336 (Visual Encoder)

Training Method: Visual Instruction Tuning (Supervised Fine-Tuning)

Objective Functions:

Purpose: Autoregressive language modeling objective.

Formally: Standard cross-entropy loss on the generated tokens.
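
A minimal sketch of this objective is shown below: the mean negative log-likelihood of the target tokens under the model's predicted distribution. Restricting the loss to response positions through `loss_mask` is an assumption (a common choice in visual instruction tuning) that this summary does not confirm, and the function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def response_cross_entropy(logits_per_step, target_ids, loss_mask):
    """Mean negative log-likelihood over positions where loss_mask is 1."""
    total, count = 0.0, 0
    for logits, target, keep in zip(logits_per_step, target_ids, loss_mask):
        if keep:
            total -= math.log(softmax(logits)[target])
            count += 1
    return total / count if count else 0.0


# Three positions over a 4-token vocabulary; the first position is masked out
# (for example, an instruction token), the last two are response tokens.
logits = [[5.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
loss = response_cross_entropy(logits, target_ids=[1, 2, 3], loss_mask=[0, 1, 1])
print(loss)
```

With uniform logits over four tokens, each unmasked position contributes log(4), which makes the example easy to check by hand.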

Training Data:

TRINS-Cap: 39,153 text-rich images with human-annotated captions
TRINS-VQA: 102,437 QA pairs generated via GPT-4 using captions and OCR data
Combined with 158K LLaVA instruction-following data

Key Hyperparameters:

learning_rate: 2e-5
batch_size: 32
schedule: cosine annealing
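
For reference, a cosine annealing schedule starting from the listed learning rate can be sketched as follows. The total step count, the absence of warmup, and a final learning rate of zero are assumptions; the summary only gives the peak rate and the schedule type.

```python
import math


def cosine_lr(step: int, total_steps: int, base_lr: float = 2e-5, min_lr: float = 0.0) -> float:
    """Cosine annealing from base_lr at step 0 down to min_lr at total_steps."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


total_steps = 1000  # hypothetical; not reported in the summary
for step in (0, 250, 500, 750, 1000):
    print(step, cosine_lr(step, total_steps))
```

The rate falls slowly at first, fastest around the midpoint (where it is half of the peak), and flattens out again near the end of training.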

Compute: NVIDIA A100 80GB GPUs

Comparison to Prior Work

vs. LLaVA: LaRA explicitly adds OCR tokens to the input, whereas LLaVA relies solely on the visual encoder
vs. Qwen-VL: LaRA uses a simpler, lower-resolution visual encoder (336px) augmented with OCR, while Qwen-VL relies on higher resolution (448px) and massive pre-training
vs. LLaVAR: LaRA achieves better performance by utilizing the TRINS dataset which has longer, more detailed captions and denser text

Limitations

Relies on external OCR tools (Azure/PaddleOCR), inheriting their errors and latency
Visual encoder resolution (336x336) is a bottleneck for layout understanding despite OCR aid
If the external OCR tool fails, the model has only limited ability to read text from visual features alone

Reproducibility

Code: https://github.com/maker-mllm/maker

The TRINS dataset and the LaRA code are publicly available. The exact training time is not reported. Full replication depends on proprietary APIs (Azure Read API for OCR, GPT-4 for data generation), which creates a closed-source dependency.

📊 Experiments & Results

Evaluation Setup

Zero-shot and fine-tuned evaluation on text-rich image understanding and captioning

Benchmarks:

TRINS-VQA (Text-Rich Visual Question Answering, with Extract and Abstract subsets) [New]
TRINS-Cap (Text-Rich Image Captioning) [New]
OCRBench (Comprehensive OCR capabilities benchmark)
TextVQA/DocVQA (established scene-text and document VQA benchmarks)

Metrics:

Accuracy (for Extract questions)
BLEU (B@1, B@4)
METEOR
ROUGE
CIDEr
Statistical methodology: Not explicitly reported in the paper
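
To make the accuracy metric concrete, the sketch below scores a prediction as correct when the normalized reference answer appears inside the normalized prediction. This matching rule is an assumption chosen for illustration; the summary does not say how the paper scores Extract questions, and the function names are hypothetical.

```python
import re


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def extract_accuracy(predictions, references) -> float:
    """Percentage of predictions that contain their reference answer after normalization."""
    hits = sum(
        normalize(ref) in normalize(pred)
        for pred, ref in zip(predictions, references)
    )
    return 100.0 * hits / len(references)


predictions = ["The title is The Great Gatsby.", "It was written by John Smith"]
references = ["the great gatsby", "Jane Doe"]
print(extract_accuracy(predictions, references))
```

A containment rule of this kind tolerates verbose answers, which matters for instruction-tuned models that tend to reply in full sentences; a stricter exact-match rule would give lower scores on the same outputs.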

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
TRINS-VQA (Extract)	Accuracy	38.8 (LLaVA 1.5)	62.8	+24.0
TRINS-VQA (Abstract)	CIDEr	79.4 (Qwen-VL)	186.6	+107.2
TRINS-VQA (Abstract)	CIDEr	97.2	186.6	+89.4
OCRBench	Final Score	346 (LLaVAR)	548	+202
TRINS-Cap	CIDEr	8.0	46.7	+38.7

Performance on the new TRINS-VQA benchmark shows LaRA ahead of the baselines both in extracting specific information and in generating abstract descriptions.
Evaluation on the external OCRBench dataset indicates that the gains generalize beyond TRINS.
Captioning performance on TRINS-Cap shows LaRA generating more accurate and comprehensive descriptions.
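
The Δ column is simply the score reported for this paper minus the baseline score. The short check below recomputes it from the values in the table; the variable and function names are hypothetical, and rounding to one decimal place is used only to avoid floating-point noise.

```python
rows = [
    ("TRINS-VQA (Extract)", "Accuracy", 38.8, 62.8),
    ("TRINS-VQA (Abstract)", "CIDEr", 79.4, 186.6),
    ("TRINS-VQA (Abstract)", "CIDEr", 97.2, 186.6),
    ("OCRBench", "Final Score", 346.0, 548.0),
    ("TRINS-Cap", "CIDEr", 8.0, 46.7),
]


def delta(baseline: float, ours: float) -> float:
    """Absolute improvement, rounded to one decimal place."""
    return round(ours - baseline, 1)


for benchmark, metric, baseline, ours in rows:
    print(f"{benchmark:22s} {metric:12s} {delta(baseline, ours):+.1f}")
```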

Experiment Figures

Statistical comparison of TRINS against other datasets (OCR word count, Caption length, Question length, Answer length).

Main Takeaways

LaRA outperforms the compared baselines on text-rich image understanding tasks (for example, 548 vs. 346 on OCRBench and 62.8% vs. 38.8% on TRINS-VQA Extract), which supports the OCR-injection strategy.
The TRINS dataset provides much longer and more detailed annotations (avg 65.1 words) compared to COCO (10.6) or TextCaps (12.4), enabling better training for abstract reasoning.
High-resolution encoders (like in Qwen-VL) help, but explicit OCR tokens (as in LaRA) provide a more efficient path to text understanding in images.
Fine-tuning on TRINS improves performance on general visual benchmarks (e.g., VSR, VizWiz) slightly, suggesting no degradation in general capabilities.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Multimodal Large Language Models (MLLMs)
Familiarity with Instruction Tuning
Basic knowledge of Optical Character Recognition (OCR) systems

Key Terms

OCR: Optical Character Recognition—technology that converts different types of documents, such as scanned paper documents, PDF files, or images captured by a digital camera, into editable and searchable data

VQA: Visual Question Answering—a task where a system is given an image and a question about the image, and must produce an answer

Instruction Tuning: Fine-tuning language models on datasets of (instruction, output) pairs to improve their ability to follow user commands

CLIP: Contrastive Language-Image Pre-training—a neural network trained on a variety of (image, text) pairs suitable for zero-shot learning

Hallucination: A phenomenon where a model generates content that is nonsensical or unfaithful to the source content (e.g., describing objects not present in the image)

CIDEr: Consensus-based Image Description Evaluation—a metric used to evaluate image captioning quality by comparing generated captions to human reference captions

BLEU: Bilingual Evaluation Understudy—a metric for evaluating the quality of text which has been machine-translated from one natural language to another

ROUGE: Recall-Oriented Understudy for Gisting Evaluation—a set of metrics used to evaluate automatic summarization and machine translation software in natural language processing

METEOR: Metric for Evaluation of Translation with Explicit ORdering—a metric for the evaluation of machine translation output