PaddleOCR-VL-1.5: Towards a Multi-Task 0.9B VLM for Robust In-the-Wild Document Parsing

📝 Paper Summary

Document Parsing, Vision-Language Models (VLMs), Optical Character Recognition (OCR)

PaddleOCR-VL-1.5 combines a geometry-aware layout engine with a compact 0.9B vision-language model to achieve robust parsing of physically distorted documents and support text spotting.

Core Problem

Existing document parsers are optimized for flat, digital-born documents and fail when facing real-world physical distortions like warping, skewing, and erratic lighting.

Why it matters:

Real-world documents captured via mobile phones often exhibit non-rigid warping and severe perspective skew, breaking standard axis-aligned detection models.
Accurate parsing is the prerequisite for RAG (Retrieval-Augmented Generation) systems to ingest knowledge, but current failures in layout analysis lead to jumbled or missing context.
Most state-of-the-art models are too computationally heavy or lack specific training for 'in-the-wild' noise like screen moiré patterns or seals.

Concrete Example: When parsing a receipt photographed at a steep angle with a curved surface, standard models predict overlapping rectangular boxes that mix text columns. PaddleOCR-VL-1.5 uses pixel-accurate segmentation to isolate the curved text regions and correctly orders them.
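The box-versus-mask contrast in this example can be illustrated with a toy sketch. It assumes two text columns that drift sideways row by row, as they might on a skewed photo. The helper names (skewed_column, bounding_box, boxes_overlap) are hypothetical and are not part of PaddleOCR.

```python
# Toy illustration: regions are sets of (x, y) pixel coordinates.

def skewed_column(x_start, width, height, shear):
    """Pixels of a column whose rows shift right by `shear` pixels per row."""
    return {(x_start + shear * y + dx, y)
            for y in range(height) for dx in range(width)}

def bounding_box(mask):
    """Axis-aligned box (x_min, y_min, x_max, y_max) enclosing a pixel set."""
    xs = [x for x, _ in mask]
    ys = [y for _, y in mask]
    return min(xs), min(ys), max(xs), max(ys)

def boxes_overlap(a, b):
    """True if two inclusive axis-aligned boxes share at least one pixel."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

left = skewed_column(x_start=0, width=4, height=10, shear=1)
right = skewed_column(x_start=6, width=4, height=10, shear=1)

# The enclosing rectangles collide even though the columns never touch.
box_conflict = boxes_overlap(bounding_box(left), bounding_box(right))
mask_conflict = bool(left & right)
print(box_conflict, mask_conflict)
```

In this toy case the rectangles overlap while the pixel sets stay disjoint, which is the failure mode that mask-based layout analysis is meant to avoid.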

Key Novelty

Geometry-Aware Robust Document Parsing (PaddleOCR-VL-1.5)

Replaces standard bounding box detection with PP-DocLayoutV3, a mask-based instance segmentation engine that handles non-planar/warped document shapes.
Integrates reading order prediction directly into the layout transformer's decoder, allowing simultaneous geometric localization and logical sequencing in one pass.
Expands the VLM's capabilities to include Seal Recognition and Text Spotting (grounded OCR) within a compact 0.9B parameter budget.

Architecture

The dual-path framework for Document Parsing and Text Spotting.

Evaluation Highlights

Achieves 94.5% accuracy on OmniDocBench v1.5, establishing a new state-of-the-art for general document parsing.
Achieves 92.05% overall accuracy on the newly curated Real5-OmniDocBench, which specifically targets physical distortions like warping and skew.
Outperforms massive general VLMs (Vision-Language Models) like Qwen3-VL-235B and Gemini-3 Pro on robustness benchmarks despite having only 0.9B parameters.

Breakthrough Assessment

8/10

Significant engineering advance in robust document parsing. The shift to mask-based layout analysis for distorted docs addresses a major pain point, and the 0.9B efficiency is highly practical.

⚙️ Technical Details

Problem Definition

Setting: Multi-task document intelligence including parsing (layout + content) and spotting (location + text)

Inputs: Document image I containing text, tables, formulas, or seals

Outputs: Structured text (Markdown/JSON) representing layout and content, or text with bounding coordinates

Pipeline Flow

Layout Analysis (PP-DocLayoutV3)
Element Recognition (PaddleOCR-VL-1.5-0.9B)
Post-processing & Formatting
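The three stages above can be sketched as a simple function composition. This is a minimal illustration, not the PaddleOCR API: Region, parse_document, and the stub stage functions are all hypothetical names, and the real stages are neural models rather than lambdas or stubs.

```python
from dataclasses import dataclass

@dataclass
class Region:
    kind: str          # e.g. "text", "table", "formula", "seal"
    order: int         # reading-order index assigned by the layout stage
    content: str = ""  # filled in by the recognition stage

def parse_document(image, layout_fn, recognize_fn, format_fn):
    """Run layout analysis, then per-region recognition, then formatting."""
    regions = sorted(layout_fn(image), key=lambda r: r.order)
    for region in regions:
        region.content = recognize_fn(image, region)
    return format_fn(regions)

# Stub stages standing in for the layout model, the VLM, and the formatter.
def fake_layout(image):
    return [Region("text", order=1), Region("title", order=0)]

def fake_recognize(image, region):
    return f"<{region.kind}>"

def fake_format(regions):
    return "\n".join(r.content for r in regions)

result = parse_document(None, fake_layout, fake_recognize, fake_format)
print(result)
```

Passing the stages in as functions keeps the sketch independent of any particular model and makes the data flow between stages explicit.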

System Modules

PP-DocLayoutV3

Detect document elements (text, tables, figures) and determine reading order, handling physical distortions

Model or implementation: RT-DETR based Transformer with Mask head

PaddleOCR-VL-1.5-0.9B

Recognize text and content within identified regions or perform end-to-end spotting

Model or implementation: NaViT Visual Encoder + ERNIE-4.5-0.3B LLM

Post-Processor

Assemble recognized content into structured output

Model or implementation: Rule-based engine
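A rule-based assembly step could look roughly like the sketch below. The source does not list the actual rules, so the mapping from element type to Markdown and the function name to_markdown are assumptions made purely for illustration.

```python
def to_markdown(elements):
    """Join (kind, text) pairs, already in reading order, into Markdown."""
    # Hypothetical formatting rules; the real post-processor's rules are not given.
    rules = {
        "title": lambda t: f"# {t}",
        "formula": lambda t: f"$$ {t} $$",
    }
    blocks = [rules.get(kind, lambda t: t)(text) for kind, text in elements]
    return "\n\n".join(blocks)

doc = to_markdown([("title", "Invoice"), ("text", "Total: 42")])
print(doc)
```

Element types without a rule pass through unchanged, so new types do not break the assembly.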

Novel Architectural Elements

Integration of Reading Order Prediction directly into the RT-DETR decoder queries, replacing decoupled pointer networks
Transition from box-based detection to mask-based instance segmentation for layout elements to handle warping
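To make the idea of reading order from precedence relationships concrete, here is a minimal sketch of one way to turn pairwise precedence scores into a sequence. The decoding rule (rank each element by the total score with which it precedes the others) is an assumption chosen for simplicity; the source does not specify how PP-DocLayoutV3 decodes its order.

```python
def decode_reading_order(precedence):
    """precedence[i][j] is a score in [0, 1] that element i comes before j.

    Returns element indices sorted from first to last by summed score.
    """
    n = len(precedence)
    votes = [sum(precedence[i][j] for j in range(n) if j != i)
             for i in range(n)]
    return sorted(range(n), key=lambda i: -votes[i])

# Three detected elements; element 2 is most likely first, then 0, then 1.
scores = [
    [0.0, 0.7, 0.1],
    [0.3, 0.0, 0.2],
    [0.9, 0.8, 0.0],
]
order = decode_reading_order(scores)
print(order)
```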

Modeling

Base Model: ERNIE-4.5-0.3B (Language Backbone) with NaViT-style Vision Encoder

Training Method: Three-stage pipeline: Pre-training -> SFT -> GRPO (Group Relative Policy Optimization)

Objective Functions:

Purpose: Optimize geometric localization and logical sequencing simultaneously.
Formally: Joint loss including classification, mask segmentation, and pairwise precedence scoring for reading order.

Purpose: Align visual features with text and teach coordinate prediction.
Formally: Next-token prediction loss on image-text pairs.

Purpose: Unify label styles and handle rare cases.
Formally: Group Relative Policy Optimization (GRPO) maximizing reward based on style consistency and correctness.
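The "group relative" part of GRPO can be sketched as follows: several outputs are sampled for the same input, each is scored, and every reward is normalized against the group's mean and standard deviation. This is a minimal sketch of that normalization only. The reward values are made up, and the use of the population standard deviation and the eps constant are assumptions.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rewards for four sampled parses of the same image.
advantages = group_relative_advantages([1.0, 0.0, 0.5, 0.5])
print(advantages)
```

Outputs that score above the group average get a positive advantage and are reinforced; below-average outputs are discouraged, without a separate value network.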

Adaptation: Full fine-tuning of the 0.9B model

Trainable Parameters: 0.9 Billion

Training Data:

Pre-training: 46 million image-text pairs (scaled up from 29M)
Layout Training: 38k manually annotated document samples with reading order
Spotting: Maximum input resolution increased to 2048x28x28 pixels (about 1.6 million pixels)
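If the spotting figure is read as a pixel budget of 2048 tiles of 28x28 pixels each (an interpretation, not something the source states explicitly), the arithmetic and a possible downscaling rule look like this. The function fit_to_budget is a hypothetical helper.

```python
import math

MAX_PIXELS = 2048 * 28 * 28  # 1,605,632 pixels

def fit_to_budget(width, height, max_pixels=MAX_PIXELS):
    """Shrink (width, height) uniformly so the area stays within max_pixels."""
    pixels = width * height
    if pixels <= max_pixels:
        return width, height
    scale = math.sqrt(max_pixels / pixels)
    return math.floor(width * scale), math.floor(height * scale)

print(MAX_PIXELS)
print(fit_to_budget(1000, 1000))   # already within budget, unchanged
print(fit_to_budget(4000, 3000))   # downscaled, aspect ratio roughly kept
```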

Key Hyperparameters:

learning_rate: 2e-4 (Layout training)
weight_decay: 0.0001
batch_size: 32 (Layout training)
epochs: 150 (Layout training)
optimizer: AdamW
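For reference, the listed values can be collected into a single configuration object. This is only a sketch: LayoutTrainConfig is a hypothetical name, only the values listed above are included, and it assumes that the weight decay and optimizer apply to layout training like the other entries.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LayoutTrainConfig:
    learning_rate: float = 2e-4
    weight_decay: float = 1e-4   # assumed to apply to layout training
    batch_size: int = 32
    epochs: int = 150
    optimizer: str = "AdamW"     # assumed to apply to layout training

cfg = LayoutTrainConfig()
print(cfg)
```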

Comparison to Prior Work

vs. DeepSeek-OCR: PaddleOCR-VL-1.5 focuses on handling physical distortions (warps/seals) explicitly via segmentation masks, whereas DeepSeek focuses on high-ratio compression
vs. MinerU2.5: MinerU typically handles planar documents; PaddleOCR-VL-1.5 is engineered for 'in-the-wild' non-planar photography
vs. Qwen3-VL-235B: PaddleOCR-VL-1.5 achieves better robustness on distorted docs with roughly 260x fewer parameters (0.9B vs 235B)

Limitations

No baseline metric values (e.g., the exact accuracy of Qwen3-VL-235B or Gemini-3 Pro) are given, so the size of the claimed improvement cannot be quantified.
Reliance on large-scale proprietary pre-training data (46M pairs) limits full reproduction from scratch.
The 0.9B model size, while efficient, may lack the broad world knowledge of larger VLMs for reasoning tasks beyond parsing.

Reproducibility

Code: https://github.com/PaddlePaddle/PaddleOCR

Code is publicly available in the PaddleOCR repository, and model weights are on Hugging Face. The Real5-OmniDocBench evaluation benchmark is a curated subset of OmniDocBench v1.5. The pre-training datasets (46M pairs) are internal/proprietary.

📊 Experiments & Results

Evaluation Setup

Document parsing accuracy evaluated on standard and distortion-focused benchmarks.

Benchmarks:

OmniDocBench v1.5 (General Document Parsing)
Real5-OmniDocBench (Robustness to Physical Distortions) [New]

Metrics:

Accuracy (Overall Parsing)
Recall/Precision (implicitly via accuracy)
Statistical methodology: Not explicitly reported in the paper

Main Takeaways

PaddleOCR-VL-1.5 achieves 94.5% accuracy on OmniDocBench v1.5, setting a new SOTA.
On the new Real5-OmniDocBench (focused on distortions), the model achieves 92.05% accuracy, demonstrating superior robustness compared to standard baselines.
The model significantly outperforms much larger VLMs (Qwen3-VL-235B, Gemini-3 Pro) on document parsing tasks, validating the efficiency of the specialized 0.9B architecture.
Jointly optimizing detection, segmentation, and reading order in PP-DocLayoutV3 eliminates cascading errors common in multi-stage pipelines.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Transformer architectures (Encoder-Decoder)
Knowledge of Object Detection (specifically DETR/RT-DETR frameworks)
Familiarity with RL alignment techniques like PPO or GRPO

Key Terms

VLM: Vision-Language Model—a model capable of processing and understanding both images and text inputs

OCR: Optical Character Recognition—conversion of images of typed, handwritten, or printed text into machine-encoded text

RAG: Retrieval-Augmented Generation—systems that improve LLM outputs by referencing external knowledge bases

RT-DETR: Real-Time DEtection TRansformer—an efficient object detection architecture used here as the backbone for layout analysis

NaViT: Native Resolution Vision Transformer—a visual encoder that processes images at their original aspect ratios to avoid resizing artifacts

GRPO: Group Relative Policy Optimization—a reinforcement learning algorithm that updates policies based on relative performance within a group of samples, used here to align output styles

SFT: Supervised Fine-Tuning—training a model on labeled datasets to specialize it for specific tasks

Text Spotting: The task of simultaneously detecting the location of text and recognizing its content

Mask-based detection: Predicting pixel-level shapes (masks) rather than just rectangular boxes, essential for non-rectangular (warped) elements

Global Pointer Mechanism: A technique used here to predict the reading order by modeling the precedence relationships between document elements