
On Scaling Up a Multilingual Vision and Language Model

X Chen, J Djolonga, P Padlewski, B Mustafa…
Google
Proceedings of the IEEE …, 2024 (2024)
MM Pretraining Benchmark QA

📝 Paper Summary

Vision-Language Pretraining (VLP), Multimodal Large Language Models (MLLMs), Multilingual Vision-Language Modeling
PaLI-X scales a multilingual vision-language model to 55 billion parameters by jointly scaling its vision and language components and training on a diverse mixture of objectives, achieving state-of-the-art performance across 20+ benchmarks.
Core Problem
Prior vision-language models typically scale only one component (vision or language) or rely heavily on external OCR systems, limiting their ability to perform complex tasks like document understanding and multilingual reasoning.
Why it matters:
  • Unilateral scaling (scaling only text or only vision) creates bottlenecks in multimodal understanding.
  • Existing models struggle with tasks requiring fine-grained text-in-image understanding (e.g., charts, infographics) without specialized pipelines.
  • Training recipes tuned for few-shot learning often degrade fine-tuned performance (and vice versa); finding a recipe that balances both is crucial for general-purpose models.
Concrete Example: In complex counting tasks like 'how many giraffes are drinking water', smaller models or those with weak vision backbones fail to align the specific action 'drinking' with the objects 'giraffes', leading to incorrect counts. PaLI-X solves this by processing high-resolution visual inputs alongside language.
Key Novelty
Jointly Scaled Multilingual Vision-Language Model (PaLI-X)
  • Scales both the visual encoder (ViT-22B) and the language model (32B) simultaneously, maintaining a balanced capacity split (22B of the 55B total is ~40% vision, leaving ~60% for language), unlike prior works that skew heavily towards one component.
  • Integrates OCR-specific pretraining objectives (e.g., spotting text in images) directly into the visual encoder, allowing the model to 'read' text in images without always needing external tools.
  • Utilizes a mixture of objectives (prefix-completion and masked-token completion) to improve the Pareto frontier between few-shot capability and fine-tuning performance.
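The two objective families named in the last bullet can be illustrated with a minimal sketch. It only shows how a single caption might be turned into an (input, target) pair under each objective. The sentinel token, the split fraction, the mask rate, and the 50/50 sampling probability are all assumptions made for illustration and are not taken from the paper; the actual PaLI-X mixture also contains other objectives (for example the OCR-oriented ones) that are not shown here.

```python
import random

# Hypothetical mask token; the paper's actual vocabulary is not reproduced here.
SENTINEL = "<mask>"


def prefix_completion(tokens, split_frac=0.5):
    """Split a caption into (visible prefix, continuation to predict).

    split_frac is an illustrative assumption. For captions with fewer
    than two tokens the target is empty.
    """
    cut = max(1, min(len(tokens) - 1, int(len(tokens) * split_frac)))
    return tokens[:cut], tokens[cut:]


def masked_token_completion(tokens, mask_rate=0.25, rng=None):
    """Hide a random subset of tokens; the target is the hidden tokens in order.

    mask_rate is an illustrative assumption.
    """
    rng = rng or random.Random(0)
    n_mask = max(1, round(len(tokens) * mask_rate))
    masked_idx = set(rng.sample(range(len(tokens)), n_mask))
    inputs = [SENTINEL if i in masked_idx else t for i, t in enumerate(tokens)]
    targets = [t for i, t in enumerate(tokens) if i in masked_idx]
    return inputs, targets


def sample_example(tokens, p_prefix=0.5, rng=None):
    """Draw one training example from the two-objective mixture."""
    rng = rng or random.Random(0)
    if rng.random() < p_prefix:
        return ("prefix", *prefix_completion(tokens))
    return ("masked", *masked_token_completion(tokens, rng=rng))


caption = "two giraffes are drinking water".split()
prefix_inputs, prefix_targets = prefix_completion(caption)
masked_inputs, masked_targets = masked_token_completion(caption)
kind, mix_inputs, mix_targets = sample_example(caption)
```

In a real pipeline the image would accompany every example and the sampling probabilities would be tuned; the point of the sketch is only that both objectives reuse the same caption data with different input/target splits.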
Evaluation Highlights
  • Achieves 86.0 accuracy on VQAv2 (test-std), surpassing the previous state of the art of 84.3 set by PaLI.
  • Improves TallyQA (complex counting) performance by 18.8 points over specialized counting models like MoVie.
  • Reaches 84.5 accuracy on TextVQA, 4.6 points above the previous best of 79.9.
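As a quick sanity check, the absolute margins implied by the scores above can be recomputed directly. The sketch uses only the numbers quoted in this summary; TallyQA is omitted because only its margin, not the underlying scores, is given here. The variable and function names are illustrative.

```python
# (new score, previous best) pairs as quoted in the highlights above.
reported = {
    "VQAv2 (test-std)": (86.0, 84.3),
    "TextVQA": (84.5, 79.9),
}


def margin(new, old):
    """Absolute improvement in accuracy points, rounded to one decimal."""
    return round(new - old, 1)


margins = {name: margin(new, old) for name, (new, old) in reported.items()}

for name, gain in margins.items():
    print(f"{name}: +{gain} points")
```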
Breakthrough Assessment
9/10
Sets new state-of-the-art results on over 20 diverse benchmarks. Demonstrates strong emergent properties (counting, multilingual detection) and effectively balances few-shot and fine-tuning performance at scale.