
AgriGPT-VL: Agricultural Vision-Language Understanding Suite

Bo Yang, Yunkui Chen, Lanfei Feng, Yu Zhang, Xiao Xu, Jianyu Zhang, Nueraili Aierken, Runhe Huang, Hongjian Lin, Yibin Ying, Shijian Li
Zhejiang University, Hosei University
arXiv (2025)
MM, Benchmark, RL, Pretraining, QA

📝 Paper Summary

Agricultural Vision-Language Models, Domain-Specific Multimodal LLMs
AgriGPT-VL establishes a unified agricultural AI ecosystem: it generates a large domain-specific multimodal dataset (Agri-3M-VL) via multi-agent refinement and trains a specialized model with progressive curriculum alignment.
Core Problem
General-purpose multimodal models lack specialized knowledge for agriculture, leading to factual inaccuracies and hallucinations when interpreting crop or pest imagery.
Why it matters:
  • Existing agricultural models are mostly text-only (e.g., AgriGPT) or limited to narrow classification tasks (e.g., pest recognition), and lack complex reasoning capabilities
  • General models (GPT-4V, LLaVA) trained on web data fail to capture specialized agricultural semantics essential for real-world farming decisions
  • Fragmented resources prevent scalable progress; no single ecosystem integrates large-scale data, specialized modeling, and rigorous benchmarking
Concrete Example: When asked to identify a specific pest or diagnose a crop disease from an image, general models often give generic or hallucinated answers because they lack domain-specific visual grounding. AgriGPT-VL, trained on specialized data, is intended to identify the species accurately and suggest management steps.
Key Novelty
AgriGPT-VL Suite (Dataset + Model + Benchmark)
  • Constructs the largest agricultural V-L dataset (Agri-3M-VL) using a transferable 'Data Generator' pipeline that synthesizes captions and QA pairs from raw images, refined by a multi-agent team (Feedback, Evaluation, Rethinking agents)
  • Trains a specialized VLM using a progressive curriculum: starts with text-only grounding, moves to shallow caption alignment, then deep VQA reasoning, and finishes with GRPO reinforcement learning
  • Establishes a rigorous benchmark (AgriBench-VL-4K) with held-out images and disjoint data generation patterns to ensure objective evaluation
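The final curriculum stage named above is GRPO reinforcement learning. As a rough illustration of the core idea in GRPO (scoring each sampled response relative to the other responses in its group, rather than against a learned value function), here is a minimal sketch. The reward values, the function name `group_relative_advantages`, and the use of the population standard deviation are assumptions for illustration; the summary does not state the reward design or hyperparameters used for AgriGPT-VL.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses sampled for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps).
    Population std is an assumption; implementations differ on this detail.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical rewards for four responses to one image-question pair.
rewards = [1.0, 0.0, 0.5, 0.5]
advantages = group_relative_advantages(rewards)
```

Responses scoring above the group mean receive positive advantages and are reinforced; those below the mean are discouraged. When every response in a group earns the same reward, all advantages are zero and that group contributes no learning signal.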
Architecture
Figure 3: The Data Generator pipeline, showing the flow from raw images to final instruction data via multi-agent refinement.
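One way to picture a multi-agent refinement loop of this kind is sketched below. The summary names three agents (Feedback, Evaluation, Rethinking) but does not describe their interfaces or ordering, so everything here is an assumption: the `Sample` type, the `refine` function, the evaluate-then-critique-then-revise order, the round limit, and the toy stand-in agents. In a real pipeline each agent would presumably be a model call.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

@dataclass
class Sample:
    image_id: str
    question: str
    answer: str

def refine(
    sample: Sample,
    evaluate: Callable[[Sample], bool],
    give_feedback: Callable[[Sample], str],
    rethink: Callable[[Sample, str], Sample],
    max_rounds: int = 3,
) -> Tuple[Sample, bool]:
    """Revise a draft until the evaluator accepts it or the rounds run out."""
    for _ in range(max_rounds):
        if evaluate(sample):
            return sample, True
        critique = give_feedback(sample)
        sample = rethink(sample, critique)
    return sample, evaluate(sample)

# Toy stand-ins for the three agents (hypothetical rules, not the paper's).
def evaluate(s: Sample) -> bool:
    return len(s.answer.split()) >= 5

def give_feedback(s: Sample) -> str:
    return "Answer is too short; add a management step."

def rethink(s: Sample, critique: str) -> Sample:
    return Sample(s.image_id, s.question, s.answer + " Monitor and apply targeted control.")

draft = Sample("img_001", "Which pest is shown?", "Aphids.")
refined, accepted = refine(draft, evaluate, give_feedback, rethink)
```

Samples that are still rejected after the last round can be dropped or flagged, which keeps low-quality synthetic pairs out of the final instruction data.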
Evaluation Highlights
  • AgriGPT-VL outperforms general-purpose models (InternVL-2-8B, Qwen2-VL-7B) on AgriBench-VL-4K, achieving higher pairwise win rates in LLM-as-a-judge evaluation
  • Maintains strong text-only performance on AgriBench-13K comparable to specialized text models, showing no degradation in language ability despite multimodal tuning
  • Ablation studies confirm consistent gains from each training stage, with GRPO refinement providing the final boost in reasoning accuracy
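For readers unfamiliar with pairwise win rates, the following sketch shows how such a score can be computed from per-question judge verdicts. The verdict labels, the sample data, and the convention of counting a tie as half a win are assumptions for illustration; the summary does not specify the judging protocol.

```python
from collections import Counter

def pairwise_win_rate(verdicts, model="A"):
    """Fraction of comparisons won by `model`; a tie counts as half a win."""
    counts = Counter(verdicts)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no verdicts to score")
    return (counts[model] + 0.5 * counts["tie"]) / total

# Hypothetical judge verdicts for five benchmark questions.
verdicts = ["A", "A", "B", "tie", "A"]
rate_a = pairwise_win_rate(verdicts, "A")
rate_b = pairwise_win_rate(verdicts, "B")
```

With ties split evenly, the two models' rates sum to one, so a rate above 0.5 means the model is preferred over its opponent more often than not.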
Breakthrough Assessment
8/10
Significant contribution to domain-specific AI. The scale of the dataset (3M) and the rigorous multi-agent data generation pipeline set a new standard for agricultural VLMs, moving beyond simple classification to complex reasoning.