AgroBench: Vision-Language Model Benchmark in Agriculture

📝 Paper Summary

Agricultural Vision-Language Benchmarks
Crop and Disease Identification

AgroBench is a large-scale, expert-annotated benchmark for evaluating Vision-Language Models on diverse agricultural tasks. It reveals that even state-of-the-art models struggle with fine-grained identification tasks such as weed identification.

Core Problem

Existing agricultural VLM benchmarks rely on synthetic data generated by models like GPT-4, lack expert validation, and cover limited categories, making it difficult to reliably assess real-world applicability.

Why it matters:

Reliable automated disease and pest identification is critical for sustainable food production and minimizing economic losses.
Benchmarks generated synthetically by black-box models often contain hallucinations or inaccuracies that go unverified, misleading researchers about true model performance.
Farmers need integrated systems that handle diverse tasks (from diagnosis to machinery usage), but current evaluations are narrow and fragmented.

Concrete Example: In weed identification, open-source VLMs perform close to random guessing (roughly 24-30% accuracy) because they lack fine-grained training on specific weed species, a failure that simpler, less rigorous benchmarks obscure.

Key Novelty

AgroBench (Agronomist AI Benchmark)

Comprehensive expert annotation: Unlike synthetic datasets, all 4,342 QA pairs are verified by human agronomists to ensure factual correctness.
State-of-the-art coverage: Spans 7 distinct tasks (including Disease Management and Machine Usage), 682 disease categories, and 203 crop types, significantly more than prior work.
Real-world focus: Prioritizes images from real farm settings over lab conditions and includes complex reasoning tasks like 'Machine Usage' and 'Traditional Management' often ignored in vision-only datasets.

Architecture

Overview of AgroBench statistics and scope compared to previous datasets

Evaluation Highlights

GPT-4o achieves the highest overall accuracy of 79.26%, outperforming human baselines (36.79% on a subset) and open-source models.
Weed Identification (WID) is the hardest task: Open-source models like LLaVA-Next-8B score only 30.05%, barely above random chance (17.90%).
Chain-of-Thought (CoT) prompting provides marginal gains, boosting overall accuracy from 70.57% to 73.71% in one-shot settings but saturating quickly.

Breakthrough Assessment

9/10

Sets a new standard for agricultural VLM benchmarking with expert-verified data and massive category coverage, exposing significant gaps in current model capabilities.

⚙️ Technical Details

Problem Definition

Setting: Multiple-choice Visual Question Answering (VQA) across 7 agricultural domains

Inputs: An image I of a crop, pest, weed, or machine, and a natural language question q with 5 candidate answers

Outputs: The predicted correct answer option (A, B, C, D, or E)
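
The setting above can be made concrete with a small data container. The following is a minimal sketch, not the benchmark's actual schema: the `AgroItem` class, its field names, and the prompt wording in `format_prompt` are all assumptions introduced for illustration. Only the five-option A-E format comes from the problem definition.

```python
from dataclasses import dataclass
from typing import List

# The five answer labels come from the problem definition above.
OPTION_LABELS = ["A", "B", "C", "D", "E"]


@dataclass
class AgroItem:
    """Hypothetical container for one multiple-choice VQA item."""
    image_path: str     # path to the crop / pest / weed / machine image
    task: str           # task abbreviation, e.g. "WID" or "DID"
    question: str       # natural language question q
    choices: List[str]  # exactly five candidate answers
    answer: str         # gold label, one of A-E

    def __post_init__(self) -> None:
        if len(self.choices) != len(OPTION_LABELS):
            raise ValueError("expected exactly 5 candidate answers")
        if self.answer not in OPTION_LABELS:
            raise ValueError("gold answer must be one of A-E")


def format_prompt(item: AgroItem) -> str:
    """Render the question and lettered options as one text prompt.

    The instruction line is an assumed wording, not the paper's prompt.
    """
    lines = [item.question]
    lines += [f"{label}. {choice}"
              for label, choice in zip(OPTION_LABELS, item.choices)]
    lines.append("Answer with a single letter (A-E).")
    return "\n".join(lines)
```

Validating the option count at construction time keeps malformed items out of the evaluation loop, so a scoring error later cannot be caused by a four-option question.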

Pipeline Flow

Input: Image + Question + 5 Choices
VLM Processing (Zero-shot or Few-shot CoT)
Output: Selected Answer Option
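
The three pipeline steps can be sketched as a single function. This is an illustration under stated assumptions: `model` is a hypothetical callable taking an image path and a prompt and returning free text, and the letter-extraction rule (take the last standalone capital A-E) is a simple heuristic chosen here, not the parsing used in the paper.

```python
import re
from typing import Callable, Optional

# Hypothetical model interface: (image_path, prompt) -> free-text response.
ModelFn = Callable[[str, str], str]


def extract_choice(response: str) -> Optional[str]:
    """Return the last standalone capital letter A-E, or None if absent.

    Taking the last match is a heuristic: with Chain-of-Thought output
    the final answer usually follows the reasoning.
    """
    matches = re.findall(r"\b([A-E])\b", response)
    return matches[-1] if matches else None


def answer_item(model: ModelFn, image_path: str, prompt: str) -> Optional[str]:
    """Input (image + prompt) -> VLM processing -> selected option."""
    response = model(image_path, prompt)
    return extract_choice(response)


def stub_model(image_path: str, prompt: str) -> str:
    """Stand-in for a real VLM call; always gives the same answer."""
    return "The leaf shape suggests the third option, so the answer is (C)."
```

Calling `answer_item(stub_model, "field.jpg", "Which weed is shown? ...")` runs the full flow with the stub. A response with no recognizable letter yields `None`, which a scorer can count as incorrect.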

System Modules

VLM Evaluator

Processes the image and text input and selects one answer from the candidate options

Model or implementation: Various (e.g., GPT-4o, LLaVA-Next, Qwen-VL)

Modeling

Base Model: Evaluates multiple models: GPT-4o, Gemini 1.5 Pro, Qwen-VL-72B, LLaVA-Next-72B, etc.

Training Method: Zero-shot and Few-shot (CoT) evaluation only (no training proposed)

Compute: Not reported in the paper
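
To illustrate the few-shot Chain-of-Thought setting named above, the sketch below assembles a k-shot prompt from worked exemplars. The function name, the exemplar tuple layout, and the "Reasoning:" / "Answer:" cue words are assumptions for illustration; the paper's exact prompts are not reproduced in this summary.

```python
from typing import List, Tuple

# Hypothetical exemplar layout: (question_block, reasoning, answer_letter).
Exemplar = Tuple[str, str, str]


def build_cot_prompt(question_block: str, exemplars: List[Exemplar]) -> str:
    """Build a k-shot CoT prompt, where k = len(exemplars).

    Each exemplar shows a question, intermediate reasoning, and a final
    letter. The target question ends with an open reasoning cue so the
    model writes its own reasoning before answering.
    """
    parts = []
    for ex_question, ex_reasoning, ex_answer in exemplars:
        parts.append(
            f"{ex_question}\nReasoning: {ex_reasoning}\nAnswer: {ex_answer}"
        )
    parts.append(f"{question_block}\nReasoning:")
    return "\n\n".join(parts)
```

With an empty exemplar list the function returns only the target question followed by the reasoning cue, so the same builder covers the zero-exemplar case.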

Comparison to Prior Work

vs. Agri-LLaVA: AgroBench uses expert-verified human annotations rather than synthetic GPT-4 data
vs. AgroInstruct: AgroBench covers significantly more categories (682 diseases vs 74) and tasks (7 vs fewer)
vs. CDDM: AgroBench includes real-world tasks like Machine Usage and Traditional Management, not just disease diagnosis
vs. PlantWild [not cited in paper]: PlantWild focuses on weak supervision from web images; AgroBench focuses on expert-curated evaluation data

Reproducibility

Code: https://dahlian00.github.io/AgroBenchPage/

The dataset and code are publicly available at https://dahlian00.github.io/AgroBenchPage/. Images are sourced under Creative Commons licenses or used with permission. The exact prompts used for evaluation are provided in the paper's methodology section.

📊 Experiments & Results

Evaluation Setup

Multiple-choice Question Answering across 7 distinct agricultural tasks using accuracy as the primary metric.

Benchmarks:

AgroBench (Agricultural Visual Question Answering) [New]

Metrics:

Accuracy (Exact Match of answer choice)
Statistical methodology: Not explicitly reported in the paper
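
Exact-match accuracy is simple to compute per task and overall. The sketch below is one way to do it. It assumes the overall score is a micro-average over all items; the summary does not state whether the paper averages over items or over tasks, so treat that choice as an assumption. The record layout is also hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple

# Hypothetical record layout: (task, predicted_letter_or_None, gold_letter).
Record = Tuple[str, Optional[str], str]


def accuracy_by_task(records: Iterable[Record]) -> Dict[str, float]:
    """Exact-match accuracy in percent, per task plus an 'overall' entry.

    A prediction of None (no parseable letter) counts as incorrect.
    'overall' is a micro-average over items (an assumption, see above).
    """
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for task, predicted, gold in records:
        total[task] += 1
        if predicted == gold:
            correct[task] += 1
    scores = {task: 100.0 * correct[task] / total[task] for task in total}
    n_all = sum(total.values())
    if n_all:
        scores["overall"] = 100.0 * sum(correct.values()) / n_all
    return scores
```

A macro-average (the mean of the per-task scores) would weight every task equally regardless of size; the two can differ noticeably when tasks have very different item counts.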

Key Results

Overall performance comparison showing GPT-4o leading, with QwenVLM-72B as the best open-source model.

Benchmark	Metric	Baseline (%)	This Paper (%)	Δ (pts)
AgroBench (Overall)	Accuracy	19.11	79.26	+60.15
AgroBench (Overall)	Accuracy	19.11	72.45	+53.34

Task-specific performance highlights significant difficulty in Weed Identification compared to other tasks.

Benchmark	Metric	Baseline (%)	This Paper (%)	Δ (pts)
AgroBench (Weed Identification - WID)	Accuracy	17.90	55.17	+37.27
AgroBench (Disease Identification - DID)	Accuracy	21.77	64.18	+42.41
AgroBench (Text-Only Input)	Overall Accuracy	19.11	29.71	+10.60
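
The Δ column is the reported score minus the baseline, in percentage points. The baseline appears to be random chance, since the WID value matches the 17.90% random-chance figure quoted in the Evaluation Highlights. The short check below recomputes each Δ from the table values using exact decimal arithmetic. The row labels and function names are illustrative, and the two Overall rows are distinguished only by position because the table does not label them.

```python
from decimal import Decimal
from typing import List, Tuple

# (label, baseline %, reported %, reported delta), copied from the table.
ROWS: List[Tuple[str, str, str, str]] = [
    ("Overall, first row", "19.11", "79.26", "60.15"),
    ("Overall, second row", "19.11", "72.45", "53.34"),
    ("Weed Identification (WID)", "17.90", "55.17", "37.27"),
    ("Disease Identification (DID)", "21.77", "64.18", "42.41"),
    ("Text-Only Input", "19.11", "29.71", "10.60"),
]


def delta(baseline: str, score: str) -> Decimal:
    """Improvement over baseline in percentage points, computed exactly."""
    return Decimal(score) - Decimal(baseline)


def mismatched_rows(rows: List[Tuple[str, str, str, str]]) -> List[str]:
    """Return labels of rows whose reported delta differs from score - baseline."""
    return [label for label, base, score, reported in rows
            if delta(base, score) != Decimal(reported)]
```

`Decimal` is used instead of `float` so that two-decimal values subtract exactly and the comparison does not depend on binary rounding.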

Experiment Figures

Effect of Chain of Thought (CoT) prompting on accuracy across 7 tasks

Error analysis breakdown for GPT-4o across tasks

Main Takeaways

Closed-source models (GPT-4o, Gemini 1.5 Pro) generally outperform open-source models, but QwenVLM-72B is a competitive open alternative.
Weed Identification is the most challenging task, with many models performing near random chance, indicating a data gap in training corpora.
Models perform better on Management tasks (DMN, CMN) than Identification tasks (DID, PID), suggesting they are better at reasoning with general knowledge than fine-grained visual recognition.
Chain-of-Thought (CoT) prompting provides a small boost, particularly in difficult tasks like Weed Identification, but benefits plateau after one or two shots.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Vision-Language Models (VLMs) and Visual Question Answering (VQA)
Basic knowledge of agricultural taxonomy (crops, diseases, pests, weeds)
Familiarity with zero-shot and few-shot evaluation protocols

Key Terms

VLM: Vision-Language Model—an AI model capable of processing and understanding both image and text inputs to perform tasks like description or question answering

CoT: Chain of Thought—a prompting technique where the model is encouraged to generate intermediate reasoning steps before producing the final answer

DID: Disease Identification—task of diagnosing plant diseases from images

PID: Pest Identification—task of classifying insect pests affecting crops

WID: Weed Identification—task of identifying weed species, often within a specific bounding box

CMN: Crop Management—task involving decisions on irrigation, fertilization, and harvest timing based on visual crop status

DMN: Disease Management—task recommending treatments or interventions for identified plant diseases

MQA: Machine Usage QA—task regarding the correct operation and selection of agricultural machinery

TM: Traditional Management—task covering sustainable and traditional farming practices like agroforestry