Argus: Vision-Centric Reasoning with Grounded Chain-of-Thought

📝 Paper Summary

Multimodal Large Language Models (MLLMs) | Vision-Centric Reasoning | Visual Grounding

Argus improves multimodal reasoning by explicitly predicting relevant image regions (bounding boxes) as intermediate steps and re-processing those regions to focus the model's attention.

Core Problem

Current MLLMs struggle with vision-centric tasks requiring precise focus because they rely on implicit, global attention rather than explicit goal-directed visual search.

Why it matters:

Standard unified architectures process whole images indiscriminately, missing small or specific details needed for spatial relationships or property identification
Implicit attention mechanisms in LLMs lack the conscious top-down control seen in human cognitive visual intelligence (goal-directed attention)
Existing methods that use bounding boxes often keep them as text coordinates without actively re-engaging the visual features of those regions for better perception

Concrete Example: When asked about the spatial relationship between two specific objects in a crowded scene, a standard MLLM might process the entire image globally and hallucinate an answer. Argus first predicts the bounding boxes of the relevant objects (grounding), then re-samples visual tokens from those specific boxes to generate the final answer.

Key Novelty

Grounded Visual Chain-of-Thought (Visual CoT)

Intermediate Grounding Step: The model is trained to first output text-based bounding boxes ([xmin, ymin, xmax, ymax]) for regions relevant to the user's query before answering.
Visual Re-engagement: These boxes are used to explicitly fetch specific visual features—either by cropping and re-encoding or by re-sampling cached tokens—forcing the model to 'look closer' at the relevant areas.
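
The intermediate grounding step implies that box coordinates must be recovered from generated text before any visual re-engagement can happen. The following is a minimal sketch of that parsing step, not the authors' code. It assumes coordinates are emitted as plain numbers in the [xmin, ymin, xmax, ymax] format; the name parse_boxes and the example reply are hypothetical.

```python
import re

# One integer or decimal number, optionally negative.
_NUM = r"(-?\d+(?:\.\d+)?)"
# Matches "[xmin, ymin, xmax, ymax]" with flexible whitespace.
BOX_PATTERN = re.compile(r"\[\s*" + r"\s*,\s*".join([_NUM] * 4) + r"\s*\]")


def parse_boxes(text):
    """Return every well-formed (xmin, ymin, xmax, ymax) box found in text."""
    boxes = []
    for match in BOX_PATTERN.finditer(text):
        xmin, ymin, xmax, ymax = (float(g) for g in match.groups())
        # Skip degenerate boxes with zero or negative area.
        if xmax > xmin and ymax > ymin:
            boxes.append((xmin, ymin, xmax, ymax))
    return boxes


# Hypothetical first-pass output, assuming normalized coordinates.
reply = "The mug is at [0.12, 0.40, 0.35, 0.78] and the laptop at [0.5, 0.2, 0.9, 0.7]."
print(parse_boxes(reply))
```

Rejecting degenerate boxes at parse time is a design choice of this sketch; a real system might instead fall back to the whole image when no valid box is found.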

Architecture

The overall architecture of Argus, illustrating the two-pass inference process: initial encoding, RoI prediction (grounding), and visual re-engagement.

Evaluation Highlights

Achieves state-of-the-art results on MMVP (Vision-centric Perception) with a score of 62.7, surpassing proprietary Gemini 1.5 Pro (61.3).
Outperforms comparably sized open-source models (e.g., Eagle-X3-8B) on the V-Star benchmark (small object perception) by 4.1 points (54.6 vs 50.5).
Demonstrates strong dual capability: competitive on general reasoning while achieving high accuracy on referring grounding (85.24 on RefCOCO val), surpassing specialist models like Shikra.

Breakthrough Assessment

8/10

Strong conceptual advance by bridging grounding and reasoning via explicit visual re-engagement. Outperforms proprietary models on specific vision-centric benchmarks while maintaining generalist capabilities.

⚙️ Technical Details

Problem Definition

Setting: Multimodal Visual Question Answering and Referring Expression Grounding

Inputs: Image I and text prompt/question Q

Outputs: Text response A, potentially containing bounding box coordinates as intermediate reasoning steps

Pipeline Flow

Visual Encoding (MoVE) → Initial Visual Tokens
First-Pass LLM Processing → RoI Prediction
Visual Re-engagement (Re-sampling or Re-encoding) → RoI Tokens
Second-Pass LLM Processing → Final Answer
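
The four stages above can be sketched as a single control-flow function. This is only an illustration of the data flow under stated assumptions: every callable (encode, predict_roi, reengage, generate) is a hypothetical stand-in for a model component, and tokens are represented as plain Python lists rather than tensors.

```python
def grounded_cot_answer(image, question, encode, predict_roi, reengage, generate):
    """Run the four pipeline stages in order; all callables are hypothetical stand-ins."""
    visual_tokens = encode(image)                      # Visual encoding
    roi = predict_roi(visual_tokens, question)         # First pass: RoI prediction
    roi_tokens = reengage(image, visual_tokens, roi)   # Re-sampling or re-encoding
    # Second pass: the answer is conditioned on the original tokens plus the RoI tokens.
    return generate(visual_tokens + roi_tokens, question, roi)


# Toy stand-ins so the control flow can be executed end to end.
def toy_encode(image):
    return [f"tok{i}" for i in range(4)]


def toy_predict_roi(tokens, question):
    return (0.0, 0.0, 0.5, 0.5)


def toy_reengage(image, tokens, roi):
    return tokens[:1]  # pretend a single token falls inside the RoI


def toy_generate(context, question, roi):
    return f"answer from {len(context)} tokens, roi={roi}"


print(grounded_cot_answer(None, "Where is the mug?",
                          toy_encode, toy_predict_roi, toy_reengage, toy_generate))
```

Passing the stages in as callables keeps the sketch independent of any particular model library; swapping toy_reengage for a re-sampling or re-encoding function changes only one argument.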

System Modules

Visual Encoder Suite (MoVE)

Convert input image into visual tokens using multiple experts

Model or implementation: Mixture of CLIP (ViT-L/14), ConvNeXt-XXL-1024, and EVA-02-L/16

Region-of-Interest (RoI) Sampler

Predict bounding boxes for relevant image regions based on the text prompt

Model or implementation: Llama-3-8B (part of the main LLM)

Visual Context Re-engagement Module

Extract visual features specific to the predicted RoI to reinforce attention

Model or implementation: Re-sampling (token retrieval) OR Re-encoding (new pass through encoder)

Answer Generator

Generate the final text response using original context plus re-engaged visual tokens

Model or implementation: Llama-3-8B
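
For the Re-encoding option of the re-engagement module, the predicted box first has to be turned into a pixel crop window before the crop is resized and passed through the encoder again. A minimal sketch of that geometry follows. It assumes normalized box coordinates and a square crop, which are illustrative choices rather than details confirmed by the summary; square_crop_box and its margin parameter are hypothetical.

```python
def square_crop_box(box, width, height, margin=0.0):
    """Convert a normalized box into a pixel crop window (left, top, right, bottom).

    The window is a square centered on the box and clamped to the image bounds,
    so it can become non-square near the border and would then need padding.
    """
    xmin, ymin, xmax, ymax = box
    cx = (xmin + xmax) / 2 * width
    cy = (ymin + ymax) / 2 * height
    # Side length of the square: the longer box edge, optionally enlarged.
    side = max((xmax - xmin) * width, (ymax - ymin) * height) * (1 + margin)
    half = side / 2
    left, top = round(cx - half), round(cy - half)
    right, bottom = round(cx + half), round(cy + half)
    return max(0, left), max(0, top), min(width, right), min(height, bottom)


print(square_crop_box((0.25, 0.25, 0.75, 0.5), width=400, height=400))
```

The resulting window could be handed to any image library's crop routine; the resize and padding policy that follows is left open here because the summary does not specify it.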

Novel Architectural Elements

Explicit Visual Re-engagement Loop: A feedback mechanism where LLM outputs (boxes) are used to query the visual encoder/feature map again before final generation
Re-sampling Strategy for CoT: Using intersection-based retrieval of cached tokens to represent RoIs without computational overhead of re-encoding
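
Intersection-based retrieval can be illustrated by selecting the cells of the cached token grid that overlap the predicted box. The sketch below is an assumption-laden illustration, not the paper's implementation: it takes the 32x32 grid from the reported visual token count, assumes normalized box coordinates and row-major token order, and keeps any cell with non-zero overlap. The function name resample_token_indices is hypothetical.

```python
import math


def resample_token_indices(box, grid=32):
    """Return row-major indices of grid cells that overlap a normalized box."""
    xmin, ymin, xmax, ymax = box
    # Clamp the box into the unit square.
    xmin, ymin = max(0.0, xmin), max(0.0, ymin)
    xmax, ymax = min(1.0, xmax), min(1.0, ymax)
    if xmax <= xmin or ymax <= ymin:
        return []
    # Cell c spans [c/grid, (c+1)/grid); keep cells with positive overlap.
    col0 = min(math.floor(xmin * grid), grid - 1)
    row0 = min(math.floor(ymin * grid), grid - 1)
    col1 = min(math.ceil(xmax * grid), grid)  # exclusive upper bound
    row1 = min(math.ceil(ymax * grid), grid)  # exclusive upper bound
    return [r * grid + c for r in range(row0, row1) for c in range(col0, col1)]


indices = resample_token_indices((0.0, 0.0, 0.25, 0.25))
print(len(indices))  # a quarter-width, quarter-height box covers 8 x 8 cells
```

Because the indices point into tokens that were already computed, gathering them costs only a lookup, which matches the stated motivation of avoiding a second encoder pass.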

Modeling

Base Model: Llama-3-8B

Training Method: Two-stage training: Alignment/Pre-training then Supervised Fine-Tuning (SFT)

Adaptation: Full fine-tuning of Vision Encoders, MLP Projectors, and LLM Decoder during SFT

Trainable Parameters: Full parameter updates for MoVE encoder, Projectors, and LLM during SFT

Training Data:

Pre-training: LLaVA-595K
SFT: Eagle 1.8M (conversational), VCoT dataset (grounding/reasoning), GRIT (756K), Shikra (326K)

Key Hyperparameters:

learning_rate_pretraining: 1e-3
learning_rate_sft: 2e-5
batch_size: 256
optimizer: AdamW
visual_token_count: 1024 (32x32)

Compute: NVIDIA A100 GPUs (number and time not explicitly reported)

Comparison to Prior Work

vs. Eagle: Argus adds the explicit grounded CoT loop (predict box -> re-sample features -> answer), whereas Eagle processes the whole image in one pass.
vs. Shikra: Shikra treats grounding as an output task; Argus treats grounding as an intermediate reasoning step to improve QA performance.
vs. Ferret [not cited in paper]: Ferret inputs hybrid region features; Argus generates region features dynamically via internal reasoning steps.

Limitations

Re-encoding small regions is computationally expensive compared to re-sampling.
Reliance on text-based coordinate prediction might be less precise than specialized detection heads.
Performance depends heavily on the quality of the initial grounding (box prediction); if the box is wrong, the re-engagement might mislead the model.

Reproducibility

Code: https://yunzeman.github.io/argus/

Project page available at https://yunzeman.github.io/argus/. Code and models are promised, but the GitHub repository link in the paper is a placeholder. The VCoT dataset construction is described in detail.

📊 Experiments & Results

Evaluation Setup

Evaluation on diverse benchmarks covering general multimodal reasoning, vision-centric perception, and referring expression grounding.

Benchmarks:

MMVP (Vision-Centric Perception, Visual Patterns)
V-Star (Visual Search / Small Object Recognition)
RefCOCO/+/g (Referring Expression Grounding)
CV-Bench (2D/3D Vision-Centric Reasoning)
MMMU/MMBench/SEED (General Multimodal Reasoning)

Metrics:

Accuracy (%)
Acc@0.5 (for Grounding)
Statistical methodology: Not explicitly reported in the paper
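
Acc@0.5 for grounding is conventionally the percentage of predictions whose intersection-over-union (IoU) with the ground-truth box reaches 0.5. The sketch below shows that computation under the usual convention; whether the comparison is strict or inclusive varies between codebases and is an assumption here, as are the function names.

```python
def iou(a, b):
    """Intersection-over-union of two (xmin, ymin, xmax, ymax) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def acc_at_iou(preds, gts, threshold=0.5):
    """Percentage of predictions with IoU >= threshold (inclusive by assumption)."""
    hits = sum(iou(p, g) >= threshold for p, g in zip(preds, gts))
    return 100.0 * hits / len(gts)


preds = [(0, 0, 2, 2), (0, 0, 2, 2)]
gts = [(0, 0, 2, 2), (1, 1, 3, 3)]
print(acc_at_iou(preds, gts))  # first pair matches exactly, second overlaps only 1/7
```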

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Argus demonstrates superior performance on Vision-Centric Benchmarks compared to baselines of similar scale.
MMVP	Accuracy	56.0	62.7	+6.7
V-Star	Accuracy	50.5	54.6	+4.1
CV-Bench (2D)	Accuracy	78.4	81.9	+3.5
Argus performs competitively on general multimodal benchmarks.
MMMU (Val)	Accuracy	46.6	47.7	+1.1
Referring Expression Grounding results show Argus rivals specialist models.
RefCOCO (val)	Acc@0.5	82.63	85.24	+2.61
Ablation studies confirm the value of explicit visual engagement over implicit methods.
MMVP	Accuracy	54.7	58.7	+4.0

Experiment Figures

Visualization of attention maps comparing standard stimulus-driven attention vs. Argus's goal-directed attention.

Schematic comparison of Re-encoding vs. Re-sampling strategies for visual engagement.

Main Takeaways

Explicit Visual Re-engagement (Re-sampling/Re-encoding) consistently outperforms Implicit Guidance (just predicting boxes) across vision-centric tasks.
Re-sampling is generally superior to Re-encoding due to preserving positional context, except for tasks involving very small objects (V-Star) where Re-encoding (zooming/cropping) helps.
The 'Grounded CoT' approach allows a generalist MLLM to perform specialist-level grounding tasks while improving reasoning accuracy.
Argus bridges the gap between 'Stimulus-driven' (bottom-up) and 'Goal-directed' (top-down) attention in MLLM architectures.

📚 Prerequisite Knowledge

Prerequisites

Transformer architecture (Self-Attention)
Multimodal Large Language Models (MLLM) architecture (ViT + MLP + LLM)
Visual Grounding (Text-to-Box)
Chain-of-Thought (CoT) prompting

Key Terms

Visual CoT: Visual Chain-of-Thought—using intermediate visual outputs (like bounding boxes) as reasoning steps to guide the final answer

RoI: Region-of-Interest—a specific rectangular area within an image that is relevant to the current task

MoVE: Mixture-of-Vision-Experts—combining multiple vision encoders (e.g., CLIP, ConvNeXt, EVA-02) to capture different types of visual information

Re-sampling: Extracting and reusing existing visual tokens from the feature map corresponding to a bounding box, preserving positional context without re-running the encoder

Re-encoding: Cropping the image based on a bounding box, resizing/padding it, and passing it through the vision encoder again as a new image

Grounding: The task of linking textual concepts (e.g., 'the red ball') to specific spatial regions (bounding boxes) in an image

Stimulus-driven attention: Automatic bottom-up attention driven by salient objects in the image (represented by the initial image tokenization)

Goal-directed attention: Top-down conscious selection of attention driven by user intent (represented by the language-guided RoI selection)