Rex-Thinker: Grounded Object Referring via Chain-of-Thought Reasoning

📝 Paper Summary

Referring Expression Comprehension (REC)
Multimodal Large Language Models (MLLMs)

Rex-Thinker reformulates object referring as a retrieval task using explicit Chain-of-Thought reasoning to verify candidate objects step-by-step, significantly reducing hallucinations and improving interpretability.

Core Problem

Most referring expression models either regress coordinates directly or retrieve boxes implicitly. They expose no interpretable reasoning, and they tend to hallucinate: when no matching object exists, they still output a box instead of rejecting the expression.

Why it matters:

Current black-box models are unverifiable; users cannot trace why a specific box was selected
High hallucination rates in existing models reduce reliability in real-world applications where targets might be missing
Direct coordinate regression struggles with complex reasoning tasks that require checking attributes step-by-step

Concrete Example: When asked to locate 'the person wearing a blue shirt' in an image where no such person exists, standard models often force a prediction on a random person. A grounded model should inspect each person, find no match, and explicitly output a rejection.

Key Novelty

Planning-Action-Summarization CoT for Object Referring

Reformulates referring as a retrieval process: an external detector provides candidate boxes (hints), and the MLLM reasons about each one
Structured CoT: The model explicitly plans subgoals, acts by checking specific box hints against the text, and summarizes findings to select the answer
Two-stage training: Cold-start SFT on a new CoT dataset followed by GRPO (Group Relative Policy Optimization) to reinforce correct reasoning paths

Architecture

Figure: the inference pipeline of Rex-Thinker compared with previous methods, illustrating the Planning -> Action -> Summarization workflow.

Evaluation Highlights

+13.6 percentage points of accuracy on the HumanRef Reasoning subset compared to Chat-Rex-7B
Achieves 86.8% accuracy on HumanRef-Rejection subset, significantly outperforming baselines that struggle to abstain
Zero-shot generalization to RefCOCOg is strong (86.2% precision), and further fine-tuning with GRPO adds 0.7 percentage points (86.9%)

Breakthrough Assessment

8/10

Strong contribution in applying the 'thinking' paradigm (CoT + RL) to vision-language grounding. The construction of a dedicated CoT dataset and the demonstration of verifiable reasoning + rejection capability are significant steps forward.

⚙️ Technical Details

Problem Definition

Setting: Object Referring / Referring Expression Comprehension (REC) with potential target absence (rejection)

Inputs: Image I, Referring Expression x, Set of candidate bounding box hints B_cand

Outputs: Predicted bounding box subset B_pred (or empty set if no match) + Reasoning trace

Pipeline Flow

Object Detector (extracts candidate boxes)
Visual Prompting (overlays markers/IDs on image)
Rex-Thinker MLLM (generates reasoning trace)
Output Parsing (extracts final box selection)
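
The four stages above can be sketched as a plain function pipeline. This is a minimal illustration of the data flow only, not the authors' code: `detect_candidates`, `render_box_hints`, and `run_mllm` are hypothetical stand-ins (stubbed with fixed outputs) for the detector, the visual-prompting step, and the fine-tuned MLLM, and the `<answer>` tag holding a JSON list of box IDs is an assumed output format.

```python
import json
import re

Box = tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def detect_candidates(image: object, category: str) -> list[Box]:
    """Hypothetical stand-in for an open-vocabulary detector (fixed output)."""
    return [(10, 20, 110, 220), (150, 30, 260, 240)]


def render_box_hints(boxes: list[Box]) -> str:
    """Serialize candidates as numbered hints for the prompt (assumed format)."""
    return "\n".join(f"Box {i + 1}: {list(b)}" for i, b in enumerate(boxes))


def run_mllm(image: object, prompt: str) -> str:
    """Hypothetical stand-in for the fine-tuned MLLM; returns a canned trace."""
    return (
        "<think>Planning: check each person's shirt color. "
        "Action: Box 1 wears red; Box 2 wears blue. "
        "Summarization: only Box 2 matches.</think>"
        "<answer>[2]</answer>"
    )


def parse_selection(output: str, boxes: list[Box]) -> list[Box]:
    """Map the 1-based box IDs in the assumed <answer> tag back to boxes."""
    match = re.search(r"<answer>(.*?)</answer>", output, flags=re.DOTALL)
    if match is None:
        return []
    ids = json.loads(match.group(1))
    # Ignore IDs that do not correspond to a provided hint.
    return [boxes[i - 1] for i in ids if 1 <= i <= len(boxes)]


def refer(image: object, expression: str, category: str) -> list[Box]:
    """Detector -> box hints -> MLLM reasoning -> parsed selection."""
    boxes = detect_candidates(image, category)
    prompt = f"Hints:\n{render_box_hints(boxes)}\nExpression: {expression}"
    return parse_selection(run_mllm(image, prompt), boxes)


selected = refer(None, "the person wearing a blue shirt", "person")
```

An empty returned list plays the role of a rejection: the model found no candidate that matches the expression.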

System Modules

Object Detector

Propose all potential object candidates for the referred category

Model or implementation: Grounding DINO (implied/cited open-vocabulary detector)

Rex-Thinker

Perform step-by-step verification of candidates against the expression

Model or implementation: Qwen2.5-VL-7B (fine-tuned)

Novel Architectural Elements

Retrieval-based CoT formulation: The MLLM is explicitly restricted to selecting from input 'box hints' rather than regressing coordinates, enforcing groundedness.
Planning-Action-Summarization structure: A strict output format enforced during SFT that decomposes reasoning into checking specific box IDs.
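
One practical consequence of restricting the model to box hints is that a reasoning trace can be checked mechanically. The sketch below assumes a hypothetical trace style in which candidates are referred to as "Box N"; the function names and the four checks are illustrative and do not come from the paper.

```python
import re


def mentioned_box_ids(reasoning: str) -> set[int]:
    """Collect every 'Box N' ID mentioned in a trace (assumed phrasing)."""
    return {int(n) for n in re.findall(r"\bBox\s+(\d+)\b", reasoning)}


def check_grounded(reasoning: str, selected_ids: list[int], num_hints: int) -> dict[str, bool]:
    """Check a trace and its final selection against the provided hints."""
    valid = set(range(1, num_hints + 1))
    mentioned = mentioned_box_ids(reasoning)
    selected = set(selected_ids)
    return {
        # Every box the trace talks about must be one of the provided hints.
        "mentions_valid": mentioned <= valid,
        # The final selection must be a subset of the hints.
        "selection_valid": selected <= valid,
        # Each selected box should have been examined in the trace.
        "selection_examined": selected <= mentioned,
        # Stricter check: the trace inspected every candidate.
        "all_candidates_checked": valid <= mentioned,
    }


trace = (
    "Planning: find people in blue shirts. "
    "Action: Box 1 wears red. Box 2 wears blue. Box 3 wears green. "
    "Summarization: Box 2 is the only match."
)
report = check_grounded(trace, selected_ids=[2], num_hints=3)
```

A direct coordinate-regression model offers no comparable check, because its output is not tied to a finite set of candidates.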

Modeling

Base Model: Qwen2.5-VL-7B

Training Method: Two-stage: (1) Cold-start SFT, (2) GRPO-based Reinforcement Learning

Objective Functions:

Purpose: Supervised training to learn CoT format.
Formally: Standard cross-entropy loss on reasoning tokens and answer tokens.

Purpose: Reinforcement learning to optimize accuracy and formatting.
Formally: GRPO objective maximizing a reward function R = λ * R_F1 + (1 - λ) * R_fmt, where R_F1 rewards correct box selection (IoU = 1 with hints) and R_fmt rewards correct tag structure.
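
The reward combination can be illustrated with a small sketch. Several details here are assumptions rather than facts from the paper: boxes are compared by hint ID (so a match is exact), a correct rejection (both sets empty) is scored as F1 = 1, and the format check looks for a `<think>...</think><answer>...</answer>` layout. Only the weighting R = λ * R_F1 + (1 - λ) * R_fmt with λ = 0.9 is taken from the summary.

```python
import re

LAMBDA = 0.9  # reward weight reported in the summary


def f1_reward(pred_ids: set[int], gt_ids: set[int]) -> float:
    """F1 between predicted and ground-truth box IDs."""
    if not pred_ids and not gt_ids:
        return 1.0  # assumption: a correct rejection earns full credit
    true_positives = len(pred_ids & gt_ids)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred_ids)
    recall = true_positives / len(gt_ids)
    return 2 * precision * recall / (precision + recall)


def format_reward(output: str) -> float:
    """1.0 if the output follows the assumed think/answer tag layout."""
    pattern = r"\s*<think>.*?</think>\s*<answer>.*?</answer>\s*"
    return 1.0 if re.fullmatch(pattern, output, flags=re.DOTALL) else 0.0


def total_reward(output: str, pred_ids: set[int], gt_ids: set[int], lam: float = LAMBDA) -> float:
    """Weighted sum R = lam * R_F1 + (1 - lam) * R_fmt."""
    return lam * f1_reward(pred_ids, gt_ids) + (1 - lam) * format_reward(output)


well_formed = "<think>Box 2 matches.</think><answer>[2]</answer>"
reward_correct = total_reward(well_formed, {2}, {2})
reward_overselect = total_reward(well_formed, {1, 2}, {2})
reward_bad_format = total_reward("[2]", {2}, {2})
```

With λ = 0.9 the format term contributes at most 0.1, so a well-formatted but wrong answer scores far below a correct one.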

Training Data:

HumanRef-CoT: 90,824 samples generated by prompting GPT-4o with HumanRef dataset images + box annotations + Set-of-Mark visual prompts.

Key Hyperparameters:

lambda (reward weight): 0.9
GRPO_epsilon: Not explicitly reported in the paper
learning_rate: Not explicitly reported in the paper

Compute: Not reported in the paper

Comparison to Prior Work

vs. Chat-Rex: Rex-Thinker adds explicit step-by-step CoT reasoning, improving interpretability and rejection performance
vs. Qwen2-VL/Shikra: Rex-Thinker uses a retrieval-based approach with 'box hints' rather than direct coordinate regression, making it more verifiable
vs. DeepSeek-R1 [not cited in paper]: Rex-Thinker adapts the 'thinking' process specifically for visual grounding by incorporating box hints into the reasoning chain

Limitations

Dependency on external object detector quality; if the detector misses the object, the model cannot recover.
Inference latency is likely higher due to the generation of long reasoning traces.
Requires category-specific candidate extraction, which might be ambiguous for some expressions.

Reproducibility

Code: https://github.com/IDEA-Research/Rex-Thinker

The HumanRef-CoT dataset construction method is described (GPT-4o prompting on HumanRef), and the base model is Qwen2.5-VL-7B. GRPO hyperparameters (epsilon, learning rate, batch size) are not detailed in the text.

📊 Experiments & Results

Evaluation Setup

Object Referring (REC) on human-centric and general object datasets.

Benchmarks:

HumanRef (human-centric referring expression comprehension, includes rejection cases)
RefCOCOg (General Object Referring Expression Comprehension)

Metrics:

Accuracy (implied to be Precision@0.5: a prediction counts as correct when its IoU with the ground truth exceeds 0.5)
Rejection Accuracy (True Negative Rate)
Statistical methodology: Not explicitly reported in the paper
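
The two metrics can be sketched as follows. This is a simplified single-target version under the IoU > 0.5 convention stated above; the benchmarks' official evaluation scripts may differ (for example, in how multi-instance expressions are scored), and the function names are illustrative.

```python
Box = tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def precision_at_iou(preds: list[Box], gts: list[Box], threshold: float = 0.5) -> float:
    """Fraction of samples whose predicted box overlaps the target above threshold."""
    hits = sum(iou(p, g) > threshold for p, g in zip(preds, gts))
    return hits / len(gts)


def rejection_accuracy(predictions: list[list[Box]]) -> float:
    """True negative rate: share of no-target samples answered with no box."""
    return sum(len(p) == 0 for p in predictions) / len(predictions)


same = iou((0, 0, 10, 10), (0, 0, 10, 10))
shifted = iou((0, 0, 10, 10), (5, 0, 15, 10))
acc = precision_at_iou(
    [(0, 0, 10, 10), (0, 0, 10, 10)],
    [(0, 0, 10, 10), (5, 0, 15, 10)],
)
rej = rejection_accuracy([[], [(0, 0, 1, 1)], [], []])
```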

Key Results

Performance on HumanRef dataset showing superiority of CoT reasoning over direct prediction baselines, especially in reasoning-heavy and rejection scenarios.

Benchmark	Metric	Baseline	This Paper	Δ
HumanRef (Reasoning subset)	Accuracy	73.2	86.8	+13.6
HumanRef (Rejection subset)	Accuracy	76.4	86.8	+10.4
HumanRef (Overall)	Accuracy	77.5	88.1	+10.6

Generalization experiments on RefCOCOg showing that the model trained on HumanRef-CoT transfers well to general objects.

Benchmark	Metric	Baseline	This Paper	Δ
RefCOCOg (val)	Accuracy	86.2	86.9	+0.7

Experiment Figures

Illustration of the retrieval-based formulation. It shows how an image and query are processed via a detector to produce 'Box Hints', which are then fed into the MLLM along with the original inputs.

Main Takeaways

Chain-of-Thought reasoning significantly improves performance on complex queries (Attribute, Reasoning, Interaction subsets) compared to direct prediction.
The retrieval-based formulation with explicit box hints enables the model to effectively reject inputs where no candidate matches, addressing the hallucination problem.
GRPO-based reinforcement learning further refines the model's ability to reason and generalize beyond the initial supervised fine-tuning data.
The approach is verifiable: users can read the <think> block to see exactly which candidate box was checked and why it was accepted or rejected.

📚 Prerequisite Knowledge

Prerequisites

Familiarity with Multimodal Large Language Models (MLLMs)
Understanding of Object Detection and Grounding
Basic knowledge of Reinforcement Learning (PPO/GRPO)

Key Terms

REC: Referring Expression Comprehension—locating objects in an image based on a natural language description

CoT: Chain-of-Thought—a prompting or training technique where models generate intermediate reasoning steps before the final answer

GRPO: Group Relative Policy Optimization—a reinforcement learning algorithm that optimizes policies by comparing a group of outputs generated for the same input, removing the need for a separate critic model

SFT: Supervised Fine-Tuning—training a model on labeled examples (here, reasoning traces) to establish a baseline capability

IoU: Intersection over Union—a metric measuring the overlap between a predicted bounding box and the ground truth box

Hallucination: In this context, when a model predicts an object exists and outputs a box for it, even though the object described does not exist in the image

Box Hints: Pre-detected bounding boxes provided to the model as visual prompts (e.g., with numbered markers) to ground the reasoning process

Cold Start: The initial SFT phase used to teach the model the desired output format before applying reinforcement learning
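
To make the GRPO definition above concrete, the sketch below computes group-relative advantages: each sampled output's reward is normalized against the mean and standard deviation of the rewards in its own group, which is what removes the need for a critic. This is the commonly described form of GRPO, not code from the paper; the use of the population standard deviation and the `eps` stabilizer are assumptions.

```python
import statistics


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each reward against the mean and spread of its own group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # assumption: population std over the group
    # eps keeps the division finite when all rewards in the group are equal.
    return [(r - mean) / (std + eps) for r in rewards]


# Rewards for four outputs sampled from the same image and expression.
group_rewards = [1.0, 0.1, 0.1, 1.0]
advantages = group_relative_advantages(group_rewards)
```

Outputs that beat their group's average receive positive advantages and are reinforced; when every output in a group earns the same reward, all advantages are zero and that group contributes no learning signal.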