Reinforcing Video Reasoning with Focused Thinking

📝 Paper Summary

Multimodal Large Language Models (MLLMs) · Reinforcement Learning (RL) for Reasoning · Video Question Answering

TW-GRPO improves video reasoning by weighting tokens based on information entropy to focus thinking and using multi-choice soft rewards to distinguish partial correctness.

Core Problem

Current RL-based multimodal reasoning produces verbose, unfocused chains of thought and relies on sparse binary rewards that fail to credit partially correct answers.

Why it matters:

Verbose reasoning often obscures critical spatio-temporal cues, leading to inefficiency and hallucination (overthinking)
Binary rewards (0 or 1) cause high variance during training because they cannot distinguish between a 'close' answer and a completely wrong one, hindering stable policy updates

Concrete Example: In a video QA task, if the ground truth is {B, D} and the model predicts {B}, standard methods give 0 reward (incorrect). TW-GRPO gives a partial reward (0.5), acknowledging the correct component.

Key Novelty

Token-Weighted Group Relative Policy Optimization (TW-GRPO)

Uses intra-group information entropy to identify and upweight 'informative' tokens (those where candidate responses diverge) while downweighting generic filler phrases like 'Let's think'
Reformulates single-choice QA into multi-choice tasks with 'soft rewards' based on set overlap (IoU-like), allowing the model to learn from partially correct predictions

Architecture

Overview of the TW-GRPO framework illustrating the flow from policy sampling to loss computation.

Evaluation Highlights

Achieves 50.4% accuracy on CLEVRER, outperforming the Video-R1 baseline by +18.8%
Surpasses Video-R1 by +1.6% on MMVU and +1.8% on NExT-GQA benchmarks
Significantly reduces reward variance during training compared to standard single-choice binary reward baselines

Breakthrough Assessment

7/10

Significant improvement on a complex reasoning benchmark (CLEVRER). Soft rewards for QA and entropy-based token weighting are clever, methodologically sound refinements of GRPO.

⚙️ Technical Details

Problem Definition

Setting: Video Question Answering (Video-QA) formulated as a multi-choice classification problem

Inputs: Video content v and question q

Outputs: Reasoning chain followed by a selected set of answer options P

Pipeline Flow

Policy Sampling (Generate G responses)
Token Weighting (Calculate entropy-based weights)
Reward Calculation (Compute multi-level soft rewards)
Policy Update (GRPO with weighted loss)
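
The policy-update step builds on GRPO's group-relative advantage, which standardizes each response's reward against the other responses sampled for the same input. The following is a minimal sketch of that standardization, not the authors' implementation; the function name and the `eps` stabilizer are assumptions added for illustration.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of sampled responses.

    `eps` is a hypothetical stabilizer that avoids division by zero
    when every response in the group receives the same reward.
    """
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Illustrative soft rewards for a group of four sampled responses.
adv = group_relative_advantages([1.0, 0.5, 0.5, 0.0])
print(adv)
```

Responses that score above the group mean receive positive advantages and those below receive negative ones, so no separate critic model is needed.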

System Modules

Policy Model

Generates G candidate reasoning chains and answers for the input video/question

Model or implementation: Video-LLaMA2-7B (implied from context of baselines/tasks)

Token Weighting Mechanism

Calculates an importance weight w_t for each token position from the KL divergence between a response's token distribution at that position and the mean distribution across the sampled group

Model or implementation: Mathematical function (Entropy/KL calculation)
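
A rough sketch of how such weights could be computed is shown below. It rests on several assumptions that the summary does not confirm: token distributions are aligned by position across responses, min-max normalization is taken over the whole group, and `alpha` scales the normalized divergence on top of a base weight of 1. The function name, the `(G, T, V)` array layout, and `eps` are hypothetical.

```python
import numpy as np

def token_weights(probs, alpha=1.0, eps=1e-12):
    """Weight token positions by divergence from the group-mean distribution.

    probs: array of shape (G, T, V) holding, for each of G responses and
           T positions, a probability distribution over V vocabulary items.
    Returns an array of shape (G, T) with weights in [1, 1 + alpha].
    """
    p = np.asarray(probs, dtype=np.float64)
    mean_p = p.mean(axis=0, keepdims=True)  # (1, T, V) group-mean distribution
    # KL(p_i || mean) at every response i and position t.
    kl = np.sum(p * (np.log(p + eps) - np.log(mean_p + eps)), axis=-1)
    span = kl.max() - kl.min()
    # Min-max normalize; fall back to zeros if all divergences are equal.
    norm = (kl - kl.min()) / span if span > 0 else np.zeros_like(kl)
    # Assumption: alpha scales the normalized divergence above a base of 1.
    return 1.0 + alpha * norm

# Position 0: both responses agree (filler). Position 1: they diverge.
probs = [
    [[0.5, 0.5], [0.9, 0.1]],
    [[0.5, 0.5], [0.1, 0.9]],
]
w = token_weights(probs, alpha=1.0)
print(w)
```

In this toy case the position where the two responses disagree receives the larger weight, while the position where they agree keeps the base weight.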

Reward Model

Computes soft rewards based on set overlap between predicted answer set P and ground truth G

Model or implementation: Rule-based function
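
The rule can be sketched in a few lines. This follows the reward formula as stated in this summary (fraction of ground-truth options selected, zero if any wrong option is chosen); the function name and the error raised for an empty ground truth are assumptions made for illustration.

```python
def soft_reward(predicted, ground_truth):
    """Partial-credit reward for a multi-choice answer.

    Returns |P ∩ T| / |T| when every predicted option is correct
    (P is a subset of the ground-truth set T), and 0.0 otherwise.
    """
    pred, truth = set(predicted), set(ground_truth)
    if not truth:
        raise ValueError("ground truth must contain at least one option")
    if not pred <= truth:
        return 0.0  # any wrong option voids the reward
    return len(pred & truth) / len(truth)

print(soft_reward({"B"}, {"B", "D"}))       # one of two correct options chosen
print(soft_reward({"A", "B"}, {"B", "D"}))  # contains a wrong option
```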

Novel Architectural Elements

Integration of dynamic token-level importance weighting directly into the GRPO loss function
Question-Answer Inverse (QAI) module for on-the-fly conversion of single-choice tasks to multi-choice training samples

Modeling

Base Model: Video-LLaMA2

Training Method: Token-Weighted Group Relative Policy Optimization (TW-GRPO)

Objective Functions:

Purpose: Optimize policy to maximize expected reward while prioritizing informative tokens.

Formally: Loss includes a token weight w_t term in the GRPO objective: J = E [ 1/G sum (w_t * A_i * ratio - KL_penalty) ]

Purpose: Define token importance.

Formally: w_t = 1 + alpha * min-max-norm( D_KL( P(o_t) || E[P(o_t)] ) )

Purpose: Define multi-level soft reward.

Formally: R = |P intersect G| / |G| if P subset G, else 0, where G here denotes the ground-truth answer set (not the group size used in the objective)
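
To make the weighted objective concrete, the sketch below evaluates it for one group, taking the summary's formula literally. It is not the authors' training code: the function name, the `beta` coefficient on the KL penalty, and the per-token averaging are assumptions, and the ratio clipping that GRPO normally applies is omitted for brevity.

```python
import numpy as np

def tw_grpo_objective(ratio, advantages, weights, kl_penalty, beta=0.01):
    """Token-weighted surrogate objective for one group of responses.

    ratio:      (G, T) per-token probability ratios new_policy / old_policy
    advantages: (G,)   group-relative advantage of each response
    weights:    (G, T) token importance weights w_t
    kl_penalty: (G, T) per-token KL estimate against a reference policy
    beta:       hypothetical coefficient on the KL penalty
    """
    ratio = np.asarray(ratio, dtype=np.float64)
    adv = np.asarray(advantages, dtype=np.float64)[:, None]  # broadcast over tokens
    w = np.asarray(weights, dtype=np.float64)
    kl = np.asarray(kl_penalty, dtype=np.float64)
    per_token = w * adv * ratio - beta * kl
    # Average over tokens, then over the G responses in the group.
    return per_token.mean(axis=1).mean()

value = tw_grpo_objective(
    ratio=np.ones((2, 2)),
    advantages=[1.0, -1.0],
    weights=[[1.0, 2.0], [1.0, 1.0]],
    kl_penalty=np.zeros((2, 2)),
)
print(value)
```

With uniform weights the two responses would cancel exactly; upweighting one token of the positively rewarded response tilts the objective toward it.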

Training Data:

Uses Question-Answer Inverse (QAI) to augment standard single-choice datasets (e.g., NExT-GQA) into multi-choice formats by negating questions and inverting answer sets
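
A toy sketch of the idea follows. It assumes that "inverting" an answer set means taking its complement within the option list, and it negates questions with a single hypothetical string rule ('did' to 'didn't'); the actual QAI procedure may differ. All function and field names are illustrative.

```python
import re

# Hypothetical negation rule; the lookahead skips words already negated.
_DID = re.compile(r"\bdid\b(?!')")

def negate_question(question):
    """Negate the first 'did' in the question; return None if absent."""
    match = _DID.search(question)
    if match is None:
        return None
    return question[:match.start()] + "didn't" + question[match.end():]

def question_answer_inverse(question, options, answers):
    """Build a multi-answer sample from a single-choice one.

    Assumption: the inverted answer set is the complement of the
    original answers within the option list.
    """
    negated = negate_question(question)
    if negated is None:
        return None  # this toy rule cannot negate the question
    inverted = sorted(set(options) - set(answers))
    return {"question": negated, "options": list(options), "answers": inverted}

sample = question_answer_inverse(
    "Which object did the cube hit?", ["A", "B", "C", "D"], ["B"]
)
print(sample)
```

A single-choice item with answer B thus becomes a multi-choice item whose correct set is every other option, which is what makes partial-credit rewards applicable.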

Key Hyperparameters:

group_size_G: Not explicitly reported in the paper
learning_rate: Not explicitly reported in the paper
alpha: Hyperparameter controlling scaling of token importance (value not explicitly listed)

Compute: Not reported in the paper

Comparison to Prior Work

vs. Video-R1: TW-GRPO adds token-level weighting and multi-level rewards; Video-R1 uses sequence-level GRPO and binary rewards
vs. VideoChat-R1: TW-GRPO applies soft rewards to QA tasks via multi-choice reformulation, whereas VideoChat-R1 restricts soft rewards to grounding tasks
vs. LLaVA-OneVision [not cited in paper]: TW-GRPO focuses on RL-based post-training for reasoning, while LLaVA-OneVision focuses on architectural unification and SFT

Reproducibility

Code: https://github.com/longmalongma/TW-GRPO

The code is publicly available at the repository above. The paper describes the QAI augmentation logic and the reward formulas clearly. Specific hyperparameters such as learning rate, batch size, and alpha are not explicitly detailed in the text.

📊 Experiments & Results

Evaluation Setup

Video Question Answering on complex reasoning benchmarks

Benchmarks:

CLEVRER (Causal and counterfactual video reasoning)
NExT-GQA (Grounded video QA)
MMVU (Multi-discipline video understanding)

Metrics:

Accuracy (Acc)
Statistical methodology: Not explicitly reported in the paper

Key Results

State-of-the-art comparisons showing significant gains over the direct baseline Video-R1 and other recent models.

Benchmark	Metric	Video-R1 (baseline)	TW-GRPO (this paper)	Δ
NExT-GQA	Accuracy (%)	72.4	74.2	+1.8
MMVU	Accuracy (%)	64.2	65.8	+1.6

Experiment Figures

Comparison of reward standard deviation during training for Single-choice (binary), Multi-choice (binary), and Multi-choice (soft reward).

Main Takeaways

Token weighting effectively suppresses verbose, generic reasoning patterns (e.g., 'Let's think'), leading to more concise and task-focused outputs
Multi-level soft rewards significantly stabilize training by reducing reward variance compared to binary rewards, especially in multi-choice settings
Question-Answer Inverse (QAI) augmentation is critical for enabling multi-choice training on datasets that are natively single-choice
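
The variance claim can be illustrated on a toy group of sampled answers. The example below is only a sketch with made-up predictions; it does not reproduce the paper's training curves, and soft rewards are not guaranteed to lower the spread for every possible group.

```python
import statistics

def binary_reward(pred, truth):
    """1.0 only for an exact match of the answer set."""
    return 1.0 if set(pred) == set(truth) else 0.0

def soft_reward(pred, truth):
    """Fraction of correct options chosen; 0.0 if any wrong option appears."""
    pred, truth = set(pred), set(truth)
    return len(pred & truth) / len(truth) if pred <= truth else 0.0

truth = {"B", "D"}
# Hypothetical group of eight sampled predictions for one question.
group = [{"B", "D"}, {"B"}, {"D"}, {"A"}, {"B"}, {"B", "D"}, {"D"}, {"A", "B"}]

binary = [binary_reward(p, truth) for p in group]
soft = [soft_reward(p, truth) for p in group]

binary_std = statistics.pstdev(binary)  # about 0.433 for this group
soft_std = statistics.pstdev(soft)      # about 0.354 for this group
print(binary_std, soft_std)
```

Here the partially correct answers, which a binary reward lumps together with wrong ones, land between the extremes and pull the reward distribution closer to its mean.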

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (PPO, GRPO)
Multimodal Large Language Models (MLLMs)
Information Entropy / KL Divergence
Chain-of-Thought (CoT) Reasoning

Key Terms

GRPO: Group Relative Policy Optimization—an RL algorithm that estimates advantages by comparing a group of outputs generated from the same input, removing the need for a critic model

TW-GRPO: Token-Weighted Group Relative Policy Optimization—the proposed method that adds token weighting and soft rewards to GRPO

KL divergence: Kullback-Leibler divergence—a statistical distance measure used here to quantify how much a token distribution differs from the average, serving as a proxy for information density

QAI: Question-Answer Inverse—a data augmentation technique that negates questions (e.g., 'did' -> 'didn't') and inverts answers to create multi-answer samples from single-choice datasets

soft reward: A graded reward signal between 0 and 1 that grows with the fraction of correct options selected (an IoU-like set-overlap score), rather than a binary 0/1 signal

intra-group information entropy: A measure of uncertainty or variation among a group of generated responses at a specific token position; high variation implies the token is a critical decision point