
TreeAdv: Tree-Structured Advantage Redistribution for Group-Based RL

Lang Cao, Hui Ruan, Yongqian Li, Peng Chao, Wu Ning, Haonan Song, Renhong Chen, Yitong Li
Huawei Technologies Co., Ltd.
arXiv (2026)
RL Reasoning Benchmark

📝 Paper Summary

Reinforcement Learning for LLM Reasoning Alignment
TreeAdv improves reasoning efficiency by explicitly modeling rollouts as trees and assigning advantages to individual tokens based on their contribution to successful branches rather than sequence-level outcomes.
Core Problem
Standard group-based RL (GRPO) assigns a single scalar reward to an entire generated sequence, reinforcing verbose or redundant reasoning steps equally with useful ones.
Why it matters:
  • Sequence-level rewards fail to distinguish critical reasoning steps from irrelevant ones, leading to noisy optimization signals
  • Models develop a length bias, generating long, redundant chains of thought because verbose trajectories are rewarded just as highly as concise ones if the final answer is correct
Concrete Example: If a model generates a 1000-token chain where the first 900 tokens are a wandering detour and the final 100 solve the problem, GRPO rewards the detour tokens just as much as the solution tokens. TreeAdv instead assigns the detour lower value than more direct branches.
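To make the problem concrete, here is a minimal sketch of how group-based advantage estimation in GRPO broadcasts one scalar per rollout to every token. The function name and structure are illustrative, not the paper's implementation:

```python
# Sketch of GRPO-style sequence-level advantages (illustrative only).
# Every token in a rollout inherits the same scalar advantage, so
# detour tokens and solution tokens are reinforced identically.

def grpo_advantages(rewards, lengths):
    """rewards: one scalar per rollout in the group; lengths: token counts."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5 or 1.0  # fall back to 1.0 when all rewards are equal
    advs = [(r - mean) / std for r in rewards]
    # Broadcast: each token in rollout i receives advs[i], regardless of
    # whether that token contributed to the final answer.
    return [[a] * n for a, n in zip(advs, lengths)]

# Two rollouts solve the problem (reward 1), one fails (reward 0).
# The 1000-token verbose success gets the same per-token advantage
# as the 100-token concise one:
token_advs = grpo_advantages([1.0, 1.0, 0.0], lengths=[1000, 100, 400])
```

Note that `token_advs[0]` and `token_advs[1]` contain identical values, which is exactly the length bias described above: verbosity carries no penalty as long as the answer is correct.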
Key Novelty
Tree-Structured Advantage Redistribution
  • Constructs rollout trees by branching only at high-uncertainty tokens (high entropy) while sharing prefixes for low-uncertainty segments, reducing redundancy
  • Calculates token-level advantages by aggregating rewards from all leaf nodes (completed rollouts) that share a specific token, effectively performing Monte Carlo estimation on the tree topology
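The two bullets above can be sketched as a prefix tree over rollouts: each token is scored by the mean reward of the completed rollouts (leaves) that pass through it, minus the group mean. The `Node` structure and function names here are illustrative assumptions, not the paper's code:

```python
# Hedged sketch of tree-structured advantage redistribution: rollouts share
# low-entropy prefixes; branching happens at high-entropy tokens; a token's
# advantage is the mean leaf reward through it minus the group mean reward.

from dataclasses import dataclass, field

@dataclass
class Node:
    tokens: list                              # segment shared by all descendants
    children: list = field(default_factory=list)
    reward: float = None                      # set on leaves (completed rollouts)

def leaf_rewards(node):
    """Collect rewards of every completed rollout under this node."""
    if not node.children:
        return [node.reward]
    out = []
    for child in node.children:
        out.extend(leaf_rewards(child))
    return out

def token_advantages(root):
    """Monte Carlo estimate on the tree topology: for each token, average the
    rewards of all leaves sharing it, then center by the group mean."""
    all_rewards = leaf_rewards(root)
    group_mean = sum(all_rewards) / len(all_rewards)
    advs = {}
    def walk(node):
        rs = leaf_rewards(node)
        for tok in node.tokens:
            advs[tok] = sum(rs) / len(rs) - group_mean
        for child in node.children:
            walk(child)
    walk(root)
    return advs

# One branch at a high-entropy token: "detour" fails, "direct" succeeds.
root = Node(tokens=["<prefix>"], children=[
    Node(tokens=["detour"], reward=0.0),
    Node(tokens=["direct"], reward=1.0),
])
advs = token_advantages(root)
# Shared prefix: 0.0 (neutral); "detour": -0.5; "direct": +0.5
```

The shared prefix receives a neutral advantage because both a success and a failure flow through it, while the branch tokens are rewarded or penalized according to where they lead, which is the credit assignment GRPO's sequence-level scalar cannot express.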
Evaluation Highlights
  • Outperforms GRPO on Qwen3-8B-Inst average accuracy (61.99% vs 60.55%) across Olympiad-level benchmarks
  • Reduces generation length by ~23% (15,693 to 12,073 tokens) on Qwen3-8B-Inst while improving accuracy, mitigating the length bias of standard RL
  • +4-point accuracy gain on the OlymH benchmark (27% vs 23%) for TreeAdv-GRPO over baseline GRPO
Breakthrough Assessment
7/10
Offers a logical evolution of GRPO by integrating tree search concepts directly into the training objective. Significant efficiency gains (shorter outputs) are a strong practical benefit.