ResT: Reshaping Token-Level Policy Gradients for Tool-Use Large Language Models

📝 Paper Summary

RL-based Tool Use Tool-use post-training

ResT stabilizes reinforcement learning for tool-use agents by reshaping policy gradients based on token entropy, prioritizing structured tokens (names, parameters) early and reasoning tokens later via a curriculum.

Core Problem

Training tool-use agents with standard RL is inefficient because sparse outcome rewards cause high gradient variance, and rule-based rewards often neglect reasoning tokens.

Why it matters:

Current RL methods for tool use suffer from high variance due to the 'needle-in-a-haystack' nature of critical tokens (tool names/args) amidst general text
Uniformly treating all tokens dilutes reward signals, making credit assignment difficult for multi-turn tasks where reasoning is crucial but sparse
Existing systems are sample-inefficient and computationally expensive due to multi-turn rollouts

Concrete Example: In a tool-call task, a model might generate a long reasoning chain but fail slightly on a parameter name. Standard outcome rewards give a zero score, providing no signal on which part failed. ResT identifies low-entropy tokens (the parameter name) as critical and weights their gradient updates higher to fix the specific error.

Key Novelty

Entropy-Aware Token-Level Policy Gradient Reshaping (ResT)

Theoretically links lower token entropy to reduced policy-gradient variance, identifying structured tokens (tool names, parameters) as the most reliable reward carriers
Reshapes the standard policy gradient by weighting tokens based on their region-level entropy (e.g., higher weights for strict formats, lower for open reasoning)
Applies a curriculum that dynamically shifts weights: initially prioritizing format/syntax correctness, then gradually upweighting reasoning tokens as training progresses and their entropy stabilizes

Architecture

Conceptual overview of ResT. It shows the decomposition of multi-turn tool use into single turns, the entropy-based reshaping of policy gradients, and the curriculum that shifts focus from format to reasoning.

Evaluation Highlights

Outperforms prior SOTA method (ToolRL/GRPO) by up to 8.76% on the Berkeley Function Calling Leaderboard (BFCL)
Surpasses GPT-4o by 4.11% on single-turn tool-use tasks and 1.50% on multi-turn base tasks when fine-tuning a 4B parameter model (Qwen3-4B)
Curriculum-based reshaping improves performance by up to 4.86% compared to static reward weighting strategies

Breakthrough Assessment

8/10

Strong theoretical grounding for token-level weighting in RL, directly addressing the high-variance problem in tool use. Achieving GPT-4o level performance with a 4B model is a significant practical milestone.

⚙️ Technical Details

Problem Definition

Setting: Multi-turn tool-use modeled as a sequential decision process where a policy generates tokens for text and tool calls.

Inputs: User query x and conversation history

Outputs: Response sequence y consisting of reasoning text and structured tool invocations

Pipeline Flow

Decomposition: Split multi-turn dialogue into single-turn instances
Generation: Policy generates response (reasoning + tool calls)
Rewarding: Compute rule-based rewards (format, name, params)
Reshaping: Reweight gradients using region-level entropy maps
Update: Apply curriculum to shift focus from structure to reasoning

System Modules

Trajectory Decomposer

Converts multi-turn interaction logs into single-turn training samples to densify reward signals

Model or implementation: Algorithmic logic (SWiRL framework)

Policy Model

Generates tool calls and reasoning steps

Model or implementation: Qwen3-4B-2507 / Llama-3 (Base models)

Entropy-Aware Reweighter

Calculates token-specific weights based on average entropy of token regions (format, name, params, thought)

Model or implementation: Mathematical formula (Eq 9 & 15)

Novel Architectural Elements

Gradient reshaping mechanism using region-level entropy statistics to modulate RL updates per token
Curriculum-driven weight scheduler that synchronizes reasoning (CoT) and parameter weights based on training progress

Modeling

Base Model: Qwen3-4B-2507 (primary), also tested on Llama-3-8B-Instruct

Training Method: ResT (Reshaped Token-level Policy Gradients) on top of single-turn RL

Objective Functions:

Purpose: Maximize expected return with reduced variance via reshaped gradients.

Formally: ∇J(θ) ≈ Σ (w_t * A_t * ∇log π(y_t|x))
Purpose: Encourage format compliance and correctness.

Formally: Reward R = α * Format_Score + β * Correctness_Score (Tool Name + Params)

Adaptation: Full fine-tuning

Training Data:

Curated 4k mixed corpus: ToolACE (2k), Hammer (1k), XLAM (1k)
Multi-turn data decomposed into single-turn samples

Key Hyperparameters:

delta: Numerical stability constant (value not explicitly listed in text, just symbol)
nu_bar: Training progress indicator in (0,1)
epsilon: Gradient clipping parameter

Compute: Not reported in the paper

Comparison to Prior Work

vs. ToolRL: ResT uses token-level gradient reshaping instead of sequence-level rewards
vs. GRPO: ResT incorporates entropy-based reweighting and a curriculum to handle high-variance tokens differently
vs. DeepSeek-R1 [not cited in paper]: DeepSeek-R1 uses GRPO with rule-based rewards; ResT explicitly modifies the gradient estimator based on token entropy to reduce variance

Limitations

Relies on rule-based rewards which may not capture all nuances of open-ended tool use
Requires decomposing multi-turn data into single-turns, which might lose some long-horizon dependency information
Entropy calculation adds slight computational overhead during training

Reproducibility

Code availability is not provided. The paper describes the algorithm (Algo 1) and reward formulas in detail. Datasets (ToolACE, Hammer, XLAM) are public. Base models (Qwen, Llama) are public.

📊 Experiments & Results

Evaluation Setup

Tool-use capability assessment using standardized benchmarks

Benchmarks:

Berkeley Function Calling Leaderboard (BFCL) (Multi-step tool use / function calling)
API-Bank (Multi-turn tool invocation in dialogue)

Metrics:

Accuracy (AST evaluation)
Execution Correctness
Format Compliance
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
ResT achieves state-of-the-art results on the Berkeley Function Calling Leaderboard (BFCL), outperforming baselines and even larger proprietary models.
BFCL	Overall Accuracy	78.43	87.19	+8.76
BFCL	Single-turn Tool Use	86.08	90.19	+4.11
BFCL	Multi-turn Base	90.00	91.50	+1.50
Ablation studies confirm the effectiveness of the dynamic curriculum strategy.
BFCL	Overall Accuracy	82.33	87.19	+4.86

Experiment Figures

Entropy distribution across different token regions (Format, Name, Parameters, Thought) and how weights are assigned.

Main Takeaways

Entropy-based reweighting significantly reduces policy gradient variance, leading to more stable and effective training.
The curriculum strategy of moving from structure (low entropy) to reasoning (high entropy) aligns well with the learning dynamics of tool-use agents.
Decomposing multi-turn trajectories into single-turn training samples provides denser reward signals without sacrificing multi-turn capability.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (Policy Gradients)
Large Language Models (Tokenization, Logits)
Entropy (Shannon Entropy)

Key Terms

ResT: Reshaped Token-level policy gradients—the proposed method that reweights gradients based on token entropy

GRPO: Group Relative Policy Optimization—an RL algorithm that normalizes advantages within a group of samples to reduce variance

Policy Entropy: A measure of the randomness in the model's token predictions; lower entropy implies higher confidence/structure

BFCL: Berkeley Function Calling Leaderboard—a benchmark for evaluating LLMs' ability to call functions/tools correctly

Jaccard similarity: A metric used here to measure the overlap between predicted tool names/parameters and ground truth sets