TGPR: Tree-Guided Policy Refinement for Robust Self-Debugging of LLMs

📝 Paper Summary

LLM Self-Debugging, Iterative Refinement, Code Generation

TGPR improves code debugging by using a Thompson Sampling-guided tree search during training to generate diverse refinement trajectories, allowing the model to learn from both successes and failures without expensive search at test time.

Core Problem

Standard reinforcement learning for code refinement (like GRPO) suffers from inefficient exploration, often getting stuck in local optima because it relies solely on the policy's own limited sampling to find fixes.

Why it matters:

Single-pass code generation frequently fails on complex algorithms or subtle bugs, necessitating iterative repair strategies
Existing refinement methods use fixed heuristics or myopic RL that cannot effectively navigate the vast search space of possible code edits
Inefficient exploration leads to models that cannot fix subtle semantic errors or algorithmic flaws, limiting their utility in real-world software development

Concrete Example: When a model generates code with a subtle boundary error (e.g., off-by-one loop), a standard RL agent might try random syntax changes that fail or only fix the syntax without solving the logic. TGPR's tree search would explore a branch where the loop condition is modified, identify it as a high-reward path via Thompson Sampling, and use that trajectory to train the policy.

Key Novelty

Training-Time Tree-Guided Exploration

Integrates a Thompson Sampling-guided search tree into the GRPO training loop to actively manage exploration and exploitation of code refinements
Uses the tree search strictly as a data generation engine during training to create high-quality trajectories (including informative failures), allowing the final model to perform single-shot refinement at inference without the computational cost of the tree
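The training-time search described above can be sketched as a small Thompson Sampling loop over a tree of candidate programs. This is a minimal illustration under stated assumptions, not the paper's implementation: the names `Node`, `build_tree`, `refine`, and `reward_fn` are hypothetical, and the Beta prior and fractional-reward update rule are assumptions, since the summary does not give the exact posterior update.

```python
import random
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass(eq=False)
class Node:
    """One candidate program in the refinement tree (hypothetical structure)."""
    code: str
    reward: float                      # reward in [0, 1] from the reward function
    alpha: float                       # Beta posterior: pseudo-count of successes
    beta: float                        # Beta posterior: pseudo-count of failures
    parent: Optional["Node"] = field(default=None, repr=False)
    children: List["Node"] = field(default_factory=list)


def make_node(code: str, reward: float, parent: Optional[Node] = None) -> Node:
    # Assumption: a reward-informed prior, so promising programs start out favored.
    return Node(code=code, reward=reward, alpha=1.0 + reward,
                beta=1.0 + (1.0 - reward), parent=parent)


def select_node(nodes: List[Node], rng: random.Random) -> Node:
    """Thompson Sampling: draw one sample per Beta posterior, pick the largest."""
    return max(nodes, key=lambda n: rng.betavariate(n.alpha, n.beta))


def build_tree(root_code: str,
               refine: Callable[[str], str],
               reward_fn: Callable[[str], float],
               budget: int,
               seed: int = 0) -> List[Node]:
    """Expand the tree `budget` times and return every node, failures included."""
    rng = random.Random(seed)
    nodes = [make_node(root_code, reward_fn(root_code))]
    for _ in range(budget):
        parent = select_node(nodes, rng)
        child_code = refine(parent.code)          # one refinement step by the policy
        r = reward_fn(child_code)
        child = make_node(child_code, r, parent)
        parent.children.append(child)
        # Assumption: fractional Bernoulli-style update of the expanded node.
        parent.alpha += r
        parent.beta += 1.0 - r
        nodes.append(child)
    return nodes


def trajectory(node: Node) -> List[str]:
    """Root-to-node sequence of programs, usable as one training trajectory."""
    path = []
    while node is not None:
        path.append(node.code)
        node = node.parent
    return path[::-1]
```

Every node, including low-reward ones, yields a root-to-node trajectory, which is how a search like this can supply both successful and failed refinement paths as training data. At inference none of this code runs; only the trained policy is used.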

Architecture

The TGPR framework architecture showing the interaction between the Policy Model, Reward Model, and the Thompson Sampling-guided Tree during training.

Evaluation Highlights

+12.51 percentage points improvement in pass@10 on the APPS benchmark compared to the GRPO baseline
+4.2 percentage points improvement in pass@1 on MBPP compared to GRPO
Achieves the lowest error rates across all reported categories (Semantic, Algorithmic, Performance) compared to the PPO and GRPO baselines

Breakthrough Assessment

7/10

Strong empirical results (+12.51 pp pass@10 on APPS) and a principled approach to the exploration problem in RL-based code refinement. The idea of using tree search for training-data augmentation rather than inference is clever and efficient.

⚙️ Technical Details

Problem Definition

Setting: Markov Decision Process (MDP) for iterative code refinement

Inputs: State s, the current program together with the initial faulty version and feedback

Outputs: Refinement action a (sequence of tokens modifying the code)

Pipeline Flow

Input: Faulty Code + Feedback
Policy Model (Refinement)
Output: Refined Code
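The pipeline above amounts to a single call to the trained policy at inference time. The sketch below illustrates that flow only; `RefinementState`, `refine_once`, the prompt layout, and the `policy` callable are all hypothetical names introduced for illustration, and a real setup would wrap a fine-tuned language model behind `policy`.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class RefinementState:
    """MDP state: the faulty program plus the feedback it produced."""
    code: str
    feedback: str


def build_prompt(state: RefinementState) -> str:
    # Hypothetical prompt layout; the summary does not specify one.
    return (
        "### Code\n" + state.code + "\n"
        "### Feedback\n" + state.feedback + "\n"
        "### Fixed code\n"
    )


def refine_once(policy: Callable[[str], str], state: RefinementState) -> str:
    """Single-shot refinement: one policy call, no tree search."""
    return policy(build_prompt(state))
```

A stand-in `policy` such as `lambda prompt: "def add(a, b):\n    return a + b"` is enough to exercise the flow without loading a model.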

System Modules

Policy Model

Generate code repairs based on the current state (code + feedback)

Model or implementation: Qwen-7B (Fine-tuned)

Novel Architectural Elements

Integration of a Thompson Sampling-guided tree search specifically as a trajectory generator for the GRPO training loop (not an inference module)

Modeling

Base Model: Qwen-7B

Training Method: Group Relative Policy Optimization (GRPO) augmented with Thompson Sampling Tree Search

Objective Functions:

Purpose: Optimize policy to maximize expected reward.

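Formally: GRPO objective using group-normalized advantages A_{i,t} = (r_i - mean(r)) / std(r), where r_i is the reward of the i-th sampled output and the mean and standard deviation are taken over its group

Purpose: Guide exploration and provide the policy's reward signal.

Formally: R(rho) = alpha * CodeBLEU(rho, rho_c) + (1 - alpha) * (|T_p(rho)| / |T|), where rho is the candidate program, rho_c is the reference program it is compared against, T is the test suite, and T_p(rho) is the subset of tests that rho passes. The first term measures similarity to the reference; the second measures functional correctness.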
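The hybrid reward can be computed directly once a CodeBLEU score and test results are available. The sketch below takes the CodeBLEU score as a precomputed input rather than calling any particular library, and the default `alpha=0.5` is a placeholder assumption, since the summary does not report the value used.

```python
def hybrid_reward(codebleu_score: float,
                  tests_passed: int,
                  tests_total: int,
                  alpha: float = 0.5) -> float:
    """R = alpha * CodeBLEU + (1 - alpha) * (passed / total).

    codebleu_score is assumed to be precomputed and to lie in [0, 1].
    alpha=0.5 is a placeholder, not a value taken from the paper.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if tests_total <= 0:
        raise ValueError("tests_total must be positive")
    if not 0 <= tests_passed <= tests_total:
        raise ValueError("tests_passed must be between 0 and tests_total")
    pass_rate = tests_passed / tests_total
    return alpha * codebleu_score + (1.0 - alpha) * pass_rate
```

For example, a candidate with a CodeBLEU score of 0.8 that passes 3 of 4 tests gets 0.5 * 0.8 + 0.5 * 0.75 = 0.775 under this placeholder weighting.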

Key Hyperparameters:

learning_rate: 1e-6
batch_size: 32
rollout_batch_size: 8
parallel_environments: 256
epochs: 5
clip_epsilon_low: 0.2
clip_epsilon_high: 0.3
temperature_training: 1.0
temperature_evaluation: 0.6
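The two clip values listed above suggest an asymmetric clipping range. The summary does not say how they enter the loss, so the sketch below assumes the standard PPO-style clipped surrogate with separate lower and upper bounds on the probability ratio; the function name and signature are hypothetical.

```python
import math


def clipped_surrogate(logp_new: float,
                      logp_old: float,
                      advantage: float,
                      eps_low: float = 0.2,
                      eps_high: float = 0.3) -> float:
    """Per-token clipped objective with an asymmetric range [1 - eps_low, 1 + eps_high]."""
    ratio = math.exp(logp_new - logp_old)            # pi_new / pi_old
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    # Pessimistic bound: take the smaller of the unclipped and clipped terms.
    return min(ratio * advantage, clipped * advantage)
```

With a wider upper bound than lower bound, a token with positive advantage can raise its probability ratio up to 1.3 before the gradient is cut off, while one with negative advantage is cut off once the ratio falls to 0.8.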

Compute: Single server with A100 GPUs (80GB VRAM)

Comparison to Prior Work

vs. GRPO: TGPR uses tree search for data augmentation/exploration during training, whereas GRPO relies on on-policy sampling
vs. REx: TGPR uses Thompson Sampling for training-time trajectory generation to train a policy, while REx uses it for test-time inference guidance
vs. LeDex: TGPR integrates the structured search directly into the RL loop via GRPO rather than separate SFT/RL stages for explanations

Limitations

Computational cost of tree search during training is higher than standard GRPO due to maintaining the tree structure and multiple rollouts
Relies on a custom hybrid reward function (CodeBLEU + tests), which requires a reference solution or test suite availability
Evaluated primarily on Python code generation benchmarks; generalization to other languages or domains is untested
No statistical significance tests reported for the performance improvements

Reproducibility

Code availability is not stated in the text. Training relies on the Hugging Face ecosystem. Benchmarks (MBPP, APPS, CodeContests) are standard. Detailed hyperparameters (learning rate, batch size, clip parameters) are provided.

📊 Experiments & Results

Evaluation Setup

Iterative code refinement on standard programming benchmarks

Benchmarks:

MBPP (Python programming problems)
HumanEval (Python coding problems)
APPS (Complex algorithmic coding problems)

Metrics:

pass@1
pass@10

Statistical methodology: Not explicitly reported in the paper

Key Results

TGPR outperforms the GRPO baseline across all three benchmarks, with the most significant gains on the complex APPS dataset.

Benchmark	Metric	GRPO Baseline	TGPR (This Paper)	Δ (pp)
MBPP	pass@1	26.8	31.0	+4.2
MBPP	pass@10	52.1	56.3	+4.2
HumanEval	pass@1	22.4	25.1	+2.7
HumanEval	pass@10	43.6	49.8	+6.2
APPS	pass@1	15.1	18.9	+3.8
APPS	pass@10	34.2	46.71	+12.51

Experiment Figures

Distribution of error categories (Syntax, Semantic, Algorithmic, etc.) for Pre-trained LLM, GRPO, PPO, and TGPR.

Main Takeaways

TGPR consistently outperforms standard GRPO and PPO baselines across varying difficulty levels (MBPP to APPS)
The method is particularly effective for complex tasks (APPS), suggesting the tree search helps navigate difficult algorithmic search spaces
Error analysis shows TGPR reduces Algorithmic Design Flaws and Semantic Errors relative to the baselines, suggesting stronger reasoning about program logic
The approach effectively internalizes exploration strategies into the policy, as the tree search is removed at inference time

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (Policy Gradient methods)
Code Generation Benchmarks (MBPP, APPS)
Bayesian Probability (Beta distribution)

Key Terms

GRPO: Group Relative Policy Optimization—an RL algorithm that normalizes advantages within a group of sampled outputs to optimize the policy without a learned critic
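The group normalization in this definition is short enough to show directly. The sketch below is an illustration, not the paper's code: using the population standard deviation and adding a small `eps` to avoid division by zero are both assumptions.

```python
import statistics
from typing import List, Sequence


def group_normalized_advantages(rewards: Sequence[float],
                                eps: float = 1e-8) -> List[float]:
    """A_i = (r_i - mean(r)) / (std(r) + eps), computed within one group of samples."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)   # population std; an assumption
    return [(r - mean) / (std + eps) for r in rewards]
```

Because advantages are measured relative to the group, a group whose samples all receive the same reward yields all-zero advantages and therefore no learning signal, which is one reason diverse trajectories matter.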

Thompson Sampling: A strategy for the exploration-exploitation dilemma that maintains a probability distribution over each action's expected reward, draws one sample from each distribution, and chooses the action with the highest sample

CodeBLEU: A metric for code evaluation that considers syntactic and semantic similarity (data flow, structure) rather than just n-gram matching

pass@k: A metric measuring the probability that at least one of k code samples generated for a problem passes all unit tests
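A common way to estimate pass@k from n generated samples, of which c pass all tests, is the unbiased estimator 1 - C(n - c, k) / C(n, k). The summary does not say which estimator the paper uses, so the sketch below is offered as the widely used convention rather than a description of the paper's evaluation code.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples with c correct ones."""
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    if n - c < k:
        return 1.0                     # every size-k subset contains a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```

The per-problem estimates are then averaged over the benchmark to give the reported score.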

Beta distribution: A continuous probability distribution bounded between 0 and 1, often used in Bayesian inference to model the probability of success (used here for Thompson Sampling)
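The Beta distribution is convenient here because its parameters update by simple counting. The sketch below shows the standard Beta-Bernoulli update from a uniform Beta(1, 1) prior; the helper names are hypothetical, and how the paper maps its continuous rewards onto these counts is not stated in the summary.

```python
from typing import Tuple


def beta_update(alpha: float, beta: float, success: bool) -> Tuple[float, float]:
    """Add one observation to a Beta(alpha, beta) belief about a success probability."""
    return (alpha + 1.0, beta) if success else (alpha, beta + 1.0)


def beta_mean(alpha: float, beta: float) -> float:
    """Expected success probability under Beta(alpha, beta)."""
    return alpha / (alpha + beta)


# Start from a uniform prior and observe three successes and one failure.
a, b = 1.0, 1.0
for outcome in (True, True, False, True):
    a, b = beta_update(a, b, outcome)
```

After these four observations the belief is Beta(4, 2), whose mean is 4 / 6. Thompson Sampling draws from this distribution instead of using the mean, so rarely tried options with wide distributions still get selected occasionally.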

data augmentation: The process of artificially increasing the diversity and size of training data; here, the tree search generates diverse debugging paths for the model to learn from

AdamW: A stochastic optimization method that modifies the typical implementation of weight decay in Adam, decoupling it from the gradient update