Reflect, Retry, Reward: Self-Improving LLMs via Reinforcement Learning

📝 Paper Summary

Self-Correction
Reinforcement Learning (RL)
Post-training

Reflect, Retry, Reward improves LLM performance on verifiable tasks by using Group Relative Policy Optimization (GRPO) to train models to generate effective self-reflections upon failure, rewarding only the reflection tokens that lead to a successful retry.

Core Problem

LLMs often fail at verifiable tasks (math, coding). Standard self-reflection prompts are static and frequently ineffective, and supervised fine-tuning is not an option when no dataset of failure-correction examples exists.

Why it matters:

Models have blind spots where they fail despite having the necessary knowledge, and simply retrying without better guidance yields diminishing returns
Generating synthetic training data is impossible if even state-of-the-art models fail the task
Existing self-correction methods rely on high-quality prompts or external teachers, which do not scale or adapt to specific model weaknesses

Concrete Example: In the Countdown math game, a model might fail to reach a target number using a list of integers. A standard retry might repeat the same error. The proposed method prompts the model to reflect ('I used the wrong operation order'), and if this reflection leads to a correct equation in the next attempt, the system reinforces the generation of that specific reflection.

Key Novelty

Reflection-Targeted Reinforcement Learning (Reflect, Retry, Reward)

Treats self-reflection as a learnable policy rather than a fixed prompt: the model learns *how* to reflect on its own mistakes
Applies rewards strictly to the 'reflection' tokens (masking the answer tokens) via GRPO, ensuring the model optimizes the reasoning process that fixes errors rather than just memorizing answers
Utilizes a 'Dataset of Failures' for training, bootstrapping improvement solely from binary success/failure signals without human annotations or teacher models

Architecture

The Reflect, Retry, Reward workflow. It shows the path for a failed query: Failure -> Generate Reflection -> Retry with Reflection -> Success -> Reward Reflection Tokens.

Evaluation Highlights

+34.7 percentage points improvement on the Countdown math task for Qwen2.5-1.5B-Instruct (23.6% -> 58.3%)
+18.5 percentage points improvement on APIGen function calling for Llama-3.1-8B-Instruct (67.9% -> 86.4%)
Small fine-tuned models (e.g., Qwen2.5-7B) outperform vanilla models 10x their size (e.g., Qwen2.5-72B) on these specific verifiable tasks

Breakthrough Assessment

8/10

Significant performance gains on verifiable tasks using a clever, efficient RL setup that requires only binary feedback. The 'masking rewards to target reflection' is a strong methodological contribution.

⚙️ Technical Details

Problem Definition

Setting: Iterative task solving with binary verification and self-reflection

Inputs: User query q and (optional) tool definitions or constraints

Outputs: Correct response y (function call or math equation) after optional reflection

Pipeline Flow

Attempt 1 Generator (Initial Try)
Verifier (Binary Success Check)
Reflection Generator (If Fail)
Attempt 2 Generator (Retry with Context)
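
The four stages above can be sketched as a single control-flow function. This is a minimal illustration, not the authors' implementation: `reflect_retry`, `toy_generate`, and the prompt wording are hypothetical, and `generate` stands in for any call to the shared LLM.

```python
from typing import Callable, Dict


def reflect_retry(
    query: str,
    generate: Callable[[str], str],
    verify: Callable[[str, str], bool],
) -> Dict[str, object]:
    """Run attempt 1; on failure, generate a reflection and retry once."""
    first = generate(query)
    if verify(query, first):
        return {"answer": first, "reflection": None, "success": True, "attempts": 1}

    # Hypothetical prompt wording; the paper's templates are not reproduced here.
    reflection_prompt = (
        f"{query}\n\nPrevious attempt:\n{first}\n\n"
        "The attempt was incorrect. Reflect on what went wrong."
    )
    reflection = generate(reflection_prompt)

    # The retry sees the query, the failed attempt, and the reflection.
    retry_prompt = f"{reflection_prompt}\n\nReflection:\n{reflection}\n\nTry again."
    second = generate(retry_prompt)
    return {
        "answer": second,
        "reflection": reflection,
        "success": verify(query, second),
        "attempts": 2,
    }


def toy_generate(prompt: str) -> str:
    """Stand-in for an LLM: wrong first, then a reflection, then correct."""
    if "Reflection:" in prompt:
        return "4"
    if "Reflect on" in prompt:
        return "I miscounted; I should add the two numbers carefully."
    return "5"


episode = reflect_retry("What is 2 + 2?", toy_generate, lambda q, a: a == "4")
```

Only the second branch (failure followed by a reflection) produces text that the training method would reward.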

System Modules

Attempt 1 Generator (Inference)

Generate initial response to the user query

Model or implementation: Base LLM (e.g., Llama-3.1-8B)

Verifier

Check correctness of response using ground truth or execution

Model or implementation: Deterministic Code/Function

Reflection Generator (Inference)

Generate analysis of why Attempt 1 failed

Model or implementation: Shared Base LLM (Trainable)

Attempt 2 Generator (Inference)

Generate refined response using reflection

Model or implementation: Shared Base LLM (Trainable)
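
For the Verifier module, a deterministic Countdown check might look like the sketch below. The exact rules are assumptions for illustration: only `+ - * /` on the given integers, each number used no more often than it was provided, and an exact match with the target. The name `verify_countdown` is hypothetical.

```python
import ast
import operator
from collections import Counter
from fractions import Fraction

_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def _evaluate(node, used):
    """Evaluate a restricted arithmetic AST, recording every integer literal."""
    if (
        isinstance(node, ast.Constant)
        and isinstance(node.value, int)
        and not isinstance(node.value, bool)
    ):
        used.append(node.value)
        return Fraction(node.value)  # exact arithmetic, no float rounding
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        left = _evaluate(node.left, used)
        right = _evaluate(node.right, used)
        return _OPS[type(node.op)](left, right)
    raise ValueError("unsupported expression")


def verify_countdown(expression, numbers, target):
    """Return True if the expression is valid and evaluates to the target."""
    used = []
    try:
        tree = ast.parse(expression, mode="eval")
        value = _evaluate(tree.body, used)
    except (SyntaxError, ValueError, ZeroDivisionError):
        return False
    # Reject expressions that use a number more often than it was given.
    if Counter(used) - Counter(numbers):
        return False
    return value == target
```

Parsing with `ast` rather than calling `eval` keeps the verifier from executing arbitrary model output.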

Novel Architectural Elements

Conditional Reward Masking: In the RL update, advantages are calculated based on Attempt 2's success, but the gradient update is applied *only* to the Reflection tokens (Attempt 1 and Attempt 2 tokens are masked out)
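
The masking idea can be illustrated with a simplified, dependency-free sketch. It is not the full GRPO update: it omits the importance ratio, clipping, and KL term, and uses a plain policy-gradient surrogate. The function names and the use of the population standard deviation are assumptions for illustration.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Normalize each reward against the group sampled for the same query."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def masked_policy_loss(token_logprobs, reflection_mask, advantage):
    """Policy-gradient surrogate averaged over reflection tokens only.

    token_logprobs: log-probability of each token in the full sequence.
    reflection_mask: 1 for reflection tokens, 0 for attempt 1 / attempt 2 tokens.
    advantage: sequence-level advantage derived from the retry's outcome.
    """
    kept = [lp for lp, m in zip(token_logprobs, reflection_mask) if m]
    if not kept:
        return 0.0  # no reflection tokens, so nothing contributes
    return -advantage * sum(kept) / len(kept)


# Four sampled reflections for one failed query; reward 1 if the retry succeeded.
advantages = group_advantages([1, 0, 0, 1])

# Tokens 1 and 2 are the reflection; tokens 0 and 3 belong to the attempts.
loss = masked_policy_loss([-1.0, -2.0, -3.0, -4.0], [0, 1, 1, 0], advantage=1.0)
```

Because masked positions never enter the sum, the attempt tokens contribute nothing to the loss, which is the property the summary describes.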

Modeling

Base Model: Qwen2 (1.5B/7B), Llama-3.1 (8B), Phi-3.5-mini

Training Method: Group Relative Policy Optimization (GRPO)

Objective Functions:

Purpose: Maximize the probability of generating reflections that lead to successful second attempts.

Formally: GRPO objective (maximizing advantage of reflection tokens given group outcomes).

Purpose: Penalize deviation from the reference model.

Formally: KL divergence term.
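
The summary does not reproduce the formulas. One way to write the two terms together is the standard GRPO objective with an added reflection mask; the mask notation $m_{i,t}$ and the normalization by the number of reflection tokens are assumptions made here for illustration, not taken from the paper.

```latex
% Group-relative advantage for the i-th of G sampled reflections,
% where r_i \in \{0, 1\} is the outcome of the retry that follows it.
A_i = \frac{r_i - \operatorname{mean}(r_1, \dots, r_G)}{\operatorname{std}(r_1, \dots, r_G)}

% Per-token importance ratio against the sampling policy.
\rho_{i,t} = \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_{\text{old}}}(o_{i,t} \mid q, o_{i,<t})}

% Masked objective: m_{i,t} = 1 on reflection tokens and 0 elsewhere.
J(\theta) = \mathbb{E}\left[
  \frac{1}{G} \sum_{i=1}^{G} \frac{1}{\sum_t m_{i,t}} \sum_t m_{i,t}
  \Big(
    \min\big(\rho_{i,t} A_i,\ \operatorname{clip}(\rho_{i,t}, 1 - \epsilon, 1 + \epsilon)\, A_i\big)
    - \beta\, D_{\mathrm{KL}}\big[\pi_\theta \,\|\, \pi_{\text{ref}}\big]
  \Big)
\right]
```

Under this reading, the first term inside the parentheses is the reflection-reward objective and $\beta$ weights the KL penalty against the reference model.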

Adaptation: Full fine-tuning (implied by lack of LoRA mention)

Trainable Parameters: Full model (1.5B to 8B parameters)

Training Data:

Dataset of Failures: Generated by prompting models 64 times per query on training sets and keeping only failed instances.
APIGen failures (~25k unique queries)
Countdown failures (~15k unique problems)
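
The construction of the Dataset of Failures can be sketched as repeated sampling followed by a verifier filter. This is an assumed reading of the description above: `build_failure_dataset` and the record layout are hypothetical, and whether the paper keeps one or several failed responses per query is not stated in this summary.

```python
import itertools


def build_failure_dataset(queries, sample, verify, num_samples=64):
    """Sample each query repeatedly and keep only queries with failed responses."""
    dataset = []
    for query in queries:
        failed = []
        for _ in range(num_samples):
            response = sample(query)
            if not verify(query, response):
                failed.append(response)
        if failed:  # queries the model always solves are dropped
            dataset.append({"query": query, "failed_responses": failed})
    return dataset


# Toy stand-ins for the model and verifier (illustration only).
_answers = itertools.cycle(["4", "5"])
_truth = {"2+2": "4", "2+3": "5", "easy": None}


def toy_sample(query):
    return next(_answers)


def toy_verify(query, response):
    # "easy" always passes, so it should not appear in the dataset.
    return _truth[query] is None or response == _truth[query]


failures = build_failure_dataset(["2+2", "2+3", "easy"], toy_sample, toy_verify, num_samples=4)
```

Filtering this way concentrates training on queries where the model's first attempt is unreliable.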

Key Hyperparameters:

learning_rate: 5e-7
batch_size: 256 (effective)
kl_coefficient: 0.001
scheduler: Cosine annealing with 0.03 warmup ratio
training_steps: Up to 1,750
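
To make the scheduler entry concrete, the sketch below computes a learning rate per step under assumptions the summary does not confirm: linear warmup from zero, cosine decay to zero, and a schedule spanning the full 1,750 steps. The function name `lr_at_step` is hypothetical.

```python
import math


def lr_at_step(step, total_steps=1750, peak_lr=5e-7, warmup_ratio=0.03):
    """Linear warmup followed by cosine annealing (assumed schedule shape)."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Ramp linearly from 0 up to the peak learning rate.
        return peak_lr * step / max(1, warmup_steps)
    # Fraction of the post-warmup schedule completed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


schedule = [lr_at_step(s) for s in range(0, 1751, 250)]
```

With a peak of 5e-7, every step uses a very small update, which is consistent with the low KL coefficient listed above.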

Compute: 4 to 8 H100 GPUs

Comparison to Prior Work

vs. CoT: This method dynamically generates reflections *conditionally* on failure and optimizes the content of those reflections
vs. Reflexion: Does not require a stronger model (e.g., GPT-4) to generate critiques; bootstraps from the model itself
vs. STaR [not cited in paper]: STaR filters correct reasoning traces for SFT; this method uses RL (GRPO) on *failed* instances converted to success via reflection

Limitations

Requires a reliable binary verifier (oracle) to determine success/failure, which restricts applicability to verifiable tasks like math/code
Computational overhead during inference: whenever the first attempt fails, the generate-reflect-retry process requires two additional generations (the reflection and the retry)
Experiments limited to smaller models (up to 8B parameters) due to GRPO computational costs
Small models (0.5B - 1B) struggled to learn effective reflections, indicating a capability threshold

Reproducibility

No code release is reported. The paper describes the algorithm (a multi-step GRPO implementation extending TRL) and the prompt templates in detail. The datasets (APIGen, Countdown) are public.

📊 Experiments & Results

Evaluation Setup

Verify model output against ground truth (Function Calling) or mathematical validity (Countdown)

Benchmarks:

APIGen (Function Calling / Tool Use)
Countdown (Mathematical Equation Generation)

Metrics:

Pass Rate / Accuracy (Binary Success)
Statistical methodology: Not explicitly reported in the paper

Key Results

APIGen Function Calling results show significant improvement from the vanilla baseline (Step 1) to the proposed Reflect-Retry method (Step 2) after training.

Benchmark	Metric	Baseline (%)	This Paper (%)	Δ (pp)
APIGen	Accuracy	67.9	86.4	+18.5
APIGen	Accuracy	78.0	88.2	+10.2
APIGen	Accuracy	63.7	72.4	+8.7

Countdown Math Equation results demonstrate large gains, particularly for Qwen2.5 models.

Benchmark	Metric	Baseline (%)	This Paper (%)	Δ (pp)
Countdown	Accuracy	23.6	58.3	+34.7
Countdown	Accuracy	46.8	66.8	+20.0
Countdown	Accuracy	58.4	76.1	+17.7

Main Takeaways

Smaller models (1.5B-8B) trained with 'Reflect, Retry, Reward' can outperform vanilla models that are 10x larger on verifiable tasks.
The method is effective even when the model initially fails significantly (e.g., Qwen2.5-1.5B starting at 23.6% accuracy), suggesting strong self-correction potential.
Improvements are consistent across different architectures (Llama, Qwen, Phi) and different task types (Math, Function Calling).
Training on a 'Dataset of Failures' is an efficient way to focus the model on learning from its mistakes rather than reinforcing already-known behaviors.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning from Human Feedback (RLHF)
Group Relative Policy Optimization (GRPO)
Chain-of-Thought (CoT) prompting

Key Terms

GRPO: Group Relative Policy Optimization—an RL algorithm that estimates advantages by comparing outcomes of a group of outputs for the same input, removing the need for a separate critic network

Self-reflection: A meta-prompting strategy where the model analyzes its own previous output to identify errors before retrying

APIGen: A dataset for evaluating function calling capabilities, requiring models to generate correct API calls with valid parameters

Countdown: A math reasoning task where the model must use a set of numbers to reach a target value using basic arithmetic

Prefix Caching: An optimization technique in inference engines (like vLLM) that stores KV-caches of common prompt prefixes to speed up generation

Rejection Sampling: Generating multiple responses per query and filtering them by outcome. It is used here to create training data, keeping either the responses that fail (to train on failures) or the ones that succeed (to learn from)

Outcome-based RL: Reinforcement learning where the reward is determined solely by the final result (success/failure) rather than step-by-step supervision