Self-Refine: Iterative Refinement with Self-Feedback

📝 Paper Summary

Iterative refinement, Self-correction, Feedback generation

SELF-REFINE enables a single Large Language Model (LLM) to iteratively improve its own outputs by generating feedback on its drafts and then refining them, without requiring additional training or supervised data.

Core Problem

LLMs often produce suboptimal initial outputs for complex tasks (like code optimization or dialogue), and existing refinement methods typically require expensive supervised training data or separate reward models.

Why it matters:

Training separate refinement models requires large, domain-specific datasets which are often unavailable.
Reinforcement learning approaches (RLHF) rely on costly human annotations or reward models.
Models like GPT-4 have latent capabilities to correct errors but don't utilize them in standard one-pass generation.

Concrete Example: Given the message 'Send me the data ASAP', a standard one-pass model might output it unchanged. A self-refining model should recognize that the message is impolite, generate feedback ('potential impoliteness'), and rewrite it as 'Hi Ashley, could you please send me the data at your earliest convenience?'.

Key Novelty

Iterative Feedback-Refinement Loop using a Single LLM

Uses the *same* LLM to act as the Generator, Feedback Provider, and Refiner.
Alternates between two steps: (1) FEEDBACK (criticizing the current output) and (2) REFINE (improving the output based on that criticism).
Requires no gradient updates, fine-tuning, or external reward models; relies entirely on few-shot prompting.

Architecture

The iterative process of SELF-REFINE.

Evaluation Highlights

Outperforms base LLMs (GPT-3.5, GPT-4) across 7 diverse tasks, with an average absolute improvement of about 20%.
GPT-4 with SELF-REFINE improves Dialogue Response Generation by +49.2% (absolute) over base GPT-4.
Achieves +13.9% absolute improvement in Code Readability on the CodeNet dataset using GPT-3.5.

Breakthrough Assessment

8/10

Simple yet highly effective method that unlocks significant performance gains in SOTA models without training. Demonstrates that LLMs can self-correct via natural language feedback.

⚙️ Technical Details

Problem Definition

Setting: Iterative generation in which a model M maps an input x to an output y through intermediate feedback steps fb_t.

Inputs: Input sequence x (e.g., code snippet, dialogue history).

Outputs: Refined output sequence y_t.
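One way to write this setting down, as a sketch in the summary's own symbols. The prompt prefixes p_gen, p_fb, and p_refine and the concatenation operator are notation introduced here for illustration, standing in for the few-shot prompts; the block assumes the amsmath package.

```latex
\begin{align}
y_0     &= M\left(p_{\text{gen}} \,\|\, x\right) \\
fb_t    &= M\left(p_{\text{fb}} \,\|\, x \,\|\, y_t\right) \\
y_{t+1} &= M\left(p_{\text{refine}} \,\|\, x \,\|\, y_t \,\|\, fb_t\right)
\end{align}
```

The same M appears on every right-hand side: only the prompt prefix changes between the three roles.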

Pipeline Flow

Initial Generation: M generates y_0 from x
Feedback Loop: M generates feedback fb_t based on x and y_t
Refinement Loop: M generates y_(t+1) based on x, y_t, and fb_t
Stop Condition: Iterate until t reaches the iteration limit or a stop signal is produced
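The four steps above can be sketched as a short loop. This is a minimal illustration, not the authors' implementation: the `LLM` type, the prompt layout, the `STOP_TOKEN` string, and `toy_model` (an offline stand-in for a real model) are all assumptions made here so the example runs without an API.

```python
from typing import Callable, List, Tuple

# Hypothetical type: any function that maps a prompt string to a completion.
LLM = Callable[[str], str]

STOP_TOKEN = "NO_FURTHER_ISSUES"  # assumed stop signal, not taken from the source


def self_refine(model: LLM, x: str, max_iters: int = 4) -> Tuple[str, List[str]]:
    """Run generate -> (feedback -> refine)* using one model callable."""
    y = model(f"TASK: generate\nINPUT: {x}")  # y_0
    history = [y]
    for _ in range(max_iters):
        fb = model(f"TASK: feedback\nINPUT: {x}\nOUTPUT: {y}")  # fb_t
        if STOP_TOKEN in fb:  # stop signal raised by the feedback step
            break
        # y_{t+1} is conditioned on x, y_t, and fb_t
        y = model(f"TASK: refine\nINPUT: {x}\nOUTPUT: {y}\nFEEDBACK: {fb}")
        history.append(y)
    return y, history


def toy_model(prompt: str) -> str:
    """Deterministic stand-in for an LLM so the sketch runs offline."""
    if prompt.startswith("TASK: generate"):
        return "Send me the data ASAP"
    if prompt.startswith("TASK: feedback"):
        draft = prompt.split("OUTPUT: ", 1)[1]
        return STOP_TOKEN if "please" in draft else "potential impoliteness"
    return "Could you please send me the data at your earliest convenience?"


final, drafts = self_refine(toy_model, "Ask Ashley for the data")
```

With the toy model, the loop produces one refinement and then stops on the second feedback call, so `drafts` holds the initial draft and the polite rewrite.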

System Modules

Initial Generator

Produce the first draft output

Model or implementation: GPT-3.5, ChatGPT, or GPT-4 (same model used throughout)

Feedback Provider

Generate actionable, specific criticism of the current output

Model or implementation: Same base LLM as Generator

Refiner

Generate an improved output based on previous output and feedback

Model or implementation: Same base LLM as Generator

Novel Architectural Elements

Single-model self-feedback loop: The same frozen LLM is prompted to critique its own work and then fix it, essentially acting as a multi-agent system within one model context.
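Since the three modules share one frozen model, what distinguishes them is the prompt. The sketch below shows one possible way to assemble a role-specific few-shot prompt; the `FEW_SHOT` examples, the `=>` separator, and `build_prompt` are hypothetical and are not the prompts from the paper's appendix.

```python
from typing import Dict, List, Tuple

# Hypothetical few-shot examples per role; real prompts would be task-specific.
FEW_SHOT: Dict[str, List[Tuple[str, str]]] = {
    "generate": [
        ("Request: ask a colleague for data", "Send me the data ASAP"),
    ],
    "feedback": [
        ("Draft: Send me the data ASAP", "potential impoliteness"),
    ],
    "refine": [
        ("Draft: Send me the data ASAP | Feedback: potential impoliteness",
         "Hi Ashley, could you please send me the data at your earliest convenience?"),
    ],
}


def build_prompt(role: str, query: str) -> str:
    """Prefix the query with the role's few-shot examples (assumed layout)."""
    shots = "\n\n".join(f"{inp}\n=> {out}" for inp, out in FEW_SHOT[role])
    return f"{shots}\n\n{query}\n=>"


# The same query framed for each role; each prompt would go to the same model.
prompts = {role: build_prompt(role, "Draft: Where is my report?") for role in FEW_SHOT}
```

Keeping the examples in a single table keyed by role makes it explicit that switching roles changes only the prompt prefix, never the model or its weights.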

Modeling

Base Model: Evaluated on GPT-3.5 (text-davinci-003), ChatGPT (gpt-3.5-turbo), GPT-4, and Codex (code-davinci-002).

Training Method: Inference-time iterative prompting (no weights updated)

Compute: Inference only; requires multiple LLM calls per instance (iterative). Max 4 iterations used in experiments.
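As a rough illustration of the compute cost, the number of model calls can be bounded under a simple assumption: one call for the initial draft, then one feedback call and one refine call per iteration, with no early stop. The function name is hypothetical, and the count ignores that prompts grow longer as drafts and feedback are appended.

```python
def max_llm_calls(max_iters: int) -> int:
    """Upper bound on calls: 1 initial draft + (feedback + refine) per iteration."""
    if max_iters < 0:
        raise ValueError("max_iters must be non-negative")
    return 1 + 2 * max_iters


# Call budget for 0..4 iterations under the assumption above.
calls = [max_llm_calls(t) for t in range(5)]  # -> [1, 3, 5, 7, 9]
```

Under this assumption, the 4-iteration cap corresponds to at most 9 calls per instance, compared with 1 call for one-pass generation.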

Comparison to Prior Work

vs. Self-Correction: SELF-REFINE does not train a separate refiner; uses few-shot prompting on the same model.
vs. PEER: SELF-REFINE does not require supervised data of edit histories.
vs. Re3 [not cited in paper]: Re3 is specific to story generation; SELF-REFINE is task-agnostic.
vs. Reflexion [not cited in paper]: Reflexion relies on external evaluators/tests; SELF-REFINE generates its own feedback internally.

Limitations

Depends heavily on the base model's capability to understand instructions and generate feedback (failed with Vicuna-13B).
Inference cost increases linearly with the number of refinement iterations.
Performance gains are marginal for tasks like Math Reasoning where the model struggles to verify its own logic errors.
Susceptible to 'hallucinating' feedback or persisting in errors if the model cannot detect them.

Reproducibility

Code: https://selfrefine.info/

Code and data are available at https://selfrefine.info/. Prompts for all tasks are provided in the appendix. Because the method relies on closed-source API models (OpenAI), exact reproduction depends on API versioning.

📊 Experiments & Results

Evaluation Setup

Zero-shot or few-shot generation across 7 tasks

Benchmarks:

Dialogue Response Generation (Chat/Dialogue)
Code Optimization (Code Generation)
Code Readability Improvement (Code Refactoring)
Math Reasoning (GSM8K) (Reasoning)
Sentiment Reversal (Style Transfer)
Acronym Generation (Creative Generation) [New]
Constrained Generation (Hard-constraint Generation) [New]

Metrics:

Human Preference (%)
GPT-4 Preference (%)
Task-specific metrics (e.g., % solve rate, % optimized)
Statistical methodology: Confidence intervals reported in Appendix J.
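To make the preference metrics concrete, the sketch below turns a list of pairwise judgments into percentages. It is illustrative only: the labels `base`, `refined`, and `tie` and the function name are assumptions, and this summary does not specify the paper's exact annotation protocol.

```python
from collections import Counter
from typing import Dict, Iterable


def preference_rates(judgments: Iterable[str]) -> Dict[str, float]:
    """Percent of pairwise comparisons assigned to each label (assumed labels)."""
    counts = Counter(judgments)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no judgments")
    return {
        label: round(100.0 * counts[label] / total, 1)
        for label in ("base", "refined", "tie")
    }


# Ten hypothetical judgments: 6 prefer the refined output, 3 the base, 1 tie.
rates = preference_rates(["refined"] * 6 + ["base"] * 3 + ["tie"])
```

When ties are allowed, the two preference percentages need not sum to 100.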

Key Results

Results showing SELF-REFINE consistently improving over base GPT-4 performance across various tasks.

Benchmark	Metric	Baseline	This Paper	Δ
Dialogue Response	GPT-4 Preference	25.4	74.6	+49.2
Code Optimization	% programs optimized	27.3	36.0	+8.7
Code Readability	GPT-4 Preference	27.4	56.2	+28.8
Math Reasoning (GSM8K)	% solve rate	92.9	93.1	+0.2
Constrained Generation	Coverage %	15.0	45.0	+30.0

Results demonstrating SELF-REFINE's impact on GPT-3.5 (text-davinci-003).

Benchmark	Metric	Baseline	This Paper	Δ
Sentiment Reversal	Human Preference	8.8	30.4	+21.6
Acronym Generation	Human Preference	41.6	56.4	+14.8

Ablation study on the quality of feedback using ChatGPT.

Benchmark	Metric	Baseline	This Paper	Δ
Code Optimization	% programs optimized	26.0	27.5	+1.5
Sentiment Reversal	Preference	31.2	43.2	+12.0
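The Δ column is the absolute difference in percentage points between the two score columns. As a quick arithmetic check, the snippet below recomputes it from the values in the table above; the setting labels used as keys are shorthand introduced here.

```python
# (setting, benchmark, baseline, with SELF-REFINE), copied from the table above.
rows = [
    ("GPT-4", "Dialogue Response", 25.4, 74.6),
    ("GPT-4", "Code Optimization", 27.3, 36.0),
    ("GPT-4", "Code Readability", 27.4, 56.2),
    ("GPT-4", "Math Reasoning (GSM8K)", 92.9, 93.1),
    ("GPT-4", "Constrained Generation", 15.0, 45.0),
    ("GPT-3.5", "Sentiment Reversal", 8.8, 30.4),
    ("GPT-3.5", "Acronym Generation", 41.6, 56.4),
    ("ChatGPT ablation", "Code Optimization", 26.0, 27.5),
    ("ChatGPT ablation", "Sentiment Reversal", 31.2, 43.2),
]

# Absolute improvement in percentage points, rounded to one decimal place.
deltas = {
    (setting, task): round(after - before, 1)
    for setting, task, before, after in rows
}
```

Every recomputed value matches the Δ reported in the table, including the near-zero gain of 0.2 points on GSM8K.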

Experiment Figures

Score improvements per iteration for Code Optimization, Sentiment Reversal, and Constrained Generation.

Main Takeaways

Specific, actionable feedback is crucial; generic feedback ('make it better') yields significantly lower gains.
Performance improves with the number of iterations, though with diminishing returns (gains usually plateau after 3 iterations).
SELF-REFINE enables GPT-4 to correct its own outputs significantly in open-ended tasks (Dialogue, Constrained Generation).
Logic/Math tasks show minimal improvement because models struggle to identify their own reasoning errors, which are harder for them to detect than stylistic or constraint-based errors.

📚 Prerequisite Knowledge

Prerequisites

Large Language Models (LLMs) and prompting
In-context learning (Few-shot prompting)
Iterative refinement concepts

Key Terms

LLM: Large Language Model—a neural network trained on vast text data to generate human-like text.

Few-shot prompting: Providing a model with a few examples of a task (input-output pairs) in the prompt to guide its generation.

In-context learning: The ability of a model to learn a task from examples provided in the prompt without updating its weights.

Greedy decoding: A generation strategy where the model always chooses the most probable next token.

RLHF: Reinforcement Learning from Human Feedback—training a model using rewards derived from human preferences.

Vicuna-13B: An open-source chatbot model fine-tuned from LLaMA on user-shared conversations.

CODEX: A version of GPT-3 fine-tuned on code (code-davinci-002).

GPT-3.5: Refers specifically to text-davinci-003 or gpt-3.5-turbo in this paper.

GSM8K: A benchmark dataset of grade school math word problems.

CodeNet: A large-scale dataset for code tasks, used here for code readability.