On the Role of Feedback in Test-Time Scaling of Agentic AI Workflows

📝 Paper Summary

Inference-time alignment, Agentic workflow optimization

Iterative Agent Decoding (IAD) integrates scalar and textual feedback to iteratively refine agent outputs during inference, outperforming sampling-only methods like Best-of-N particularly under fixed compute budgets.

Core Problem

Agentic AI systems struggle with complex tasks (20-30% accuracy), and standard sampling methods such as Best-of-N (BoN) are compute-inefficient because each candidate is drawn independently and cannot build on past failures.

Why it matters:

Current state-of-the-art agents fail frequently on benchmarks like Sketch2Code and Text2SQL, limiting real-world utility.
Post-training methods (RLHF/SFT) are often inapplicable to black-box models or commercial APIs where internals are inaccessible.
Simple parallel sampling (BoN) spends extra compute to keep latency low but hits diminishing returns, because it does not learn from mistakes within the inference window.

Concrete Example: In a Text2SQL task, a model might generate a query with a syntax error. A standard sampler just generates a new, independent guess. IAD uses a verifier to identify the error, feeds that critique back into the prompt, and guides the model to fix the specific syntax issue in the next step.
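
The Text2SQL example above can be sketched with an execution-based verifier. This is a minimal illustration, not the paper's implementation: `verify_sql` and `build_retry_prompt` are hypothetical helpers, the critique wording is an assumption, and an in-memory SQLite database stands in for the benchmark database.

```python
import sqlite3

def verify_sql(conn, query):
    """Execute a candidate query and return (ok, critique). Hypothetical helper."""
    try:
        conn.execute(query).fetchall()
        return True, "Query executed without errors."
    except sqlite3.Error as exc:
        # Turn the database error into a critique the model can act on.
        return False, f"Query failed with: {exc}. Fix this error and keep the rest of the query."

def build_retry_prompt(question, failed_query, critique):
    """Fold the failed attempt and the critique back into the next prompt."""
    return (
        f"Question: {question}\n"
        f"Previous attempt: {failed_query}\n"
        f"Verifier feedback: {critique}\n"
        "Write a corrected SQL query."
    )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'Ada')")

bad_query = "SELEC name FROM users"  # deliberate typo in the keyword
ok, critique = verify_sql(conn, bad_query)
prompt = build_retry_prompt("List all user names.", bad_query, critique)
```

A sampling-only method would discard `critique` and draw a fresh query from the original prompt; here the next call receives `prompt`, which names the specific failure.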

Key Novelty

Iterative Agent Decoding (IAD)

Framework that sequentially samples, evaluates, and integrates feedback to refine outputs, explicitly conditioning the next generation on the best previous solution and specific critiques.
Converts scalar rewards (numerical scores) into structured textual prompts (e.g., 'Surpass the best response, avoid previous mistakes') to guide black-box models without access to gradients.
Systematically analyzes the trade-off between parallel sampling (BoN) and sequential feedback-driven refinement under fixed compute budgets.

Architecture

Illustration of the inference-time alignment loop comparing standard methods to IAD.

Evaluation Highlights

Achieves up to 10% absolute performance improvement over Best-of-N and other baselines on Sketch2Code, Text2SQL, Intercode, and WebShop under constrained budgets.
Demonstrates 4–8% gains specifically from feedback-guided refinement (isolating gains beyond sampling diversity) in Sketch2Code and Text2SQL.
Transforming scalar feedback into directional prompts yields 6–7% gains over baselines that simply append scores, which highlights the importance of feedback design.

Breakthrough Assessment

7/10

Provides a strong, systematic empirical study on inference-time scaling for agents, offering a practical framework (IAD) that works with black-box models. While the concept of feedback isn't new, the rigorous quantification of compute vs. accuracy trade-offs is valuable.

⚙️ Technical Details

Problem Definition

Setting: Black-box inference-time alignment, where responses drawn from a fixed reference policy π_0 are steered toward those of an optimal policy π* using a verifier R(x,y), without access to model weights or gradients.

Inputs: Input task x (e.g., natural language question, user sketch) and access to a verifier R.

Outputs: Optimized response y (e.g., SQL query, HTML code).

Pipeline Flow

Sampling (Generate candidate y_{t+1})
Evaluation (Compare y_{t+1} vs best-so-far y^t using Verifier R)
Feedback Integration (Convert evaluation to prompt instructions)
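
The three steps above can be sketched as a single loop. This is a toy illustration under stated assumptions: `iad`, `toy_generate`, `toy_verify`, and `toy_feedback` are hypothetical names, the "task" is guessing a hidden integer, and the generator is a seeded random stand-in for an LLM call (a real generator would place the best-so-far solution and the feedback text in its prompt).

```python
import random

def iad(task, generate, verify, format_feedback, budget):
    """Sequential sample -> evaluate -> feedback loop. Returns best solution and score trace."""
    best, best_score, feedback = None, float("-inf"), ""
    trace = []
    for _ in range(budget):
        candidate = generate(task, best, feedback)   # 1. Sampling
        score = verify(task, candidate)              # 2. Evaluation against best-so-far
        improved = score > best_score
        if improved:
            best, best_score = candidate, score
        # 3. Feedback integration: turn the comparison into prompt text
        feedback = format_feedback(best, best_score, candidate, score, improved)
        trace.append(best_score)
    return best, trace

rng = random.Random(0)  # fixed seed so the toy run is repeatable

def toy_generate(task, best, feedback):
    # Stand-in for an LLM call: first guess is random, later guesses perturb the best.
    if best is None:
        return rng.randint(0, 100)
    return best + rng.randint(-10, 10)

def toy_verify(task, candidate):
    # Scalar reward: closer to the hidden target is better (0 is perfect).
    return -abs(candidate - task["target"])

def toy_feedback(best, best_score, latest, latest_score, improved):
    if improved:
        return f"New best: {best} (score {best_score}). Try to surpass it."
    return (f"Best so far: {best} (score {best_score}). "
            f"Attempt {latest} scored {latest_score}; avoid repeating it.")

task = {"target": 42}
best, trace = iad(task, toy_generate, toy_verify, toy_feedback, budget=20)
```

Because the loop only replaces the best-so-far solution when the verifier score strictly improves, `trace` never decreases; the budget here counts generator calls, matching the API-call notion of compute used in the summary.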

System Modules

Generator

Samples new candidate solutions from the reference policy conditioned on input, previous best solution, and feedback.

Model or implementation: Gemini-1.5-Pro, Gemini-1.5-Flash, GPT-4o, etc. (Black-box access)

Verifier / Judge

Evaluates the new candidate against the current best solution to determine which to keep.

Model or implementation: Varies (Rule-based execution, CLIP similarity, or LLM-as-a-judge)

Feedback Formatter

Translates evaluation signals (scores or critiques) into structured prompt instructions.

Model or implementation: Deterministic formatting logic or LLM transformation

Novel Architectural Elements

Feedback integration module specifically designed to translate scalar verifier outputs into directional prompt instructions ('Improve upon X while avoiding Y').
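
To make the distinction concrete, the sketch below contrasts appending a raw score with building a directional instruction from the best and weakest responses. The function names, prompt layout, and exact wording are assumptions for illustration; the paper's actual prompts are not reproduced here.

```python
def append_raw_score(task_prompt, response, score):
    """Baseline: attach the number and leave its interpretation to the model."""
    return f"{task_prompt}\n\nPrevious response:\n{response}\nScore: {score:.2f}"

def directional_feedback(task_prompt, scored_responses):
    """Turn (response, score) pairs into an instruction that says what to beat and what to avoid."""
    if not scored_responses:
        raise ValueError("need at least one scored response")
    ranked = sorted(scored_responses, key=lambda pair: pair[1])
    worst_response, worst_score = ranked[0]
    best_response, best_score = ranked[-1]
    lines = [task_prompt, "",
             f"Best response so far (score {best_score:.2f}):", best_response]
    if len(ranked) > 1:
        lines += ["", f"Weakest response so far (score {worst_score:.2f}):", worst_response,
                  "", "Surpass the best response and avoid the mistakes made in the weakest one."]
    else:
        lines += ["", "Surpass the best response."]
    return "\n".join(lines)

history = [("<div>A</div>", 0.41), ("<div><h1>A</h1></div>", 0.63)]
prompt = directional_feedback("Convert the sketch to HTML.", history)
baseline = append_raw_score("Convert the sketch to HTML.", *history[0])
```

The directional version gives the model a target to beat and a negative example, whereas the baseline leaves it to infer what a score of 0.41 implies.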

Modeling

Base Model: Gemini-1.5-Pro, Gemini-1.5-Flash, GPT-4o, Gemini-2.5-Pro (varies by experiment)

Training Method: Inference-time optimization only (no weight updates)

Compute: NVIDIA A100-SXM4-40GB GPU for experiments; compute budget defined by number of API calls/tokens.

Comparison to Prior Work

vs. BoN: IAD is sequential and uses feedback from previous steps to improve the proposal distribution, whereas BoN samples independently.
vs. Self-Refine: IAD systematically integrates both scalar and textual feedback and conditions explicitly on 'best-so-far' and 'worst' examples, rather than just self-critique.
vs. Tree of Thoughts (ToT) [not cited in paper]: IAD focuses on iterative refinement of a full solution with feedback, whereas ToT explores a tree of partial reasoning steps.
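
The contrast with BoN in the list above can be written out using the symbols from the problem definition. This is a sketch of the two sampling schemes as the summary describes them; the feedback symbol f_t and the budget N are notation introduced here, not taken from the paper.

```latex
\begin{align*}
\text{BoN:}\quad & y_i \sim \pi_0(\cdot \mid x), \quad i = 1, \dots, N,
  & y_{\mathrm{BoN}} &= \arg\max_{i} R(x, y_i) \\
\text{IAD:}\quad & y_{t+1} \sim \pi_0(\cdot \mid x,\, y^{t},\, f_t),
  & y^{t+1} &= \arg\max_{y \in \{y^{t},\, y_{t+1}\}} R(x, y)
\end{align*}
```

In BoN the N draws share one proposal distribution, so the verifier only selects; in IAD the verifier's output also changes what the next draw is conditioned on.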

Limitations

Performance gains from feedback diminish as the compute budget (number of samples) increases significantly.
The margin of improvement narrows with more capable base models (e.g., when moving from Gemini-1.5 to Gemini-2.5).
Heavily relies on the quality of the verifier; noisy or sparse feedback significantly degrades performance.
Generating high-quality textual feedback can be expensive and slow compared to scalar signals.

Reproducibility

The paper text does not state whether code is available. Specific prompts for feedback generation are described in the methodology section. Base models are commercial APIs (Gemini, GPT).

📊 Experiments & Results

Evaluation Setup

Agentic tasks requiring reasoning, coding, and multi-modal understanding.

Benchmarks:

Sketch2Code: multi-modal generation (sketch to HTML)
Text2SQL (BIRD): code generation / semantic parsing
Intercode (Bash): code generation / agentic decision making
WebShop: web agent / decision making

Metrics:

Layout Similarity (Sketch2Code)
Execution Accuracy (Text2SQL)
Success Rate (WebShop)
Reward / Success (Intercode)
Statistical methodology: Not explicitly reported in the paper
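
As an illustration of the Text2SQL metric, execution accuracy is commonly computed by running the predicted and reference queries and comparing their result sets. The sketch below assumes that definition; `execution_match` and `execution_accuracy` are hypothetical helpers, and the benchmark's official scorer may differ in details such as row ordering or timeouts.

```python
import sqlite3

def execution_match(conn, predicted_sql, gold_sql):
    """True if the predicted query runs and returns the same set of rows as the gold query."""
    try:
        predicted_rows = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        return False  # a query that fails to execute counts as incorrect
    gold_rows = conn.execute(gold_sql).fetchall()
    return set(predicted_rows) == set(gold_rows)

def execution_accuracy(conn, pairs):
    """Fraction of (predicted, gold) pairs whose execution results match."""
    return sum(execution_match(conn, p, g) for p, g in pairs) / len(pairs)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "Ada"), (2, "Grace")])

gold = "SELECT name FROM users WHERE id = 1"
pairs = [
    ("SELECT name FROM users WHERE id=1", gold),                   # same result
    ("SELECT name FROM users", gold),                              # extra row
    ("SELEC name FROM users", gold),                               # fails to execute
    ("SELECT COUNT(id) FROM users", "SELECT COUNT(*) FROM users"),  # same count
]
accuracy = execution_accuracy(conn, pairs)
```

The same pass/fail signal can double as the scalar feedback described in the summary.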

Key Results

IAD demonstrates consistent improvements over Best-of-N (BoN) and single-turn baselines across various agentic benchmarks, particularly with lower compute budgets.

Benchmark	Metric	Baseline	This Paper	Δ
Sketch2Code	Layout Similarity	Not explicitly reported as exact number in text	Not explicitly reported as exact number in text	+3-4%
Multiple (Sketch2Code, Text2SQL, etc.)	Absolute Performance	Varies by task	Varies by task	+10%
Sketch2Code / Text2SQL	Performance Gain	Varies	Varies	+4-8%
Sketch2Code / Text2SQL	Performance Gain	Varies	Varies	+6-7%
General	Performance Gain	Varies	Varies	-4-5%

Experiment Figures

Performance curves (Layout Similarity, Text IoU, Image IoU) vs. Compute Budget for Sketch2Code across different models.

Execution accuracy vs. Number of LLM calls for Text2SQL.

Main Takeaways

Feedback is a critical knob for test-time scaling when compute (API calls) is constrained; IAD is more compute-optimal than BoN in low-budget regimes.
The value of feedback diminishes as the base model becomes stronger (e.g., Gemini-2.5 vs 1.5) or as the sample budget increases significantly.
Effective use of scalar feedback requires transforming it into directional instructions (comparing best vs. worst) rather than just appending raw scores.
IAD is robust to moderate noise in feedback but degrades to baseline performance levels when feedback becomes highly sparse or noisy.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Large Language Models (LLMs) and prompting strategies
Familiarity with reinforcement learning concepts (policies, rewards)
Knowledge of sampling strategies like Best-of-N (BoN)

Key Terms

IAD: Iterative Agent Decoding—a sequential inference framework that uses feedback to iteratively refine agent outputs.

BoN: Best-of-N—a sampling strategy where N candidate solutions are generated independently, and the best one is selected by a verifier.

Inference-time alignment: Techniques to improve model outputs during generation (test time) rather than during training, often using extra compute.

Scalar feedback: Numerical scores or binary pass/fail signals provided by a verifier.

Textual feedback: Natural language critiques or instructions describing specific errors or improvements.

Sketch2Code: A benchmark task converting wireframe sketches into functional HTML code.

Text2SQL: A task mapping natural language questions to executable SQL queries.

Verifier: A function or model that evaluates the quality of a generated response, used to guide selection or feedback.