ReAct: Synergizing Reasoning and Acting in Language Models

📝 Paper Summary

Agentic RAG pipeline
Multi-call tool use with flexible plan

ReAct prompts language models to alternate between generating internal reasoning traces and external actions, allowing them to dynamically update plans and gather information to solve complex tasks.

Core Problem

Prior approaches treat reasoning (e.g., Chain-of-Thought) and acting (e.g., WebGPT) as separate capabilities, which leads to hallucination in reasoning tasks and inefficient planning in interactive tasks.

Why it matters:

Chain-of-thought reasoning suffers from fact hallucination and error propagation because it is not grounded in the external world.
Action-only models struggle with long-horizon goals and exception handling because they lack a mechanism to maintain and update high-level plans.
Combining both capabilities is essential for general intelligence that can learn new tasks quickly and handle unseen circumstances.

Concrete Example: In HotpotQA, a reasoning-only model hallucinates that 'Apple Remote was designed to control Apple TV' (wrong). An act-only model searches 'Apple Remote' but fails to synthesize the finding that it controls 'Front Row'. ReAct reasons about what to search ('search Apple Remote'), executes the search, reads the observation ('controls Front Row'), and updates its plan to search for 'Front Row' next.

Key Novelty

Interleaved Thought-Action Loop (ReAct)

Augments the action space to include 'thoughts'—free-form language steps that do not affect the environment but update the context for future steps.
Prompts the model to generate a sequence of Thought → Action → Observation, enabling dynamic planning (reason to act) and information incorporation (act to reason).

Architecture

Comparison of four prompting methods (Standard, CoT, Act-only, ReAct) on HotpotQA and ALFWorld.

Evaluation Highlights

Outperforms imitation and reinforcement learning baselines on ALFWorld (text game) by 34% (absolute success rate) with only two in-context examples.
Improves success rate on WebShop (online shopping) by 10% absolute over imitation learning baselines using one-shot prompting.
Reduces hallucination rates in HotpotQA compared to Chain-of-Thought (6% vs 14% false positives) by grounding reasoning in external Wikipedia retrievals.

Breakthrough Assessment

9/10

Seminal paper that established the standard paradigm for LLM agents. It moved the field from static reasoning (CoT) to dynamic agentic loops, influencing almost all subsequent agent frameworks.

⚙️ Technical Details

Problem Definition

Setting: Interactive task solving where an agent receives observation o_t, takes action a_t, and maintains context c_t.

Inputs: Context c_t containing history of observations, actions, and thoughts.

Outputs: Next action a_t (external) or thought â_t (internal language trace).

Pipeline Flow

Input (Task/Context)
LLM Generation (Thought or Action)
Environment Execution (if Action)
Context Update
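The four stages above can be sketched as a short loop. This is only an illustrative sketch, not the authors' implementation: the `llm` and `env_step` callables, the `Thought:`/`Action:`/`Observation:` line prefixes, the `max_steps` cap, and the scripted stand-ins are all assumptions made for the example.

```python
from typing import Callable, List, Tuple


def react_loop(
    task: str,
    llm: Callable[[str], str],
    env_step: Callable[[str], Tuple[str, bool]],
    max_steps: int = 8,
) -> List[str]:
    """Run a Thought/Action/Observation loop and return the trajectory lines."""
    context = [f"Task: {task}"]
    for _ in range(max_steps):
        # LLM generation: the model sees the whole context so far.
        step = llm("\n".join(context)).strip()
        context.append(step)
        if step.startswith("Thought:"):
            # A thought only updates the context; the environment is not called.
            continue
        # Environment execution, followed by the context update.
        observation, done = env_step(step)
        context.append(f"Observation: {observation}")
        if done:
            break
    return context


# Hypothetical stand-ins: a scripted "LLM" and a canned environment.
def scripted_llm(prompt: str) -> str:
    script = [
        "Thought: I need to find what the Apple Remote controls.",
        "Action: search[Apple Remote]",
        "Thought: The observation says it controls Front Row.",
        "Action: finish[Front Row]",
    ]
    # Count the model-generated lines already present in the prompt.
    n = sum(line.startswith(("Thought:", "Action:")) for line in prompt.split("\n"))
    return script[n]


def toy_env(action: str) -> Tuple[str, bool]:
    if action.startswith("Action: finish["):
        return "Episode finished.", True
    return "The Apple Remote was designed to control Front Row.", False


trajectory = react_loop("What program does the Apple Remote control?", scripted_llm, toy_env)
```

In a real setup the scripted function would be replaced by a call to a frozen LLM and `toy_env` by a Wikipedia API, ALFWorld, or WebShop wrapper; the loop itself stays the same.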

System Modules

Prompt Constructor

Concatenates few-shot examples (human trajectories of thought-action-observation) with the current task description.

Model or implementation: Frozen PaLM-540B
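As a rough illustration of the prompt-construction step, the sketch below joins exemplar trajectories with a new task. The function name `build_prompt`, the `Question:` prefix, and the placeholder exemplar are assumptions for this example; the actual prompts are the ones given in the paper's appendix.

```python
from typing import List


def build_prompt(few_shot_trajectories: List[str], task: str) -> str:
    """Join exemplar trajectories and the new task into one prompt string."""
    blocks = [t.strip() for t in few_shot_trajectories]
    # The new task goes last so the model continues from it.
    blocks.append(f"Question: {task.strip()}")
    return "\n\n".join(blocks) + "\n"


# Placeholder exemplar (hypothetical content, same layout as a trajectory).
exemplar = (
    "Question: <exemplar question>\n"
    "Thought: <reasoning about what to look up>\n"
    "Action: search[<entity>]\n"
    "Observation: <retrieved text>\n"
    "Action: finish[<answer>]"
)

prompt = build_prompt([exemplar], "What program does the Apple Remote control?")
```

Because the model is frozen, the exemplars in this string are the only place where task-specific behavior is specified.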

Agent (LLM)

Generates the next step, which can be a 'Thought' (reasoning) or an 'Action' (environment interaction).

Model or implementation: Frozen PaLM-540B

Environment

Executes external actions and returns observations (Wikipedia API, ALFWorld engine, or WebShop browser).

Model or implementation: External Simulator/API

Novel Architectural Elements

Augmented action space Â = A ∪ L, where L is the space of language (thoughts).
Interleaved trajectory structure: Context contains alternating [Observation, Thought, Action, Observation...] sequences.
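One way to picture the augmented action space is as a tagged union: a step is either a thought (an element of L) or an environment action (an element of A). The sketch below assumes a `Thought: ...` / `Action: name[argument]` text format and hypothetical class names; it is an illustration, not the paper's code.

```python
import re
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Thought:
    """A step in L: free-form language that never reaches the environment."""
    text: str


@dataclass(frozen=True)
class Action:
    """A step in A: something the environment can execute."""
    name: str
    argument: str


Step = Union[Thought, Action]  # the augmented space: A ∪ L

_ACTION_RE = re.compile(r"^Action:\s*(\w+)\[(.*)\]\s*$")


def parse_step(line: str) -> Step:
    """Classify one generated line as a Thought or an Action."""
    line = line.strip()
    match = _ACTION_RE.match(line)
    if match:
        return Action(name=match.group(1), argument=match.group(2))
    if line.startswith("Thought:"):
        return Thought(text=line[len("Thought:"):].strip())
    raise ValueError(f"Unrecognized step: {line!r}")


def affects_environment(step: Step) -> bool:
    """Only actions are sent to the environment; thoughts just extend the context."""
    return isinstance(step, Action)
```

For example, `parse_step("Action: search[Apple Remote]")` yields an `Action` with name `search`, while a line starting with `Thought:` yields a `Thought` that is appended to the context without producing an observation.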

Modeling

Base Model: PaLM-540B

Training Method: Fine-tuning (Bootstrap)

Objective Functions:

Purpose: Teach smaller models (PaLM-8B/62B) to perform ReAct by mimicking successful trajectories from the large model.

Formally: Standard language modeling loss on trajectories generated by PaLM-540B that achieved the correct answer.

Training Data:

3,000 correct trajectories generated by ReAct (PaLM-540B) on HotpotQA training set.
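The bootstrapping step amounts to keeping only trajectories whose final answer is correct and using them as fine-tuning text. A minimal sketch follows, assuming each record is a dict with hypothetical `trajectory`, `predicted`, and `gold` keys and a simple case-insensitive string comparison as the correctness check.

```python
from typing import Dict, List


def select_correct_trajectories(records: List[Dict[str, str]]) -> List[str]:
    """Keep the trajectories whose predicted answer matches the gold answer."""
    kept = []
    for record in records:
        predicted = record["predicted"].strip().lower()
        gold = record["gold"].strip().lower()
        if predicted == gold:
            kept.append(record["trajectory"])
    return kept


# Hypothetical generated records from the large model.
records = [
    {"trajectory": "Thought: ...\nAction: finish[Front Row]", "predicted": "Front Row", "gold": "front row"},
    {"trajectory": "Thought: ...\nAction: finish[Apple TV]", "predicted": "Apple TV", "gold": "front row"},
]

finetune_texts = select_correct_trajectories(records)
```

The surviving strings would then be used as ordinary language-modeling targets for the smaller model.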

Key Hyperparameters:

decoding_temperature: 0.7 (for CoT-SC sampling)
samples_n: 21 (for CoT-SC)
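These two settings belong to the CoT-SC baseline, which samples several reasoning paths and takes a majority vote over their final answers. The sketch below shows only the voting step; `sample_answer` is a hypothetical callable standing in for one sampled CoT run at temperature 0.7, and the cycling fake sampler exists purely to make the example deterministic.

```python
from collections import Counter
from itertools import cycle
from typing import Callable


def self_consistency_answer(sample_answer: Callable[[], str], n_samples: int = 21) -> str:
    """Draw n_samples answers and return the most frequent one."""
    answers = [sample_answer() for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]


# Hypothetical sampler: two of every three samples answer "Front Row".
_fake_samples = cycle(["Front Row", "Front Row", "Apple TV"])
answer = self_consistency_answer(lambda: next(_fake_samples), n_samples=21)
```

An odd sample count such as 21 avoids ties when only two distinct answers appear.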

Compute: PaLM-540B inference (heavy compute). Fine-tuning done on PaLM-8B and PaLM-62B.

Comparison to Prior Work

vs. CoT: ReAct interacts with external environments to ground reasoning, whereas CoT relies solely on internal weights.
vs. WebGPT/Act-only: ReAct maintains a verbal working memory (thoughts) to track goals/subgoals, whereas Act-only predicts actions directly from history.
vs. Inner Monologue: ReAct generates free-form thoughts proactively, whereas Inner Monologue relies on pre-defined feedback injected from the environment.
vs. MRKL [not cited in paper]: MRKL routes to tools via a neural router, while ReAct interleaves reasoning and tool use in a single stream.

Limitations

The fixed Reason-Act-Observe structure can reduce flexibility compared to pure CoT in tasks that do not require retrieval.
Dependent on the retrieval quality; non-informative search results can derail reasoning.
Higher inference latency and cost due to multiple calls to the LLM and environment per task.
Performance heavily dependent on the capability of the base LLM (requires PaLM-540B or GPT-3 for best results).

Reproducibility

Code: https://react-lm.github.io/

The link above is the project page, which includes the code. PaLM-540B is not open source. The authors provide GPT-3 results in the appendix for reproducibility, and the prompts for all tasks are detailed in the Appendix.

📊 Experiments & Results

Evaluation Setup

Few-shot prompting on reasoning (QA, Fact Verification) and decision making (Text Game, Web Shopping) tasks.

Benchmarks:

HotpotQA (Multi-hop Question Answering)
FEVER (Fact Verification)
ALFWorld (Text-based Embodied Game)
WebShop (Online Shopping Navigation)

Metrics:

Exact Match (EM)
Accuracy
Success Rate (SR)
Score (WebShop specific metric)
Statistical methodology: Not explicitly reported in the paper
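Exact Match is usually computed after light answer normalization. The sketch below uses a common normalization (lowercasing, stripping punctuation and English articles, collapsing whitespace); this is an assumption for illustration, since the summary does not specify the paper's scoring script.

```python
import re
import string
from typing import Sequence


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> bool:
    """True when the normalized prediction equals the normalized gold answer."""
    return normalize_answer(prediction) == normalize_answer(gold)


def em_score(predictions: Sequence[str], golds: Sequence[str]) -> float:
    """Percentage of predictions that exactly match their gold answers."""
    pairs = list(zip(predictions, golds))
    return 100.0 * sum(exact_match(p, g) for p, g in pairs) / len(pairs)
```

With this normalization, `exact_match("The Front Row.", "front row")` is true, so surface differences in casing, articles, and punctuation do not count as errors.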

Key Results

Reasoning task results showing ReAct competitive with CoT but superior when combined (best of both worlds).

Benchmark	Metric	Baseline	This Paper	Δ
HotpotQA	Exact Match (EM)	28.7	27.4	-1.3
FEVER	Accuracy	57.1	60.9	+3.8
HotpotQA	Exact Match (EM)	33.4	35.1	+1.7

Decision making results showing large gains over imitation/RL baselines.

Benchmark	Metric	Baseline	This Paper	Δ
ALFWorld	Success Rate (SR)	37	71	+34
WebShop	Success Rate (SR)	30.1	40.0	+9.9
WebShop	Success Rate (SR)	28.7	40.0	+11.3

Experiment Figures

Scaling behavior of prompting vs fine-tuning for different model sizes on HotpotQA.

Main Takeaways

Synergy: ReAct combines the reasoning benefits of CoT (planning, decomposition) with the grounding benefits of Act (external information retrieval), reducing hallucinations.
Efficiency: In interactive tasks (ALFWorld, WebShop), ReAct prompted with only one or two in-context examples outperforms Imitation/RL methods trained on thousands of examples.
Interpretability: ReAct traces are human-readable, allowing for diagnosis of errors (e.g., 'Reasoning error' vs 'Search result error') and manual intervention.
Finetuning Potential: While prompting PaLM-540B is best, fine-tuning smaller models (PaLM-8B/62B) on ReAct trajectories significantly outperforms fine-tuning them on Standard or CoT data.

📚 Prerequisite Knowledge

Prerequisites

Large Language Models (LLMs) and In-context Learning
Chain-of-Thought (CoT) prompting
Basic Reinforcement Learning concepts (Observation, Action, Policy)

Key Terms

ReAct (Reason + Act): A prompting paradigm where LLMs generate both reasoning traces and task-specific actions in an interleaved manner.

Chain-of-Thought (CoT): A prompting method where models generate intermediate reasoning steps before the final answer.

Act-only: A baseline where the model generates actions directly based on observations without explicit reasoning traces.

Hallucination: When a language model generates plausible-sounding but factually incorrect information.

ALFWorld: A synthetic text-based game benchmark requiring embodied reasoning and multi-step planning in a household environment.

WebShop: A benchmark simulating an online shopping website where agents must follow instructions to find and buy products.

CoT-SC: Chain-of-Thought with Self-Consistency—sampling multiple reasoning paths and taking the majority vote answer.

Zero-shot / Few-shot: Providing the model with zero or a few examples of the task in the prompt to guide its behavior.

Imitation Learning (IL): Training an agent to mimic expert demonstrations.

Reinforcement Learning (RL): Training an agent to maximize a reward signal through trial and error.