ALRM: Agentic LLM for Robotic Manipulation

📝 Paper Summary

Robotic Manipulation, Agentic AI

ALRM couples an LLM-based planner with a modular executor that supports both code generation and iterative tool use, enabling robots to adaptively solve linguistically diverse multistep tasks.

Core Problem

Existing LLM-robotics integrations typically lack closed-loop feedback mechanisms (making them brittle during execution) and rely on benchmarks with limited linguistic variation and reasoning depth.

Why it matters:

Rigid, open-loop pipelines (like early Code-as-Policy) cannot reflect on outcomes or correct errors during execution
Current benchmarks focus on low-level control or simple instruction phrasing, failing to test whether agents can handle abstract reasoning or diverse user commands
Robotic systems need to handle multistep dependencies (e.g., 'move X before Y'), which require dynamic planning rather than static script generation

Concrete Example: Given the instruction 'pick the two fruits with the lowest calories,' an open-loop pipeline might fail to identify the target objects, or halt as soon as a grasp fails. ALRM's agentic loop lets the system first query object properties, reason about which objects to pick, and retry if a specific tool call fails.
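
The behavior in this example can be sketched in a few lines. This is a minimal illustration under stated assumptions, not ALRM's actual API: `get_object_properties`, `pick_with_retry`, the calorie values, and the simulated grasp failure are all hypothetical stand-ins.

```python
from typing import Callable, Dict, List

# Hypothetical perception result; object names and calorie values are illustrative only.
SCENE: Dict[str, Dict[str, object]] = {
    "apple": {"category": "fruit", "calories": 52},
    "banana": {"category": "fruit", "calories": 89},
    "lemon": {"category": "fruit", "calories": 29},
    "spoon": {"category": "utensil", "calories": 0},
}


def get_object_properties() -> Dict[str, Dict[str, object]]:
    """Stand-in for a perception tool call (assumed, not from the paper)."""
    return SCENE


def select_lowest_calorie_fruits(props: Dict[str, Dict[str, object]], k: int = 2) -> List[str]:
    """Reasoning step: keep only fruits, then take the k with the fewest calories."""
    fruits = [name for name, p in props.items() if p["category"] == "fruit"]
    return sorted(fruits, key=lambda name: props[name]["calories"])[:k]


def pick_with_retry(pick: Callable[[str], bool], name: str, max_attempts: int = 3) -> int:
    """Call the pick tool until it succeeds; return the number of attempts used."""
    for attempt in range(1, max_attempts + 1):
        if pick(name):
            return attempt
    raise RuntimeError(f"could not pick {name} after {max_attempts} attempts")


def make_flaky_pick(fail_first: int) -> Callable[[str], bool]:
    """Simulated gripper whose first `fail_first` calls fail."""
    calls = {"n": 0}

    def pick(name: str) -> bool:
        calls["n"] += 1
        return calls["n"] > fail_first

    return pick


targets = select_lowest_calorie_fruits(get_object_properties())
flaky_pick = make_flaky_pick(fail_first=1)
attempts = {name: pick_with_retry(flaky_pick, name) for name in targets}
```

Here the first grasp fails once and is retried, while the second succeeds immediately; an open-loop script would have stopped at the first failure.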

Key Novelty

Dual-Mode Agentic Execution (CaP & TaP) within a ReAct Loop

Integrates a high-level Task Planner (ReAct-based) with a specialized Task Executor that can switch between generating full Python scripts (Code-as-Policy) and issuing iterative function calls (Tool-as-Policy)
Introduces a linguistically diverse benchmark where every canonical task is paraphrased into distinct categories (Lexical, Syntactical, Semantic, High-level reasoning) to stress-test understanding

Architecture

The ALRM architecture, illustrating the flow between the Task Planner Agent, Task Executor Agent, and the API Server/Robot.

Evaluation Highlights

Claude-4.1-Opus achieves the highest success rates among closed-source models: 93.5% in Tool-as-Policy mode and 92.6% in Code-as-Policy mode
Falcon-H1-7B achieves 84.3% success in Code-as-Policy mode, matching DeepSeek-V3.1 while requiring less than half the execution latency
The results indicate that Code-as-Policy is generally faster (single generation), while Tool-as-Policy offers finer-grained error correction via the agentic loop

Breakthrough Assessment

7/10

Solid integration of agentic reasoning into robotics with a useful dual-mode execution strategy. The new linguistically diverse benchmark addresses a significant gap in evaluating robotic reasoning robustness.

⚙️ Technical Details

Problem Definition

Setting: Robotic manipulation in a simulated environment controlled by natural language instructions requiring reasoning and multistep actions

Inputs: Natural language user instruction (e.g., 'Move the spoon, the coke, and the spatula to the basket')

Outputs: Sequence of robotic actions (pick, place, move) executed in the Gazebo simulation

Pipeline Flow

Task Planner Agent (ReAct loop for high-level decomposition)
Task Executor Agent (Converts subtasks to executable actions via CaP or TaP)
API Server (Executes actions on Simulated Robot and returns observations)
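
The three stages above can be sketched as a small loop. This is a minimal sketch under stated assumptions: the planner and executor are stubbed with fixed rules instead of LLM calls, and every name (`plan_subtasks`, `execute_subtask`, `ApiServer`) is hypothetical rather than taken from the ALRM codebase.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ApiServer:
    """Stand-in for the API server: applies actions to a toy world state."""
    locations: Dict[str, str] = field(default_factory=dict)

    def pick_and_place(self, obj: str, target: str) -> str:
        if obj not in self.locations:
            return f"error: unknown object {obj}"
        self.locations[obj] = target
        return f"ok: {obj} is now in {target}"


def plan_subtasks(objects: List[str], target: str) -> List[str]:
    """Planner stub: one natural-language subtask per object (an LLM would do this)."""
    return [f"move {obj} to {target}" for obj in objects]


def execute_subtask(subtask: str, server: ApiServer) -> str:
    """Executor stub: ground a 'move X to Y' subtask into a single API call."""
    _, obj, _, target = subtask.split()
    return server.pick_and_place(obj, target)


def run(objects: List[str], target: str, server: ApiServer) -> List[str]:
    """Planner -> executor -> API server, collecting the returned observations."""
    observations = []
    for subtask in plan_subtasks(objects, target):
        observations.append(execute_subtask(subtask, server))
    return observations


server = ApiServer(locations={"spoon": "table", "coke": "table", "spatula": "table"})
observations = run(["spoon", "coke", "spatula"], "basket", server)
```

In an agentic setup the observations would be fed back to the planner so it can re-plan; the stub here only collects them.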

System Modules

Task Planner Agent

Decomposes user requests into high-level natural language subtasks using a ReAct loop

Model or implementation: Various LLMs (e.g., Claude, Falcon, DeepSeek)

Task Executor Agent

Translates natural language subtasks into specific API calls via Code generation (CaP) or Tool calling (TaP)

Model or implementation: Various LLMs (Shared with Planner or separate)

API Server

Interfaces with ROS/MoveIt to execute commands on the simulated robot and retrieve perception data

Model or implementation: Deterministic Code (ROS/MoveIt wrapper)

Novel Architectural Elements

Dual-mode execution module enabling dynamic switching between Code-as-Policy (batch execution) and Tool-as-Policy (interactive execution) within the same agentic framework
Integration of a high-level ReAct planner with a specialized low-level executor agent, separating task decomposition from API grounding
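
The difference between the two execution modes can be illustrated with a toy executor. This is a sketch under stated assumptions, not the paper's implementation: `ToyRobot`, `run_cap`, and `run_tap` are hypothetical names, and the hard-coded script and call list stand in for LLM output.

```python
from typing import Callable, Dict, List, Optional, Tuple


class ToyRobot:
    """Hypothetical robot API; the real ALRM tool set is not shown in this summary."""

    def __init__(self) -> None:
        self.held: Optional[str] = None
        self.placed: List[Tuple[str, str]] = []

    def pick(self, obj: str) -> str:
        self.held = obj
        return f"picked {obj}"

    def place(self, target: str) -> str:
        if self.held is None:
            return "error: gripper is empty"
        self.placed.append((self.held, target))
        self.held = None
        return f"placed on {target}"


def run_cap(script: str, robot: ToyRobot) -> None:
    """Code-as-Policy: execute one generated script in a single pass."""
    # Only the robot object is exposed; a real system would need proper sandboxing.
    exec(script, {"__builtins__": {}}, {"robot": robot})


def run_tap(calls: List[Tuple[str, str]], robot: ToyRobot) -> List[str]:
    """Tool-as-Policy: dispatch one tool call at a time and inspect each observation."""
    tools: Dict[str, Callable[[str], str]] = {"pick": robot.pick, "place": robot.place}
    observations: List[str] = []
    for name, arg in calls:
        obs = tools[name](arg)
        observations.append(obs)
        if obs.startswith("error"):
            break  # an agent would re-plan here instead of continuing blindly
    return observations


# Batch mode: the whole subtask runs as one script.
cap_robot = ToyRobot()
run_cap("robot.pick('spoon')\nrobot.place('basket')", cap_robot)

# Interactive mode: a wrong first call is caught after a single step.
tap_robot = ToyRobot()
tap_obs = run_tap([("place", "basket"), ("pick", "spoon")], tap_robot)
```

The batch path costs one generation but only surfaces problems after the script has run, whereas the interactive path sees an observation after every call, at the price of more round trips.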

Modeling

Base Model: Evaluated on 10 different LLMs, including Claude-4.1-Opus, Falcon-H1-7B, and DeepSeek-V3.1

Training Method: In-context learning / Prompting (System does not involve fine-tuning the LLMs)

Compute: Not reported in the paper

Comparison to Prior Work

vs. Code-as-Policy: ALRM adds a ReAct agent wrapper allowing reflection and retry, whereas standard CaP is often one-shot
vs. SayCan: ALRM generates explicit code/tools for arbitrary logic rather than selecting from a fixed library of primitive skills
vs. ProgPrompt: ALRM incorporates runtime monitoring and separate planner/executor agents, while ProgPrompt is largely static plan generation
vs. AutoGPT [not cited in paper]: ALRM is specialized for robotics with defined physical API constraints, whereas AutoGPT is a general-purpose agentic loop

Limitations

Reliance on simulation (Gazebo); real-world transfer not evaluated
Performance is heavily dependent on the underlying LLM's reasoning and coding capability
Latency can be high for the Tool-as-Policy mode due to multiple sequential LLM calls
Limited to single-arm manipulation tasks

Reproducibility

Code: https://tiiuae.github.io/ALRM

Code and benchmark data available at https://tiiuae.github.io/ALRM. The simulation environment uses Gazebo, ROS, and MoveIt. The framework relies on external LLM APIs (or local weights for open models) which are not part of the repo itself.

📊 Experiments & Results

Evaluation Setup

Gazebo simulation with an Interbotix wx250s arm

Benchmarks:

ALRM Benchmark (Robotic Manipulation, Pick-and-Place) [New]

Metrics:

Task Success Rate
Execution Latency
Statistical methodology: Not explicitly reported in the paper
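
The two metrics can be computed from per-episode logs roughly as follows. The record layout and the sample episodes are made-up placeholders for illustration, not results from the paper; only the paraphrase category names come from the benchmark description.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass
class Episode:
    category: str     # paraphrase category of the instruction
    success: bool     # whether the task goal was reached
    latency_s: float  # wall-clock execution time in seconds


def success_rate_by_category(episodes: List[Episode]) -> Dict[str, float]:
    """Percentage of successful episodes within each paraphrase category."""
    buckets: Dict[str, List[bool]] = defaultdict(list)
    for ep in episodes:
        buckets[ep.category].append(ep.success)
    return {cat: 100.0 * sum(flags) / len(flags) for cat, flags in buckets.items()}


def mean_latency(episodes: List[Episode]) -> float:
    """Average execution latency over all episodes."""
    return mean(ep.latency_s for ep in episodes)


# Placeholder episodes, invented purely to exercise the functions.
episodes = [
    Episode("Lexical", True, 10.0),
    Episode("Lexical", True, 14.0),
    Episode("Semantic", True, 20.0),
    Episode("Semantic", False, 30.0),
]
rates = success_rate_by_category(episodes)
avg_latency = mean_latency(episodes)
```

Breaking success rate down by category is what lets the benchmark show where a model's robustness drops, rather than reporting a single aggregate number.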

Key Results

Comparison of different LLMs acting as the agent in ALRM, evaluated in both Tool-as-Policy (TaP) and Code-as-Policy (CaP) modes.
Benchmark	Model	Metric	Baseline	This Paper	Δ
ALRM Benchmark	Claude-4.1-Opus	Success Rate, TaP (%)	Not reported in the paper	93.5	Not reported in the paper
ALRM Benchmark	Claude-4.1-Opus	Success Rate, CaP (%)	Not reported in the paper	92.6	Not reported in the paper
ALRM Benchmark	Falcon-H1-7B (baseline: DeepSeek-V3.1)	Success Rate, CaP (%)	84.3	84.3	0.0

Experiment Figures

Visualizations of the three simulation environments used in the benchmark.

Main Takeaways

Code-as-Policy (CaP) is generally faster than Tool-as-Policy (TaP) because it generates the entire subtask logic in a single pass, whereas TaP requires iterative LLM calls.
Claude-4.1-Opus is the top-performing closed-source model in both execution modes (93.5% in TaP, 92.6% in CaP), suggesting strong reasoning on high-level tasks.
Falcon-H1-7B proves to be an efficient open-source alternative for Code-as-Policy, matching the 84.3% success rate of DeepSeek-V3.1 with less than half the execution latency.
The benchmark's diverse linguistic categories (Lexical, Syntactical, Semantic, High-level reasoning) effectively reveal gaps in model robustness beyond simple command following.

📚 Prerequisite Knowledge

Prerequisites

Basic understanding of Large Language Models (LLMs) and prompting
Familiarity with robotic control stacks (ROS, MoveIt)
Knowledge of agentic frameworks (ReAct, Tool use)

Key Terms

CaP: Code-as-Policy—An approach where the LLM generates executable Python code (calling robot APIs) to perform a task in one go

TaP: Tool-as-Policy—An approach where the LLM executes actions by calling specific tools (functions) iteratively, allowing for intermediate feedback

ReAct: Reason+Act—A paradigm where the agent interleaves reasoning traces ('Thought') with action execution, allowing it to plan and adjust based on observations

ROS: Robot Operating System—A set of software libraries and tools that help build robot applications

MoveIt: A motion planning framework for ROS that calculates the trajectories needed to move a robot arm from point A to point B

Gazebo: A 3D robotics simulator used to test algorithms and robot designs in realistic physical environments

VLA: Vision-Language-Action—Models trained end-to-end to output robot actions directly from visual and text inputs