Towards Embodied Agentic AI: Review and Classification of LLM- and VLM-Driven Robot Autonomy and Interaction

📝 Paper Summary

Robotic Middleware Integration
Vision-Language-Action (VLA) Models

The authors propose a taxonomy for 'Agentic AI' in robotics, classifying how foundation models interface with middleware like ROS as translators, orchestrators, or embedded policies rather than just end-to-end learners.

Core Problem

Prior surveys focus primarily on end-to-end multimodal learning or high-level planning, neglecting the emerging software design patterns where AI agents interface with standard robotic middleware and tools.

Why it matters:

Practical deployment relies on integrating LLMs with existing, tested software stacks (like ROS) rather than replacing them entirely
Many impactful developments are community-driven (GitHub projects, MCP servers) and remain underrepresented in academic literature
There is a lack of clear terminology distinguishing 'end-to-end' control from modular 'agentic' middleware approaches

Concrete Example: Early approaches like ros2ai simply translated text into CLI commands. Newer frameworks like ROSA must maintain state, validate parameters, and safely coordinate multiple tools (navigation, manipulation), which requires a structured architecture beyond simple translation.
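To make the gap between plain translation and a state-aware agent concrete, here is a minimal sketch of the two extra responsibilities named above: keeping state and validating parameters before a command is issued. Everything in it is an assumption for illustration (the `AgentState` fields, the workspace limits, the `request_navigation` name); it does not reproduce ROSA's actual implementation.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical workspace limits in metres; real limits depend on the robot and map.
X_RANGE = (-5.0, 5.0)
Y_RANGE = (-5.0, 5.0)


@dataclass
class AgentState:
    """Minimal state an agent might carry between user requests."""
    last_goal: Optional[Tuple[float, float]] = None
    busy: bool = False


def validate_goal(x: float, y: float) -> List[str]:
    """Return a list of problems; an empty list means the goal is acceptable."""
    problems = []
    if not X_RANGE[0] <= x <= X_RANGE[1]:
        problems.append(f"x={x} outside {X_RANGE}")
    if not Y_RANGE[0] <= y <= Y_RANGE[1]:
        problems.append(f"y={y} outside {Y_RANGE}")
    return problems


def request_navigation(state: AgentState, x: float, y: float) -> str:
    """Accept a goal only if the robot is idle and the parameters are valid."""
    if state.busy:
        return "rejected: robot is busy"
    problems = validate_goal(x, y)
    if problems:
        return "rejected: " + "; ".join(problems)
    state.last_goal = (x, y)
    state.busy = True
    return "accepted"
```

A pure translator would forward any goal it could parse; the checks here are what lets the agent refuse an out-of-range or conflicting request before it reaches the robot.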

Key Novelty

Taxonomy of Agentic Integration and Roles

Classifies integration into four distinct approaches: Protocol (translator), Interface (interactive loop), Orchestration (resource manager), and Embedded (direct policy)
Distinguishes agent roles based on functional design: Planners (generate sequence upfront) vs. Orchestrators (active runtime management of subsystems)
Highlights the shift from centralized control to decentralized protocols (e.g., FABRIC in OpenMind) and plugin-based architectures (MCP servers)

Breakthrough Assessment

7/10

A timely systematization of the rapidly growing 'middleware agent' space in robotics. While it doesn't propose a new model, the taxonomy provides necessary structure for comparing disparate frameworks like ROSA, RAI, and RT-2.

⚙️ Technical Details

Problem Definition

Setting: Review and classification of software architectures integrating Foundation Models with Robotic Systems

Inputs: Academic papers, GitHub repositories, and industrial frameworks (2022-2025)

Outputs: Taxonomy of integration approaches and agent roles

Pipeline Flow

Protocol Integration (Translator)
Interface Integration (Interactive)
Orchestration-Oriented Integration (Manager)
Embedded/Direct Integration (Policy)

System Modules

Protocol Integration (Taxonomy Category)

Acts as a translator between user input and predefined toolsets/protocols

Model or implementation: Generic LLMs (e.g., GPT-4, Claude)
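As a rough illustration of the translator role, the sketch below maps a structured intent (the kind of output an LLM might be prompted to produce) onto a fixed set of command templates. The `ros2 topic list`, `ros2 node list`, and `ros2 topic echo` subcommands exist in the ROS 2 CLI; the intent names, the `TOOLSET` table, and `translate` are hypothetical and not taken from any surveyed project.

```python
# Predefined toolset: intent name -> command template.
# The intent names are hypothetical; only the ros2 subcommands are real.
TOOLSET = {
    "list_topics": "ros2 topic list",
    "list_nodes": "ros2 node list",
    "echo_topic": "ros2 topic echo {topic}",
}


def translate(intent: str, **params: str) -> str:
    """Turn a structured intent into a command string, or refuse unknown intents."""
    if intent not in TOOLSET:
        raise ValueError(f"intent {intent!r} is not in the predefined toolset")
    # str.format raises KeyError if a required parameter such as `topic` is missing.
    return TOOLSET[intent].format(**params)


if __name__ == "__main__":
    print(translate("echo_topic", topic="/scan"))
```

The translator is stateless: nothing from one call influences the next, which is the property that separates this category from the interactive and orchestrating ones.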

Interface Integration (Taxonomy Category)

Provides interactive loops connecting user, robot, and environment; tool outputs affect future commands

Model or implementation: Agentic Frameworks (e.g., ROSA, RAI)
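The defining property here, that tool outputs affect future commands, can be sketched as a loop over a growing history. In this toy version a hand-written `choose_next` function stands in for the LLM, and the three tools return canned strings; all names and the battery threshold are assumptions for illustration, not the ROSA or RAI API.

```python
from typing import Callable, Dict, List, Optional, Tuple

History = List[Tuple[str, str]]  # (tool name, tool output) pairs


def read_battery() -> str:
    return "battery=15"  # canned sensor reading


def dock() -> str:
    return "docked"


def continue_patrol() -> str:
    return "patrolling"


TOOLS: Dict[str, Callable[[], str]] = {
    "read_battery": read_battery,
    "dock": dock,
    "continue_patrol": continue_patrol,
}


def choose_next(history: History) -> Optional[str]:
    """Stub standing in for the LLM: pick the next tool from past outputs."""
    if not history:
        return "read_battery"
    last_tool, last_output = history[-1]
    if last_tool == "read_battery":
        level = int(last_output.split("=")[1])
        return "dock" if level < 20 else "continue_patrol"
    return None  # nothing left to do


def run_session(max_steps: int = 5) -> History:
    """Call tools until the chooser stops or the step budget runs out."""
    history: History = []
    for _ in range(max_steps):
        tool = choose_next(history)
        if tool is None:
            break
        history.append((tool, TOOLS[tool]()))
    return history
```

The step budget is a deliberate design choice: an interactive loop driven by a model needs an external bound so that a confused chooser cannot run forever.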

Orchestration-Oriented Integration (Taxonomy Category)

Manages resources, tools, or subsystems; coordinates multiple agents

Model or implementation: Multi-Agent Systems
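One way to picture the resource-management role is a registry that hands tasks to whichever subsystem offers the needed capability and is currently free. The sketch below assumes a very simple model (one task per subsystem, capabilities as plain strings); the `Subsystem` and `Orchestrator` classes are hypothetical and not drawn from any framework in the survey.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Set


@dataclass
class Subsystem:
    name: str
    capabilities: Set[str]
    busy: bool = False


class Orchestrator:
    """Tracks subsystem availability and assigns tasks by capability."""

    def __init__(self, subsystems: Iterable[Subsystem]) -> None:
        self.subsystems = list(subsystems)

    def assign(self, capability: str) -> Optional[str]:
        """Reserve the first idle subsystem with the capability; None if unavailable."""
        for sub in self.subsystems:
            if capability in sub.capabilities and not sub.busy:
                sub.busy = True
                return sub.name
        return None

    def release(self, name: str) -> None:
        """Mark a subsystem as idle again once its task has finished."""
        for sub in self.subsystems:
            if sub.name == name:
                sub.busy = False
```

For example, with a navigation stack and an arm registered, a second `assign("navigate")` returns `None` until the first navigation task is released, which is the kind of conflict a stateless translator cannot detect.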

Embedded Integration (Taxonomy Category)

Directly produces actions or perception outputs; often end-to-end

Model or implementation: VLA / LBM (Large Behavior Models)
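In the embedded case there is no tool layer at all: a single function maps an observation and an instruction straight to an action. The sketch below shows only that interface shape. The `toy_policy` body is a hand-written heuristic standing in for a learned model, and the `Action` fields are assumptions; a real VLA would replace the heuristic with network inference.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Action:
    dx: float             # lateral command in [-1, 1], arbitrary units
    gripper_closed: bool  # whether to close the gripper


def toy_policy(image: List[List[float]], instruction: str) -> Action:
    """Stand-in for a learned policy: steer toward the brightest image column."""
    width = len(image[0])
    col_sums = [sum(row[c] for row in image) for c in range(width)]
    brightest = col_sums.index(max(col_sums))
    center = (width - 1) / 2
    dx = (brightest - center) / center if center else 0.0
    # Crude language conditioning: close the gripper only if asked to grasp.
    return Action(dx=dx, gripper_closed="grasp" in instruction.lower())
```

The point of the sketch is the signature, `(image, instruction) -> Action`, which is what distinguishes this category from the three middleware-mediated ones.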

Novel Architectural Elements

Classification of 'MCP Servers' as a distinct plugin-based integration pattern
Differentiation between 'Planner Agents' (generate plan upfront) and 'Orchestration Agents' (active runtime management)
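The planner/orchestrator distinction can be illustrated by running the same step list two ways against an executor that fails once. The planner-style runner follows a fixed sequence and stops at the first failure; the orchestrator-style runner watches each result and retries. Retrying is just one assumed form of runtime management chosen for brevity, and all function names here are hypothetical.

```python
from typing import Callable, List

Executor = Callable[[str], bool]  # returns True if the step succeeded


def make_flaky_executor(fail_once: str) -> Executor:
    """Build an executor that fails the named step on its first attempt only."""
    already_failed = set()

    def execute(step: str) -> bool:
        if step == fail_once and step not in already_failed:
            already_failed.add(step)
            return False
        return True

    return execute


def run_static_plan(plan: List[str], execute: Executor) -> List[str]:
    """Planner style: the sequence is fixed upfront; stop at the first failure."""
    completed = []
    for step in plan:
        if not execute(step):
            break
        completed.append(step)
    return completed


def run_orchestrated(plan: List[str], execute: Executor, max_retries: int = 1) -> List[str]:
    """Orchestrator style: monitor each step at runtime and retry on failure."""
    completed = []
    for step in plan:
        attempts = 0
        while not execute(step):
            attempts += 1
            if attempts > max_retries:
                return completed
        completed.append(step)
    return completed
```

With the plan `["navigate_to_table", "grasp_cup", "navigate_to_sink"]` and a grasp that fails once, the static run completes only the first step, while the orchestrated run completes all three.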

Reproducibility

The paper is a survey; it references external open-source projects (ROSA, RAI, ros2ai, ROS-MCP) but does not appear to release a new standalone codebase itself. The 'practical design toolkit' mentioned in the abstract (Section V) is not included in the provided text snippet.

📊 Experiments & Results

Evaluation Setup

Qualitative review and classification of academic papers and community projects from 2022 to 2025.

Benchmarks:

N/A (the paper is a literature survey)

Metrics: None reported (qualitative survey)

Statistical methodology: Not explicitly reported in the paper

Main Takeaways

Evolution of Interfaces: Interfaces have shifted from simple CLI translators (ros2ai, 2023) to state-aware agents (ROSA, 2024) and decentralized protocols (OpenMind/OM1, 2025).
Rise of MCP: The Model Context Protocol is emerging as a standard for tool use, allowing general assistants (like Claude) to plug into robotics stacks without custom middleware.
Architectural Divergence: A clear split exists between 'Planner' agents (generating static code/sequences, e.g., SayCan) and 'Orchestrator' agents (actively managing perception/execution loops, e.g., RAI).
Community Impact: Significant innovation is driven by GitHub-hosted projects (ROS-MCP, ROSA) rather than purely academic publications.
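For readers unfamiliar with how an assistant invokes a tool over MCP, the sketch below builds the request message. MCP messages follow JSON-RPC 2.0, and tool invocation uses the `tools/call` method with a tool `name` and an `arguments` object. The tool name `publish_twist` and its arguments are hypothetical; a real robotics MCP server advertises its own tool names and schemas.

```python
import json
from itertools import count
from typing import Any, Dict

_request_ids = count(1)  # JSON-RPC requests carry a unique id per session


def make_tool_call(name: str, arguments: Dict[str, Any]) -> Dict[str, Any]:
    """Build a JSON-RPC 2.0 request asking an MCP server to run one tool."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }


if __name__ == "__main__":
    # Hypothetical tool exposed by a robotics MCP server.
    request = make_tool_call("publish_twist", {"linear_x": 0.2, "angular_z": 0.0})
    print(json.dumps(request, indent=2))
```

Because the message format is generic, the assistant side needs no robot-specific code; the robot-specific logic lives entirely in the server that handles the call.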

📚 Prerequisite Knowledge

Prerequisites

Familiarity with Robot Operating System (ROS/ROS 2)
Understanding of Large Language Models (LLMs) and Vision-Language Models (VLMs)
Basic knowledge of agentic patterns (ReAct, Chain-of-Thought)

Key Terms

ROS: Robot Operating System—standard middleware for robot software development, managing communication between nodes (sensors, actuators, logic)

VLA: Vision-Language-Action models—foundation models trained to output low-level robot actions directly from visual and textual inputs (e.g., RT-2)

MCP: Model Context Protocol, an open standard allowing AI assistants (like Claude) to connect to external data and tools (like ROS nodes) via a client-host-server architecture

Agentic AI: Systems where AI models act as autonomous intermediaries that reason, plan, and invoke external tools or APIs to achieve goals, rather than just generating text

ReAct: Reasoning and Acting—a paradigm where LLMs interleave reasoning traces with executable actions/tool calls

FABRIC: A decentralized coordination protocol (used in OpenMind) for secure identity management and interoperability among heterogeneous robots

SFT: Supervised Fine-Tuning—training a model on labeled examples to adapt it to a specific task
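To make the ReAct entry above more concrete, the sketch below parses one model turn that interleaves a reasoning trace with a tool call. It assumes the model is prompted to emit `Thought: ...` followed by `Action: tool[argument]`, a text convention similar to the one used in ReAct-style prompts; the exact format varies between frameworks, and the tool name `navigate` is hypothetical.

```python
import re
from typing import NamedTuple, Optional


class Step(NamedTuple):
    thought: str   # free-text reasoning trace
    tool: str      # name of the tool to invoke
    argument: str  # raw argument string passed to the tool


# Assumed turn format: "Thought: <text>" followed by "Action: <tool>[<argument>]".
STEP_RE = re.compile(
    r"Thought:\s*(?P<thought>.*?)\s*Action:\s*(?P<tool>\w+)\[(?P<arg>.*?)\]",
    re.DOTALL,
)


def parse_step(text: str) -> Optional[Step]:
    """Extract the thought and tool call from one turn; None if malformed."""
    match = STEP_RE.search(text)
    if match is None:
        return None
    return Step(match["thought"], match["tool"], match["arg"])


if __name__ == "__main__":
    turn = "Thought: The cup is on the table.\nAction: navigate[table]"
    print(parse_step(turn))
```

Returning `None` for malformed turns lets the surrounding agent re-prompt the model rather than execute a half-parsed action.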