Beyond Pipelines: A Survey of the Paradigm Shift toward Model-Native Agentic AI

📝 Paper Summary

Model-native Agentic AI · Agentic Planning, Tool Use, and Memory · Reinforcement Learning for Agents

Agentic AI is shifting from brittle, externally engineered pipelines to model-native systems where planning, tool use, and memory are internalized behaviors learned through end-to-end reinforcement learning.

Core Problem

Pipeline-based agents rely on rigid, handcrafted workflows and external modules (e.g., separate planners, retrievers) that make systems brittle, hard to scale, and unable to adapt to dynamic environments.

Why it matters:

External pipelines treat the LLM as a passive tool rather than a proactive decision-maker, limiting autonomy
Pre-scripted execution logic fails when agents face unforeseen circumstances or changing interface states (non-stationarity)
Step-by-step supervision for complex agentic tasks is prohibitively expensive to annotate compared to outcome-driven learning

Concrete Example: In a pipeline-based GUI agent like AppAgent, the workflow is orchestrated by fixed XML view-hierarchy parsing. If the UI layout changes unexpectedly (e.g., a pop-up appears that isn't in the script), the rigid pipeline fails because the model hasn't learned to perceive and adapt to the new state autonomously.
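
The failure mode in this example can be illustrated with a toy sketch. It is not AppAgent's actual code: the `SCRIPT` table, the screen labels, and `run_pipeline` are hypothetical names chosen only to show how a fixed script breaks on an unanticipated state.

```python
# Hypothetical, simplified illustration; not AppAgent's implementation.
# Each entry pairs the screen the script expects with the action to take there.
SCRIPT = [
    ("home", "tap_search"),
    ("search", "type_query"),
    ("results", "tap_first_result"),
]


def run_pipeline(observed_screens):
    """Replay the fixed script; stop at the first screen it did not anticipate."""
    actions = []
    for (expected, action), observed in zip(SCRIPT, observed_screens):
        if observed != expected:
            # The script has no branch for this state, so execution halts here.
            return actions, f"failed: expected '{expected}', saw '{observed}'"
        actions.append(action)
    return actions, "ok"
```

Calling `run_pipeline(["home", "search", "results"])` completes all three actions, while `run_pipeline(["home", "popup", "search", "results"])` stops after the first action because the pop-up is not in the script. A model-native agent would instead be trained to choose an action from whatever state it observes.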

Key Novelty

The Model-Native Agent Paradigm (LLM + RL + Task)

Reframes the agent as a single unified model that internalizes capabilities (planning, tool use, memory) in its parameters rather than delegating them to external modules
Uses outcome-driven Reinforcement Learning (RL) to train models to explore and discover strategies (like 'thinking' or tool invocation) without needing expensive step-by-step human labels

Architecture

The paradigm shift from Pipeline-based to Model-native Agentic AI, contextualized by the evolution of RL algorithms.

Evaluation Highlights

DeepSeek-R1 demonstrates that outcome-based RL is sufficient to train reasoning and planning behaviors without step-by-step supervision
OpenAI's o3 model internalizes tool use into the reasoning process, learning when to invoke tools as part of its policy rather than via external prompts
The Tongyi DeepResearch model executes complex, multi-step research tasks in dynamic web environments by internalizing the research strategy

Breakthrough Assessment

9/10

Captures a fundamental shift in the field. The transition from 'engineering flows' to 'training flows' via RL represents the next major phase of agentic AI scaling.

⚙️ Technical Details

Problem Definition

Setting: Markov Decision Process (MDP) in which the agent observes a state, generates actions (text or decisions), and receives outcome-based rewards

Inputs: Task context (state) and dynamic environment feedback

Outputs: Policy πθ that maximizes expected cumulative reward over the task horizon
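
The setting above can be sketched as a tiny episodic environment with an outcome-only reward. Everything here is an illustrative assumption rather than something taken from the survey: the `ToyTaskEnv` class, its `reset`/`step` interface, the target action sequence, and the two example policies.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

State = Tuple[str, ...]


@dataclass
class ToyTaskEnv:
    """Hypothetical task: emit the actions in `target`, in order, within `horizon` steps."""
    target: State = ("plan", "call_tool", "answer")
    horizon: int = 5
    history: List[str] = field(default_factory=list)

    def reset(self) -> State:
        self.history = []
        return tuple(self.history)  # state = actions taken so far

    def step(self, action: str):
        self.history.append(action)
        solved = tuple(self.history) == self.target
        done = solved or len(self.history) >= self.horizon
        reward = 1.0 if solved else 0.0  # outcome-based: no per-step credit
        return tuple(self.history), reward, done


def rollout(env: ToyTaskEnv, policy: Callable[[State], str]) -> float:
    """Run one episode and return the cumulative reward."""
    state, total, done = env.reset(), 0.0, False
    while not done:
        state, reward, done = env.step(policy(state))
        total += reward
    return total


def scripted_policy(state: State) -> str:
    """Stand-in for a learned policy: picks the next action from the step count."""
    return ("plan", "call_tool", "answer")[len(state) % 3]


def always_answer(state: State) -> str:
    """A policy that never plans or calls a tool."""
    return "answer"
```

Here `rollout(ToyTaskEnv(), scripted_policy)` earns the full reward of 1.0, while `rollout(ToyTaskEnv(), always_answer)` runs to the horizon and earns 0.0. The reward arrives only when the whole task is solved, which is what makes the signal outcome-based rather than step-supervised.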

Pipeline Flow

Base Model Initialization
Environment Interaction (Exploration)
Reinforcement Learning Update
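
The three stages above can be sketched on a toy problem. This is a minimal illustration using plain REINFORCE on a three-action bandit, not the GRPO or DAPO training used by the surveyed systems. The action set, reward rule, learning rate, and step count are all assumptions made for the example.

```python
import math
import random


def softmax(logits):
    """Convert logits to probabilities (shifted by the max for numerical stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def train(num_steps=2000, lr=0.1, seed=0):
    rng = random.Random(seed)

    # 1. Base model initialization: a uniform policy over three hypothetical actions.
    logits = [0.0, 0.0, 0.0]
    correct_action = 2  # assumption: only this action earns a reward of 1

    for _ in range(num_steps):
        # 2. Environment interaction: sample an action and observe the outcome.
        probs = softmax(logits)
        action = rng.choices(range(3), weights=probs)[0]
        reward = 1.0 if action == correct_action else 0.0

        # 3. RL update (REINFORCE): for a softmax policy, the gradient of
        #    log pi(action) with respect to logit k is 1[k == action] - probs[k].
        for k in range(3):
            grad_log_prob = (1.0 if k == action else 0.0) - probs[k]
            logits[k] += lr * reward * grad_log_prob

    return softmax(logits)
```

Calling `train()` returns the final action probabilities, with most of the mass on the rewarded action. In a real model-native agent the logits would be replaced by LLM parameters and the bandit by a task environment, but the loop has the same three stages.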

System Modules

Base LLM

The unified policy model that generates thoughts, plans, and actions

Model or implementation: Various (e.g., DeepSeek-R1, o3, Qwen-2.5-1M)

Environment / Task

Provides feedback and rewards based on the outcome of the agent's actions

Model or implementation: Task-specific simulation (e.g., Web browser, GUI simulator)

RL Optimizer

Updates the model parameters to maximize cumulative reward

Model or implementation: Algorithms like GRPO, DAPO

Novel Architectural Elements

Unified 'Model-native' architecture where Planning, Tool Use, and Memory are parameter-internalized behaviors rather than distinct software modules
Transition from 'Pipeline' (LLM + P + Tools) to 'End-to-End' (LLM + RL + Task) optimization loop

Modeling

Base Model: Varies by implementation (e.g., DeepSeek-R1, Qwen-2.5, Llama-3)

Training Method: Outcome-driven Reinforcement Learning (RL)

Objective Functions:

Purpose: Maximize expected cumulative reward through exploration.

Formally: Maximize E[sum(rewards)] via policy gradient methods.
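
One standard way to write this objective and its gradient is the undiscounted, finite-horizon form below. The symbols τ (a trajectory), T (the horizon), and R(τ) (the trajectory return) are notation introduced here for illustration; the surveyed methods add baselines, clipping, or group normalization on top of this basic estimator.

```latex
J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ R(\tau) \right],
\qquad
R(\tau) = \sum_{t=0}^{T} r_t

\nabla_\theta J(\theta)
  = \mathbb{E}_{\tau \sim \pi_\theta}\left[
      \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, R(\tau)
    \right]
```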

Adaptation: Full model training or parameter-efficient fine-tuning, depending on the specific paper cited

Compute: Not reported in the paper

Comparison to Prior Work

vs. ReAct/AutoGPT: Internalizes the reasoning/acting loop into model weights via RL rather than prompting
vs. AppAgent: Learns to perceive and act from raw visual/UI context end-to-end rather than relying on external XML parsers
vs. Traditional RAG: Moves toward MemoryLLM-style approaches where memory is parameterized and updated in the forward pass rather than only retrieved

Limitations

Outcome-driven RL may amplify hallucinations by rewarding spurious correlations in open-ended tasks (like research)
Defining reward functions for subjective tasks (e.g., 'insightfulness' of a report) is inherently difficult
GUI environments are non-stationary (updates, dynamic layouts), making reuse of collected trajectories difficult
High computational cost of large-scale RL exploration compared to supervised fine-tuning

Reproducibility

Code: https://github.com/ADaM-BJTU/model-native-agentic-ai

The linked repository is a curated list of the reviewed papers rather than an implementation. Because the survey reviews existing work, code for the surveyed models is often closed-source (e.g., OpenAI o3), although open-weight models such as DeepSeek-R1 and Qwen are also covered.

📊 Experiments & Results

Evaluation Setup

Survey paper evaluating the field's progression; references performance on benchmarks from cited papers.

Benchmarks:

Deep Research Tasks (Long-horizon information retrieval and synthesis)
GUI Automation Tasks (Web and Mobile UI interaction)

Metrics:

Success Rate
Pass@k
Long-horizon consistency

Statistical methodology: Not explicitly reported in the paper

Main Takeaways

The field is moving from 'Pipeline-based' (System 1) to 'Model-native' (System 2) architectures driven by RL.
RL enables models to discover strategies that do not exist in human-labeled SFT data, crucial for complex reasoning.
Agent applications are bifurcating into 'Brain' (Deep Research) and 'Hands/Eyes' (GUI Agents), both converging on model-native training.
Memory is evolving from simple vector retrieval to parameterized knowledge that updates during inference.

📚 Prerequisite Knowledge

Prerequisites

Fundamentals of Large Language Models (LLMs)
Reinforcement Learning (RL) basics (policy, reward, MDP)
Familiarity with agent architectures (ReAct, RAG)

Key Terms

Pipeline-based paradigm: Building agents by chaining LLMs with external scripts, prompts, and modules (e.g., LangChain flows)

Model-native paradigm: Training a single model to internalize agentic behaviors (planning, memory, tools) via end-to-end learning

RLHF: Reinforcement Learning from Human Feedback—aligning models using rewards derived from human preferences

PPO: Proximal Policy Optimization—a standard RL algorithm used to fine-tune language models

DPO: Direct Preference Optimization—optimizing models directly on preference data without a separate reward model

GRPO: Group Relative Policy Optimization—a lightweight RL method that computes advantages based on relative rewards within a group of samples

DAPO: Decoupled Clip and Dynamic Sampling Policy Optimization, an RL method that decouples the lower and upper clipping bounds on the policy ratio and dynamically resamples prompts whose sampled responses provide no learning signal

CoT: Chain-of-Thought—prompting models to generate intermediate reasoning steps

RAG: Retrieval-Augmented Generation—fetching external data to ground model generation

MDP: Markov Decision Process—a mathematical framework for modeling decision-making where outcomes are partly random and partly under the control of a decision maker

Deep Research agent: An agent designed for knowledge-intensive tasks like literature reviews, requiring long-horizon reasoning and synthesis

GUI agent: An agent designed to interact with Graphical User Interfaces (clicking, typing) to automate software tasks
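
As a sketch of the GRPO entry in the key terms above, the group-relative advantage can be computed by normalizing each sampled response's reward against the mean and standard deviation of its own group. The function name, the `eps` stabilizer, and the use of the population standard deviation are assumptions made for this example; implementations differ on these details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its own sample group.

    `rewards` holds one scalar outcome reward per response sampled for the
    same prompt. `eps` (an assumption) avoids division by zero when all
    rewards in the group are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use the sample std
    return [(r - mu) / (sigma + eps) for r in rewards]
```

For a group with rewards `[1.0, 0.0, 0.0, 1.0]`, the successful responses get advantages close to +1 and the failed ones close to -1. When every response in a group receives the same reward, all advantages are zero and the group contributes no gradient.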