Pipeline-based paradigm: Building agents by chaining LLMs with external scripts, prompts, and modules (e.g., LangChain flows)
Model-native paradigm: Training a single model to internalize agentic behaviors (planning, memory, tools) via end-to-end learning
RLHF (Reinforcement Learning from Human Feedback): Aligning models using rewards derived from human preferences
PPO (Proximal Policy Optimization): A policy-gradient RL algorithm with a clipped update objective, widely used to fine-tune language models
DPO (Direct Preference Optimization): Optimizing models directly on preference data without a separate reward model
GRPO (Group Relative Policy Optimization): A lightweight RL method that computes each sample's advantage from its reward relative to the other samples drawn for the same prompt, with no separate value model
DAPO (Decoupled Clip and Dynamic Sampling Policy Optimization): An RL method, reported to improve multi-turn performance, that decouples the lower and upper clipping bounds and dynamically filters out sample groups whose rewards are all identical
CoT (Chain-of-Thought): Prompting models to generate intermediate reasoning steps
RAG (Retrieval-Augmented Generation): Retrieving external data to ground model generation
MDP (Markov Decision Process): A mathematical framework for modeling sequential decision-making in which outcomes are partly random and partly under the control of a decision maker
Deep Research agent: An agent designed for knowledge-intensive tasks like literature reviews, requiring long-horizon reasoning and synthesis
GUI agent: An agent designed to interact with Graphical User Interfaces (clicking, typing) to automate software tasks
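
Illustration for the GRPO entry: a minimal sketch of group-relative advantage computation, assuming scalar rewards for several responses sampled from the same prompt. The function name, the use of the population standard deviation, and the eps constant are illustrative assumptions, not taken from any particular implementation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own group (hypothetical helper).

    Each advantage is (reward - group mean) / (group std + eps), so no
    learned value model is needed to provide a baseline.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; eps avoids division by zero
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses to one prompt, scored 1.0 (correct) or 0.0 (incorrect).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A group with identical rewards produces all-zero advantages, i.e. no learning signal.
flat = group_relative_advantages([1.0, 1.0, 1.0])
```

In this sketch, responses scoring above the group mean receive positive advantages and the rest receive negative ones. A group whose rewards are all equal contributes nothing to the update.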