Agentic Knowledgeable Self-awareness

📝 Paper Summary

Self-evolving Agentic reasoning Agentic feedback mechanisms Knowledge-augmented planning

KnowSelf trains agents to autonomously recognize when to act fast, when to reflect, and when to query external knowledge, optimizing performance while minimizing unnecessary computation.

Core Problem

Current agent planning methods use a 'flood irrigation' approach, indiscriminately injecting knowledge or feedback regardless of necessity, which is inefficient and leads to pattern collapse.

Why it matters:

Blindly invoking external knowledge or reflection for every step significantly increases inference latency and computational cost
Agents trained via direct imitation often memorize patterns rather than reasoning, becoming fragile to unexpected environmental signals
Humans dynamically assess whether they need help or extra thought; lack of this metacognition makes agents inefficient compared to human decision-making

Concrete Example: In a task like 'put a clean egg in microwave', a standard agent might repeatedly try to pick up an egg without realizing it needs cleaning first. An agent relying solely on knowledge might retrieve irrelevant cooking instructions. KnowSelf, however, first tries to act; if it fails, it reflects ('I need to clean it'), or if reflection fails, it queries knowledge ('To obtain a cleaned object, find it then clean it').

Key Novelty

KnowSelf: Data-Centric Situational Self-Awareness

Classifies decision states into three tiers based on agent capability: Fast Thinking (confident action), Slow Thinking (needs reflection), and Knowledgeable Thinking (needs external help)
Constructs training data by marking self-explored trajectories with special tokens (<r> for reflection, <k> for knowledge) based on whether the agent's initial or reflected prediction was correct
Uses a two-stage training process (SFT + RPO) to teach the agent to self-generate these tokens, effectively deciding its own inference path dynamically

Architecture

Overview of the KnowSelf framework, covering Data Construction, Learning, and Inference. The figure illustrates how trajectories are marked with <r> (Reflection) and <k> (Knowledge) tokens and how the model switches between Fast, Slow, and Knowledgeable thinking during inference.

Evaluation Highlights

KnowSelf with Llama-8B outperforms GPT-4o-based ExpeL on ALFWorld, with only 15.01% of actions requiring external knowledge (vs 100% for ExpeL)
Achieves 91.67% success on 'Put' tasks in ALFWorld with Gemma-2B, surpassing the 0% success rate of the standard ReAct baseline
Demonstrates strong generalization: models trained only on simple tasks (Put, Clean, Examine) achieve >70% success on unseen complex tasks (Heat, Cool, PutTwo), whereas baselines like KnowAgent fail completely (0%)

Breakthrough Assessment

8/10

Significant efficiency gain by making knowledge retrieval dynamic rather than static. Effectively addresses the trade-off between performance and inference cost in agentic systems.

⚙️ Technical Details

Problem Definition

Setting: Partially Observable Markov Decision Process (POMDP) where an agent decides the next action based on history and observation

Inputs: Historical interaction trajectory h_t = (u, a_0, o_0, ...), where u is the task instruction and a_i, o_i are the action and observation at step i

Outputs: Next action a_{t+1}, potentially preceded by special tokens for reflection or knowledge retrieval

Pipeline Flow

Input History Processing
Situational Judgment (Internal)
Branching: Direct Action OR Reflection OR Knowledge Retrieval
Action Generation
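The four steps above can be sketched as a single decision function. This is a minimal illustration, not the authors' implementation: `next_action`, `toy_generate`, and `toy_select` are hypothetical names, the model and knowledge selector are replaced by stubs, and the convention that the final action is the last non-empty line of the output is an assumption made only for this example.

```python
from typing import Callable, Tuple

REFLECT, KNOW = "<r>", "<k>"  # special tokens named in the summary


def next_action(
    generate: Callable[[str], str],
    select_knowledge: Callable[[str], str],
    history: str,
) -> Tuple[str, str]:
    """Return (mode, action) for one decision step."""
    output = generate(history)
    if KNOW in output:
        # Knowledgeable thinking: fetch knowledge, then generate again with it in context.
        knowledge = select_knowledge(history)
        output = generate(f"{history}\n{KNOW} {knowledge}")
        mode = "knowledgeable"
    elif REFLECT in output:
        # Slow thinking: the output already holds a reflection and a revised action.
        mode = "slow"
    else:
        # Fast thinking: the output is the action itself.
        mode = "fast"
    # Assumption: the final action is the last non-empty line of the output.
    action = [ln for ln in output.splitlines() if ln.strip()][-1].strip()
    return mode, action


def toy_generate(context: str) -> str:
    """Hypothetical stand-in for the fine-tuned agent model."""
    if KNOW in context:
        return "clean egg with sinkbasin"
    if "Nothing happens" in context:
        return KNOW
    return "take egg from fridge"


def toy_select(history: str) -> str:
    """Hypothetical stand-in for the knowledge selector."""
    return "To obtain a cleaned object, find it then clean it."
```

The branching lives entirely in what the model emits, so the surrounding loop only needs to watch for the two tokens; no external trigger decides when to reflect or retrieve.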

System Modules

Agent Model

Predicts next token; determines if action can be generated directly or if special tokens <r>/<k> are needed

Model or implementation: Llama-8B or Gemma-2B (fine-tuned)

Knowledge Selector

Selects relevant knowledge from the knowledge base when the agent generates the <k> token

Model or implementation: Retriever (R)

Environment

Executes the action and returns the next observation

Model or implementation: ALFWorld / WebShop simulator

Novel Architectural Elements

Situational token generation mechanism: The model itself generates <r> or <k> to switch inference modes (Fast/Slow/Knowledgeable) within a single autoregressive generation process
Three-tier cognitive architecture embedded into the LLM's vocabulary and training data via heuristic labeling of self-explored trajectories

Modeling

Base Model: Llama-3.1-8B-Instruct and Gemma-2-2b-it

Training Method: Two-stage training: Supervised Fine-Tuning (SFT) followed by Relative Preference Optimization (RPO)

Objective Functions:

Purpose: Teach initial self-awareness patterns.

Formally: L_SFT = -E[log π_θ(y|h_t)] on self-awareness data.

Purpose: Boost self-awareness by contrasting correct vs. incorrect situational judgments.

Formally: L_DPO term penalizing incorrect predictions relative to reference model.

Purpose: Stabilize training in narrow action spaces during preference learning.

Formally: L_NLL term added to DPO, resulting in L_RPO = L_DPO + α * L_NLL.
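To make the combined objective concrete, the sketch below computes L_RPO = L_DPO + α * L_NLL for a single preference pair from sequence log-probabilities. It uses the standard DPO formulation; the function names are hypothetical, and normalising the NLL term by the length of the chosen sequence is an assumption, since the summary does not say how L_NLL is normalised. The defaults for beta and alpha are the values listed under Key Hyperparameters.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(pol_w: float, pol_l: float, ref_w: float, ref_l: float,
             beta: float = 0.5) -> float:
    """Standard DPO loss for one (chosen, rejected) pair.

    pol_* / ref_* are summed log-probabilities of the chosen (w) and
    rejected (l) outputs under the policy and the frozen reference model.
    """
    margin = beta * ((pol_w - ref_w) - (pol_l - ref_l))
    return -log_sigmoid(margin)


def rpo_loss(pol_w: float, pol_l: float, ref_w: float, ref_l: float,
             len_w: int, beta: float = 0.5, alpha: float = 1.0) -> float:
    """L_RPO = L_DPO + alpha * L_NLL on the chosen output."""
    # Assumption: NLL is averaged over the tokens of the chosen output.
    nll = -pol_w / len_w
    return dpo_loss(pol_w, pol_l, ref_w, ref_l, beta) + alpha * nll
```

When the policy equals the reference model the DPO margin is zero and L_DPO is log 2, so at the start of stage two the NLL term is the only part of the loss that depends on how likely the chosen output is.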

Training Data:

Self-exploration on training tasks to collect trajectories
Heuristic labeling: If prediction correct -> Fast; If wrong but fixable via rethink -> Slow; If wrong after rethink -> Knowledgeable
Negative samples collected for DPO by letting the reference agent explore and fail
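The heuristic labeling rule can be sketched as follows. This is an illustrative reading of the rule as stated above, not the released code: `Step`, `label_step`, `predict`, and `rethink` are hypothetical names, and exact string match against a gold action is an assumed correctness check.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    history: str      # interaction history up to this step
    gold_action: str  # correct next action from the training trajectory


def label_step(
    step: Step,
    predict: Callable[[str], str],
    rethink: Callable[[str, str], str],
) -> str:
    """Assign one of the three situation labels to a step."""
    first = predict(step.history)
    if first == step.gold_action:
        return "fast"  # correct on the first try: no special token
    second = rethink(step.history, first)
    if second == step.gold_action:
        return "slow"  # fixed by reflection: marked with <r>
    return "knowledgeable"  # still wrong after reflection: marked with <k>
```

Because the labels come from the agent's own predictions, the same step can receive different labels for different base models, which ties the training data to the capability of the model being trained.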

Key Hyperparameters:

learning_rate_stage_1: 2e-5
learning_rate_stage_2: 5e-7
batch_size_stage_1: 8
batch_size_stage_2: 3
beta_dpo: 0.5
alpha_rpo: 1
epochs_stage_1: 3
epochs_stage_2: 1
temperature: 0

Compute: 8 NVIDIA A800 80G GPUs

Comparison to Prior Work

vs. ExpeL/KnowAgent: KnowSelf selectively uses knowledge (approx 15-26% of steps) rather than 'flood irrigation' at every step
vs. Reflexion: KnowSelf internalizes the decision to reflect rather than relying on an external loop or trigger
vs. ReAct: ReAct forces reasoning at every step; KnowSelf allows 'Fast Thinking' (skipping reasoning) for simple steps

Limitations

Experiments limited to simulated environments (ALFWorld, WebShop) and do not cover real-world robotics or code generation
Tested only on smaller open-source models (2B, 8B); scaling to 70B+ is not explored
Reliance on a heuristic criterion for labeling data, which might not perfectly capture true cognitive states
Knowledge retrieval uses a simple selection mechanism; more complex retrieval could affect results

Reproducibility

Code: https://github.com/zjunlp/KnowSelf

The prompt for rethinking and the knowledge construction details are provided in the paper's appendix. Knowledge base construction is offline and lightweight.

📊 Experiments & Results

Evaluation Setup

Agentic planning in simulated environments

Benchmarks:

ALFWorld (Embodied instruction following in household tasks)
WebShop (Online shopping navigation)

Metrics:

Average Reward (Success Rate for ALFWorld, Score 0-1 for WebShop)
Knowledge Rate (Know%) - percentage of actions utilizing external knowledge
Statistical methodology: Not explicitly reported in the paper
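As a small illustration of the Know% metric, the function below computes the share of actions that used external knowledge from a list of per-action modes. The function name and the mode labels are hypothetical; only the definition (percentage of actions utilizing external knowledge) comes from the summary.

```python
from typing import Sequence


def knowledge_rate(modes: Sequence[str]) -> float:
    """Percentage of actions whose mode is 'knowledgeable'."""
    if not modes:
        return 0.0
    used = sum(1 for m in modes if m == "knowledgeable")
    return 100.0 * used / len(modes)
```

For example, an episode set with 3 knowledgeable actions out of 20 has a Know% of 15, whereas a method that injects knowledge at every step always scores 100.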

Key Results

Main results on ALFWorld showing KnowSelf outperforms baselines while using significantly less knowledge.
Benchmark	Metric	Baseline	This Paper	Δ
ALFWorld	Average Reward	79.85	84.33	+4.48
ALFWorld	Average Reward	27.61	84.33	+56.72
ALFWorld	Average Reward	8.96	79.85	+70.89
WebShop	Average Reward	57.65	67.14	+9.49
WebShop	Average Reward	21.63	63.65	+42.02

Ablation and generalization studies demonstrate the mechanism's robustness.
Benchmark	Metric	Baseline	This Paper	Δ
ALFWorld	Average Reward	67.0	84.33	+17.33
ALFWorld	Average Reward	78.0	84.33	+6.33

Experiment Figures

Ablation results, Generalization capabilities, and Scaling Law analysis.

Layer-wise probability analysis of token generation for [Knowledge], [Reflection], and Action tokens.

Main Takeaways

KnowSelf outperforms baselines that inject knowledge at every step (such as ExpeL and KnowAgent) while invoking external knowledge for only 15-26% of actions, suggesting that 'flood irrigation' of knowledge is inefficient and potentially harmful.
Smaller models (Gemma-2B) benefit massively from this paradigm, achieving performance comparable to or exceeding GPT-4o baselines on specific tasks.
Generalization is significantly improved: KnowSelf maintains performance on unseen task types (e.g., Heat, Cool) where baselines like ETO and KnowAgent collapse to near-zero success.
Mechanism analysis reveals that the decision to invoke knowledge emerges in the final layers of the Transformer, suggesting an internal 'game-like' process where the model evaluates confidence.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Agentic Planning (ReAct, Reflexion)
Knowledge of Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO)
Familiarity with POMDPs in embodied environments

Key Terms

Fast Thinking: Situation where the agent can directly generate the correct action without internal monologue or external help

Slow Thinking: Situation where the agent initially predicts incorrectly but can correct itself through self-reflection

Knowledgeable Thinking: Situation where the agent fails even after reflection and requires external knowledge to proceed

RPO: Relative Preference Optimization—a loss function combining DPO with a negative log-likelihood (NLL) term to stabilize training in narrow action spaces

DPO: Direct Preference Optimization—an algorithm aligning language models to preferences without a separate reward model

SFT: Supervised Fine-Tuning—training a model on labeled examples

POMDP: Partially Observable Markov Decision Process—a mathematical framework for modeling decision-making where the agent cannot see the entire state of the environment

ReAct: Reasoning and Acting—a paradigm where agents generate reasoning traces before executing actions

ExpeL: Experiential Learning, a baseline method where agents learn from past experiences/trajectories

Pattern Collapse: A failure mode where a model blindly follows learned sequences (patterns) rather than reasoning about the specific context

Scaling Law: The observation that model performance typically improves as model size, data size, or compute increases