Agent-Omit: Training Efficient LLM Agents for Adaptive Thought and Observation Omission via Agentic Reinforcement Learning

📝 Paper Summary

Agentic AI, Observation Management (OM), Thought Management (TM)

Agent-Omit trains LLM agents to autonomously decide when to skip reasoning steps or discard past observations during multi-turn interactions, balancing efficiency with task success.

Core Problem

Existing LLM agents generate redundant reasoning thoughts for simple actions and accumulate excessive observation contexts over long trajectories, wasting tokens and slowing inference.

Why it matters:

Thought and observation tokens dominate agent costs (e.g., ~97% of tokens in WebShop), while actual actions account for only ~3%
Prior methods compress trajectories equally or use static heuristics, failing to recognize that the utility of thoughts and observations varies significantly across different turns
Long contexts with irrelevant observations act as noise that degrades performance in later turns

Concrete Example: In a shopping task, an agent might generate detailed reasoning for a simple 'click next' action, or retain outdated search results from Turn 1 that are irrelevant to the final answer in Turn 10. Agent-Omit detects this and outputs an empty thought or an omission command.

Key Novelty

Adaptive Omission via Agentic RL

Treats 'omission' as a learnable action: agents learn to output empty thoughts or specific command tokens to prune history based on the current context's necessity
Uses a dual-sampling RL strategy that learns from both full trajectories (for final success) and partial trajectories (to learn omission decisions given the pre-omission context)

Architecture

The Agent-Omit framework, showing the two-stage process: (a) synthesizing cold-start data (identifying removable turns via rollouts), and (b) Omit-Aware Agentic RL training with dual sampling.

Evaluation Highlights

Agent-Omit-8B achieves comparable accuracy to frontier models like DeepSeek-R1 and o3 on 5 benchmarks while substantially reducing token costs
Outperforms 7 efficient agent baselines (e.g., ToolLight, MEM-Agent) in effectiveness-efficiency trade-offs when applied to Qwen3-8B
Adaptively omits 3-4 rounds of thoughts/observations per task on average, primarily in intermediate turns where redundancy is highest

Breakthrough Assessment

8/10

Strong methodological contribution by formalizing omission as a policy learned via RL rather than heuristics. Addresses a critical efficiency bottleneck in agentic workflows with solid theoretical backing (KL bounds) and empirical results.

⚙️ Technical Details

Problem Definition

Setting: Multi-turn agent-environment interaction optimization where the agent must maximize task reward while minimizing token usage

Inputs: Task question q and interaction history {τ1, a1, o1, ...}

Outputs: Thought τt (potentially empty), Action at, and Omission Set Γt (indices of past observations to remove)

Pipeline Flow

Input Processing (History + Question)
Adaptive Generation (Thought/Action/Omission)
Environment Interaction & Context Update

System Modules

Agent Policy

Generates thought, action, and omission commands

Model or implementation: Qwen3-8B (Agent-Omit-8B)

Context Manager

Updates the interaction history based on omission commands

Model or implementation: Rule-based logic
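
The rule-based context update described above can be illustrated with a minimal sketch. It assumes the omission set has already been extracted as turn indices; the `Turn` class, the `OMITTED` placeholder string, and the `apply_omissions` helper are hypothetical names chosen for illustration and are not taken from the paper or its code.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    thought: str       # may be empty when the agent skips its reasoning
    action: str
    observation: str


# Hypothetical placeholder text; the paper's actual replacement format is not given here.
OMITTED = "[omitted]"


def apply_omissions(history: list[Turn], omit_indices: set[int]) -> list[Turn]:
    """Return a new history in which the selected observations are replaced by a placeholder."""
    updated = []
    for i, turn in enumerate(history):
        observation = OMITTED if i in omit_indices else turn.observation
        updated.append(Turn(turn.thought, turn.action, observation))
    return updated
```

Returning a new list instead of mutating the history keeps the pre-omission context available, which is convenient if that context is needed later for training.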

Novel Architectural Elements

Integration of omission commands directly into the agent's action space (e.g., <omit_tool_response...>)
Dual-sampling mechanism during RL: samples 'partial trajectories' (which keep the pre-omission context) alongside full trajectories, addressing the context-change problem in which the context that motivated an omission is no longer visible once the omission has been applied
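
One way to picture the dual-sampling idea is the sketch below. It is an assumption about how the two sample types could be collected, not the paper's implementation: the `Step` record, the `build_dual_samples` function, and the dictionary layout are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Step:
    context_before: list[str]  # what the agent saw when producing this step
    output: str                # thought, action, and any omission command
    issued_omission: bool


def build_dual_samples(steps: list[Step]) -> list[dict]:
    """Collect one full-trajectory sample plus one partial sample per omission turn."""
    # Full sample: the whole rollout, credited with the final task outcome.
    samples = [{"kind": "full", "steps": steps}]
    for t, step in enumerate(steps):
        if step.issued_omission:
            # Partial sample: the omission decision paired with the context
            # that was visible *before* the omission took effect.
            samples.append({
                "kind": "partial",
                "turn": t,
                "context": list(step.context_before),
                "target": step.output,
            })
    return samples
```

In this reading, the partial samples let the policy gradient for an omission decision be computed against the context the agent actually saw, while the full sample carries the final success signal.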

Modeling

Base Model: Qwen3-8B

Training Method: Omit-Aware Agentic Reinforcement Learning (Omit-Aware RL)

Objective Functions:

Purpose: Maximize expected reward balancing task success and token reduction.

Formally: Maximize E[r(y)] subject to a KL constraint.

Purpose: Penalize deviations from the reference policy.

Formally: β * KL(πθ || πref)

Purpose: Reward token savings only if the task is successful.

Formally: R_omit = (Tok_total - Tok_used) / Tok_total * I(R_task > 0)
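
The omission reward can be written as a short function. This sketch assumes `Tok_total` is the token count the trajectory would have used without any omission and `Tok_used` is the count actually consumed; the summary does not define either term, and how this reward is combined with the task reward is not shown here.

```python
def omission_reward(tok_total: int, tok_used: int, task_reward: float) -> float:
    """Fraction of tokens saved, granted only when the task reward is positive."""
    if tok_total <= 0:
        raise ValueError("tok_total must be positive")
    saved_fraction = (tok_total - tok_used) / tok_total
    # Indicator I(R_task > 0): a failed task earns no omission reward.
    return saved_fraction if task_reward > 0 else 0.0
```

For example, a successful trajectory that uses 600 of 1000 tokens earns a reward of 0.4, while the same savings on a failed task earn 0.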

Trainable Parameters: Full parameter fine-tuning

Training Data:

Synthetic 'cold-start' data created by identifying redundant turns via Monte Carlo rollouts
Includes single-turn omission samples (learning format) and multi-turn omission samples (learning continuity under missing context)
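
Identifying redundant turns via Monte Carlo rollouts might look like the sketch below. The procedure is an assumption about the general approach: `find_removable_turns`, the `rollout_success` callback, and the `n_rollouts` and `tolerance` parameters are hypothetical, and the paper's actual criterion may differ.

```python
from typing import Callable


def find_removable_turns(
    num_turns: int,
    rollout_success: Callable[[frozenset[int]], bool],
    n_rollouts: int = 8,
    tolerance: float = 0.0,
) -> list[int]:
    """Mark a turn removable if omitting it does not lower the estimated success rate."""

    def success_rate(omitted: frozenset[int]) -> float:
        # Monte Carlo estimate: run several rollouts with the given turns omitted.
        return sum(rollout_success(omitted) for _ in range(n_rollouts)) / n_rollouts

    baseline = success_rate(frozenset())
    removable = []
    for t in range(num_turns):
        if success_rate(frozenset({t})) >= baseline - tolerance:
            removable.append(t)
    return removable
```

Here `rollout_success` stands in for running the agent in the environment with the listed turns omitted and reporting whether the task succeeded. Each turn is tested in isolation, so the cost grows as `num_turns * n_rollouts` rollouts per task.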

Key Hyperparameters:

reward_reweighting_mu: 0.2

Compute: Not reported in the paper

Comparison to Prior Work

vs. ToolLight/DEPO: Optimizes both thought AND observation, not just thought length
vs. MEM-Agent: Learns adaptive omission policy via RL rather than relying on a separate summarizer model
vs. Observation-Mask: Policy-driven dynamic omission rather than static heuristics
vs. ReSum: Selectively omits turns based on utility rather than compressing the entire trajectory equally

Limitations

Depends on synthetic cold-start data quality derived from rollouts
Omission reward is zero if task fails, potentially making exploration difficult in hard tasks
Evaluated primarily on 8B-scale models; scaling behavior to larger models is not explicitly detailed

Reproducibility

Code: https://github.com/usail-hkust/Agent-Omit

Code and data are publicly available at the repository above. Synthetic data construction relies on Monte Carlo rollouts, which may be compute-intensive to replicate.

📊 Experiments & Results

Evaluation Setup

Agentic tasks across search, shopping, and embodied environments

Benchmarks:

DeepSearch (Information Seeking)
WebShop (Web Navigation/Shopping)
TextCraft (Digital Game, Crafting)
BabyAI (Embodied Decision Making)
SciWorld (Scientific Discovery)

Metrics:

Success Rate / Accuracy
Token Cost (Total Tokens)
Effectiveness-Efficiency Trade-off
Statistical methodology: Monte Carlo rollouts used for analysis; significance tests for main results not explicitly reported

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
WebShop	Token Distribution (Thought)	100	45.1	N/A
Average across 5 benchmarks	Performance	Not reported in the paper	Not reported in the paper	Not reported in the paper

Agent-Omit consistently achieves high accuracy comparable to frontier models while reducing costs.

Experiment Figures

Analysis of token costs and utility on WebShop. (a) Token breakdown (Thought/Action/Obs). (b) Cost evolution over turns. (c) Accuracy contribution (Pass@8) of thoughts/observations at each turn.

Controlled intervention results: Accuracy vs Token Cost when explicitly omitting thoughts or observations at specific turns (1-8).

Main Takeaways

Thoughts are 'front-loaded' (crucial in early turns for planning), while observations stack linearly, becoming a burden in later turns.
Selective omission in intermediate turns reduces tokens without hurting accuracy; early/late turn omission is harmful.
Agent-Omit successfully learns to mimic these optimal omission patterns (omitting 3-4 intermediate turns) autonomously.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (RL) fundamentals (policy, reward, trajectories)
LLM Agent architectures (Thought-Action-Observation loops)
Supervised Fine-Tuning (SFT)

Key Terms

Agentic Reinforcement Learning: A paradigm where agents improve their decision-making policies through interaction with an environment and feedback (rewards)

Pass@8: A metric measuring the probability that at least one correct solution is generated out of 8 attempts
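
As a rough illustration, an empirical Pass@k over a set of tasks can be computed as below. The `pass_at_k` function and its input layout are hypothetical and show only the simple "at least one success in k attempts" reading of the definition.

```python
def pass_at_k(results: list[list[bool]], k: int = 8) -> float:
    """Share of tasks solved at least once within the first k attempts.

    results[i] holds the success flags of the attempts made on task i.
    """
    if not results:
        raise ValueError("results must contain at least one task")
    solved = sum(any(attempts[:k]) for attempts in results)
    return solved / len(results)
```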

KL-divergence: Kullback-Leibler divergence—a statistical distance measuring how one probability distribution differs from a reference distribution

GRPO: Group Relative Policy Optimization—a reinforcement learning algorithm that stabilizes training by normalizing advantages within a group of samples
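
The group normalization at the heart of GRPO can be sketched as follows. This is a simplified illustration: the function name is hypothetical, and the use of the population standard deviation and the `eps` constant are assumptions, since implementations differ on those details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize each reward against the mean and spread of its own sample group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption, some code uses the sample std
    # eps keeps the division finite when every sample in the group has the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Samples that beat their group's average receive a positive advantage and the rest a negative one, so no separate value network is needed to form the baseline.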

Chain-of-Thought (CoT): A prompting technique where the model generates intermediate reasoning steps before the final answer

SFT: Supervised Fine-Tuning—training a model on labeled examples to establish baseline behaviors

Monte Carlo rollouts: A technique to estimate the value of a current state by simulating many possible future trajectories from that state