Wei Yang, Defu Cao, Jiacheng Pang, Muyan Weng, Yan Liu
University of Southern California
arXiv (2026)
Agent, RL, Reasoning
📝 Paper Summary
Human-Agent Collaboration, Multi-Agent Systems (MAS), Continual Learning
HILA enables multi-agent systems to strategically defer to humans via a learned metacognitive policy, utilizing a dual-loop framework that optimizes deferral decisions with RL and capability growth with continual learning.
Core Problem
Autonomous multi-agent systems are 'closed-world,' bounded by their pre-training data, making them brittle on tasks requiring new knowledge or expertise not present in their training corpora.
Why it matters:
Purely autonomous agents cannot generate genuinely new knowledge, so the whole collective fails on tasks that require real-time information or domain expertise
Current human-in-the-loop methods rely on static heuristics (e.g., confidence thresholds) for deferral rather than learned policies
Existing feedback mechanisms treat human input as one-time fixes rather than supervised signals for long-term capability growth
Concrete Example: When a multi-agent system faces a problem requiring domain-specific expertise absent from its training data, internal collaboration merely recombines existing ignorance, leading to confident but wrong answers. HILA detects this uncertainty and triggers a 'Defer' action to a human expert, then learns from the expert's response.
Key Novelty
Dual-Loop Policy Optimization (DLPO) for Metacognitive Agents
Equips agents with a 'metacognitive policy' to decide between autonomous actions (Eval, Create) and strategic deferral (asking a human)
Separates optimization into two loops: an inner RL loop (GRPO) to learn *when* to ask, and an outer Continual Learning loop to learn *what* the expert demonstrated
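The inner loop's GRPO step scores each sampled rollout relative to the other rollouts in its group rather than against a learned value function. A minimal sketch of that group-relative normalization follows. The function name, the `eps` constant, and the choice of population standard deviation are illustrative assumptions, not details taken from the paper.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one group of rollouts sampled for the same task.

    Assumption: population standard deviation and a small eps for stability;
    real implementations differ in these details.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts for one task: two succeeded, two failed.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

Rollouts that beat the group average get a positive advantage and the rest a negative one. If every rollout in a group earns the same reward, all advantages are zero and that group contributes no learning signal.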
Architecture
Overview of HILA and Dual-Loop Policy Optimization, illustrating the coupling of multi-agent collaboration with human interaction.
Breakthrough Assessment
8/10
Proposes a principled, mathematically grounded framework (Dual-Loop) for integrating human experts into MAS, moving beyond simple heuristics to learned metacognition and continual improvement.
⚙️ Technical Details
Problem Definition
Setting: Metacognitive Markov Decision Process (Meta-MDP)
Inputs: Shared cognitive state containing task context, self context (agent's own solution), and peer context (other agents' responses)
Outputs: Metacognitive action (Eval, Create, Defer) and resulting solution sequence
Pipeline Flow
Cognitive State Construction (Task + Self + Peer Context)
Metacognitive Policy (Selects Action: Eval, Create, or Defer)
Action Execution (Internal Generation or Human Call)
State Update
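The four pipeline stages can be sketched as a single step function. This is a minimal illustration under stated assumptions: `CognitiveState`, `Action`, `step`, and the callables `policy`, `generate`, and `ask_human` are hypothetical names, and treating Eval as "adopt the most common existing answer" is an assumption made only to keep the example concrete.

```python
from collections import Counter
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List, Tuple


class Action(Enum):
    EVAL = "eval"      # exploit an existing solution
    CREATE = "create"  # generate a new solution
    DEFER = "defer"    # ask the human expert


@dataclass
class CognitiveState:
    task: str                                                # task context
    own_solution: str                                        # self context
    peer_solutions: List[str] = field(default_factory=list)  # peer context


def step(
    state: CognitiveState,
    policy: Callable[[CognitiveState], Action],
    generate: Callable[[str], str],
    ask_human: Callable[[str], str],
) -> Tuple[Action, CognitiveState]:
    """One pass: select a metacognitive action, execute it, update the state."""
    action = policy(state)
    if action is Action.DEFER:
        solution = ask_human(state.task)
    elif action is Action.CREATE:
        solution = generate(state.task)
    else:
        # Assumption: Eval adopts the most common answer among self and peers.
        candidates = [state.own_solution] + state.peer_solutions
        solution = Counter(candidates).most_common(1)[0][0]
    return action, CognitiveState(state.task, solution, state.peer_solutions)
```

In the framework described here the policy and the generator share one LLM; the sketch keeps them as separate callables only so that each stage is visible.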
System Modules
Metacognitive Policy
Decides whether to exploit existing knowledge (Eval), explore new solutions (Create), or seek help (Defer)
Model or implementation: LLM-based Policy (Shared weights with generator)
Action Executor
Executes the chosen strategy: generates a solution, selects a peer solution, or invokes the expert
Model or implementation: LLM Generator or External Human Interface
Dual-Loop Optimizer
Updates model weights based on rewards (RL) and expert data (CL)
Model or implementation: Gradient-based Optimizer
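One way to picture how the Dual-Loop Optimizer couples the two updates is a trainer that applies an RL update on every batch of rollouts and periodically fine-tunes on the expert answers gathered whenever Defer fired. The sketch below is hypothetical: `DualLoopTrainer`, `rl_update`, `sft_update`, and the fixed `sft_every` schedule are assumptions, and the source does not say how the loops are actually scheduled.

```python
from typing import Callable, List, Optional, Tuple

Rollout = Tuple[str, str, float, Optional[str]]  # (task, action, reward, expert answer or None)


class DualLoopTrainer:
    def __init__(
        self,
        rl_update: Callable[[List[Tuple[str, str, float]]], None],
        sft_update: Callable[[List[Tuple[str, str]]], None],
        sft_every: int = 4,  # assumption: outer loop runs every N inner steps
    ) -> None:
        self.rl_update = rl_update
        self.sft_update = sft_update
        self.sft_every = sft_every
        self.expert_buffer: List[Tuple[str, str]] = []
        self.rl_steps = 0

    def inner_step(self, rollouts: List[Rollout]) -> None:
        # Inner loop: learn *when* to ask from rewards.
        self.rl_update([(task, action, reward) for task, action, reward, _ in rollouts])
        # Keep expert demonstrations produced by deferrals for the outer loop.
        self.expert_buffer.extend(
            (task, expert) for task, _, _, expert in rollouts if expert is not None
        )
        self.rl_steps += 1
        # Outer loop: learn *what* the expert demonstrated via supervised updates.
        if self.rl_steps % self.sft_every == 0 and self.expert_buffer:
            self.sft_update(list(self.expert_buffer))
            self.expert_buffer.clear()
```

Passing the two update functions in as callables keeps the sketch independent of any particular model or training library.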
Novel Architectural Elements
Metacognitive action space definition ({Eval, Create, Defer}) replacing standard token-level action space
Integration of structured cognitive signals (social consensus, monitoring, control) into the policy state via lightweight heuristics
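As an illustration of what a lightweight heuristic for the social-consensus signal might look like, the sketch below computes simple agreement statistics from the agent's own answer and its peers' answers. The function name, the returned keys, and the string normalization are assumptions; the source does not describe the actual heuristics, and the monitoring and control signals are not covered here.

```python
from collections import Counter
from typing import Dict, List


def consensus_signals(own: str, peers: List[str]) -> Dict[str, float]:
    """Hypothetical social-consensus features for the policy state."""
    def norm(s: str) -> str:
        return s.strip().lower()

    answers = [norm(own)] + [norm(p) for p in peers]
    top_count = Counter(answers).most_common(1)[0][1]
    # Fraction of peers whose answer matches the agent's own answer.
    agree = sum(a == answers[0] for a in answers[1:]) / len(peers) if peers else 1.0
    return {
        "agreement_with_self": agree,
        "majority_share": top_count / len(answers),
    }


signals = consensus_signals("4", ["4", "5", "4"])
```

Low agreement combined with a low majority share is the kind of situation in which a learned policy could plausibly favor Defer over Eval, although that mapping is learned rather than hard-coded in the framework described here.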
Modeling
Base Model: Large language models (the source text does not name a specific architecture)
Training Method: Dual-Loop Policy Optimization (DLPO)
Objective Functions:
Purpose: Optimize the policy to balance success against cost.
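The source gives only the purpose of the objective, not its formula. As a purely illustrative assumption, the trade-off can be pictured as a task reward minus a fixed penalty whenever the agent defers. The names `deferral_reward`, `should_defer`, and `defer_cost`, and the simplification that the expert is always correct, are all hypothetical.

```python
def deferral_reward(correct: bool, deferred: bool, defer_cost: float = 0.25) -> float:
    """Illustrative reward: 1 for a correct final answer, minus a cost for asking."""
    return (1.0 if correct else 0.0) - (defer_cost if deferred else 0.0)


def should_defer(p_autonomous_success: float, defer_cost: float = 0.25) -> bool:
    """Compare expected rewards, assuming the expert always answers correctly.

    Acting alone yields p in expectation; deferring yields 1 - defer_cost.
    """
    return p_autonomous_success < 1.0 - defer_cost
```

Under these assumptions, deferral pays off only when the agent's chance of succeeding alone is below `1 - defer_cost`, which is the balance between intervention cost and autonomous failure that the policy has to learn.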
Code is publicly available at https://github.com/USC-Melady/HILA.git. The source text does not specify model sizes, exact training data, or compute resources.
📊 Experiments & Results
Evaluation Setup
Evaluation on mathematical and problem-solving benchmarks (as stated in the abstract)
Benchmarks:
Mathematical reasoning (benchmark names not specified in the source text)
General problem-solving (benchmark names not specified in the source text)
Metrics:
Not specified in the source text
Statistical methodology: Not specified in the source text
Main Takeaways
The framework establishes a principled foundation for agentic systems that can continually improve through human collaboration.
The proposed Dual-Loop Policy Optimization allows agents to balance the cost of human intervention against the risk of autonomous failure.
By treating expert feedback as supervision, the system transforms from a closed-world operator to an open-ended learner.
📚 Prerequisite Knowledge
Prerequisites
Multi-Agent Systems (MAS)
Reinforcement Learning (RL)
Markov Decision Processes (MDP)
Continual Learning
Key Terms
HILA: Human-In-the-Loop Multi-Agent Collaboration—the proposed framework for adaptive human-agent interaction
DLPO: Dual-Loop Policy Optimization—a training method combining inner-loop RL for decision-making and outer-loop supervised learning for knowledge acquisition
Meta-MDP: Metacognitive Markov Decision Process—a formalization where actions represent high-level cognitive strategies (e.g., creating vs. deferring) rather than just token generation
GRPO: Group Relative Policy Optimization—an RL algorithm used here to optimize the deferral policy by contrasting relative advantages of actions
Metacognitive Policy: A high-level policy that reasons about the agent's own competence and peer agreement to decide whether to act autonomously or seek help
SFT: Supervised Fine-Tuning—used in the outer loop to train the model on expert demonstrations provided during deferral