
Adaptive RAN Slicing Control via Reward-Free Self-Finetuning Agents

Yuanhao Li, Haozhe Wang, Geyong Min, Nektarios Georgalas, Wang Miao
Department of Computer Science
arXiv (2026)
Agent Memory RL

📝 Paper Summary

Memory internalization, Self-evolving, Agentic reasoning
A self-finetuning framework that enables agents to learn continuous network control by distilling linguistic reflections into model parameters, overcoming context window limits and replacing handcrafted rewards.
Core Problem
LLM agents in continuous control tasks are limited by finite context windows (they forget long-term history) and by the absence of explicit reward signals, while RL requires laborious reward engineering.
Why it matters:
  • 6G networks require persistent, autonomous adaptation to dynamic traffic, a demand that exceeds the short-term memory of prompt-based agents
  • Handcrafting rewards for multi-objective problems like RAN slicing is error-prone and limits scalability
  • Long Context Degradation prevents standard LLMs from utilizing extensive interaction history for decision improvement
Concrete Example: In RAN slicing, an agent maximizing throughput might neglect reconfiguration costs. A prompt-based LLM would eventually truncate the history of these expensive adjustments, repeating the mistake, whereas this approach internalizes the penalty into the model weights.
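The forgetting effect in this example can be illustrated with a minimal sketch. It assumes the prompt-based agent keeps only the most recent `window` interaction records. The event log, the `visible_history` helper, and the window size are hypothetical and are not taken from the paper.

```python
from collections import deque


def visible_history(events, window):
    """Return the last `window` events, as a fixed-size prompt buffer would keep them."""
    buffer = deque(maxlen=window)  # older entries are dropped automatically
    for event in events:
        buffer.append(event)
    return list(buffer)


# Hypothetical interaction log: the costly reconfigurations happen early on.
events = [
    {"step": 0, "note": "reconfiguration was costly"},
    {"step": 1, "note": "reconfiguration was costly"},
] + [{"step": t, "note": "throughput ok"} for t in range(2, 10)]

recent = visible_history(events, window=4)

# Once the early records fall out of the window, nothing in the prompt
# warns the agent about reconfiguration cost any more.
remembers_cost = any("costly" in e["note"] for e in recent)
```

In this toy log the cost observations have left the visible window by the final step, which is the situation the summary describes as leading to a repeated mistake.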
Key Novelty
Refine-from-Reflection (RfR) Framework
  • Replaces scalar rewards with a 'Reflector' that generates linguistic feedback and preference labels on trajectories
  • Replaces prompt-based memory with parameter-based memory by fine-tuning the agent on these self-generated preferences using KTO (Kahneman-Tversky Optimization)
  • Formalizes the 'Reflective MDP' where agents output actions, reflections, and analyses rather than just actions
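One way to picture the data flow described above is the following sketch, which turns Reflector-labelled steps into unpaired preference records. The `ReflectiveStep` fields, the `to_kto_records` helper, and the `prompt`/`completion`/`label` layout are assumptions made for illustration (the layout mirrors the unpaired format commonly used by KTO implementations); the paper's actual schema and prompts may differ.

```python
from dataclasses import dataclass


@dataclass
class ReflectiveStep:
    """One step of a hypothetical Reflective MDP trajectory."""
    observation: str  # textual network state shown to the agent
    analysis: str     # agent's reasoning before acting
    action: str       # chosen slicing action, expressed as text
    reflection: str   # Reflector's linguistic feedback on the step
    desirable: bool   # Reflector's preference label (no scalar reward)


def to_kto_records(steps):
    """Convert labelled steps into unpaired (prompt, completion, label) records."""
    records = []
    for step in steps:
        records.append({
            "prompt": step.observation,
            # Assumption: the agent is fine-tuned on its analysis and action.
            "completion": f"Analysis: {step.analysis}\nAction: {step.action}",
            "label": step.desirable,
        })
    return records


# Hypothetical trajectory with labels already produced by a Reflector.
steps = [
    ReflectiveStep(
        observation="Slice A load high, slice B idle",
        analysis="Shift spare resource blocks from B to A",
        action="move 10 RBs from B to A",
        reflection="QoS restored with a single reconfiguration",
        desirable=True,
    ),
    ReflectiveStep(
        observation="Load fluctuating slightly",
        analysis="Rebalance on every fluctuation",
        action="move 2 RBs from A to B",
        reflection="Frequent reconfiguration adds cost for little gain",
        desirable=False,
    ),
]

records = to_kto_records(steps)
```

Because each record carries only a binary desirable/undesirable label, no paired comparison and no handcrafted scalar reward is needed, which is the property that makes a KTO-style objective a natural fit here.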
Evaluation Highlights
  • Outperforms standard Reinforcement Learning (RL) baselines in sample efficiency and stability
  • Outperforms existing LLM-based agents such as Reflexion, which are constrained by context limitations
  • Demonstrates effective multi-objective optimization (balancing spectrum efficiency, QoS, and stability) without handcrafted reward functions
Breakthrough Assessment
8/10
Proposes a significant architectural shift from in-context learning to self-finetuning for continuous control, addressing the fundamental 'context bottleneck' of LLM agents in lifelong scenarios.