A reinforcement learning framework that accelerates early training using heuristic guidance (distance and angle to goal) which decays over time to prevent permanent human bias.
Core Problem
Reinforcement learning agents struggle with sparse rewards during early training because random exploration rarely discovers positive states, while permanent heuristics introduce suboptimal human bias.
Why it matters:
Robots learning from scratch (tabula rasa) waste significant time in 'zigzag' exploration before encountering any positive signal
Environments like Lunar Lander only provide substantial rewards upon successful termination, making early random exploration inefficient
Existing heuristic methods often introduce permanent bias, preventing the agent from learning the true optimal policy once it has collected enough data
Concrete Example: In the Lunar Lander game, a randomly acting agent will almost never land safely on the pad, receiving mostly crash penalties (-100). If a fixed heuristic reward is added for staying upright, the agent may learn to hover indefinitely to maximize that heuristic rather than actually landing.
Key Novelty
Vanishing Bias Heuristic RL
Introduces a 'vanishing bias' mechanism where heuristic rewards (e.g., distance to goal) are added to the objective only during the early stages of training to guide exploration
Applies a geometric decay factor to the heuristic weight over time, ensuring that human-defined bias gradually disappears and the agent eventually optimizes the true environmental reward
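The vanishing-bias idea can be sketched in a few lines. The paper defines the heuristic constants (k1, k2) and the decay rate (p) only symbolically, so the values and the exact form of the heuristic below are illustrative assumptions:

```python
import math

# Hypothetical constants: the paper leaves k1, k2 and p symbolic.
K1, K2 = 1.0, 0.5   # weights on the distance and angle terms (assumed)
P = 0.999           # geometric decay factor per step, 0 < P < 1 (assumed)

def heuristic(state):
    """Heuristic bonus: closer to the pad and more upright is better.

    Assumes the Lunar Lander state layout (x, y, vx, vy, angle, ...)
    with the landing pad at the origin.
    """
    x, y, _, _, angle = state[:5]
    distance = math.hypot(x, y)
    return -(K1 * distance + K2 * abs(angle))

def shaped_reward(env_reward, state, t):
    """Environmental reward plus a heuristic term whose weight vanishes.

    The geometric factor P**t drives the human-designed bias toward
    zero, so late in training the agent optimizes the true reward alone.
    """
    return env_reward + (P ** t) * heuristic(state)
```

At t = 0 the heuristic contributes at full strength; after enough steps P**t is negligible and `shaped_reward` reduces to the environmental reward, which is what distinguishes this scheme from permanent reward shaping.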
Architecture
The logic flow for Heuristic DQN, showing how the heuristic term is integrated into the target calculation
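A minimal sketch of how the heuristic term might enter the DQN target calculation. The snippet does not show the paper's exact update, so adding the decayed heuristic to the immediate reward before bootstrapping is an assumption:

```python
GAMMA = 0.99  # discount factor; a standard DQN choice (value assumed)

def td_target(q_next, reward, heuristic_bonus, weight, done):
    """Heuristic DQN target with a decaying heuristic term.

    Sketch of y = r + w_t * h(x') + gamma * max_a Q(x', a), where w_t
    (e.g. p**t) is the current heuristic weight that vanishes over time.

    q_next : list of Q-values over actions at the next state
    """
    bootstrap = 0.0 if done else GAMMA * max(q_next)
    return reward + weight * heuristic_bonus + bootstrap
```

As `weight` decays to zero, the target reduces to the standard DQN target y = r + gamma * max_a Q(x', a).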
⚙️ Technical Details
Problem Definition
Setting: Markov Decision Process (MDP) with unknown motion model and reward function
Inputs: State vector x containing 8 variables (position, velocity, angle, angular velocity, leg contacts)
Outputs: Discrete action u from set {0, 1, 2, 3} (fire engines or do nothing)
State-space bounds: Velocity in [-10^4, 10^4], Position in [-1, 1]
Compute: Not reported in the paper
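The state and action spaces above can be written out explicitly. The index order follows the standard Gym Lunar Lander layout, which the snippet does not confirm, so treat it as an assumption:

```python
# Assumed index layout of the 8-variable state vector x
STATE_FIELDS = [
    "x_position", "y_position",
    "x_velocity", "y_velocity",
    "angle", "angular_velocity",
    "left_leg_contact", "right_leg_contact",
]

# Discrete actions u in {0, 1, 2, 3} (standard Gym meanings, assumed)
ACTIONS = {
    0: "do nothing",
    1: "fire left orientation engine",
    2: "fire main engine",
    3: "fire right orientation engine",
}
```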
Comparison to Prior Work
vs. DQN/Double DQN: Proposed method adds a heuristic term to the reward function that decays over time.
vs. UAV Heuristics [10]: Proposed method decays the heuristic ('vanishing bias') whereas [10] uses it throughout training, potentially leading to suboptimal convergence.
Limitations
Reliance on manually designed heuristic functions requires domain knowledge (e.g., knowing landing pad location)
Provided text does not include quantitative experimental results to verify claims
Heuristic definition is specific to the Lunar Lander task and may not generalize automatically to other environments
Reproducibility
The paper uses the standard Lunar Lander environment (OpenAI Gym) and a public DQN implementation referenced in the text. However, the heuristic constants (k1, k2) and the decay rate (p) are defined only symbolically; the exact numerical values used in the experiments are not provided in the snippet.
📊 Experiments & Results
Evaluation Setup
Lunar Lander V2 game environment from OpenAI Gym
Benchmarks:
Lunar Lander V2 (Continuous Control / Navigation)
Metrics:
Reward (implied from problem formulation)
Landing success (implied)
Statistical methodology: Not explicitly reported in the paper
Main Takeaways
The paper proposes that heuristic guidance is critical for the 'early stage' of training where sparse rewards make random exploration inefficient
The 'vanishing bias' mechanism is argued theoretically to let the agent eventually learn the true optimal policy: once the heuristic influence decays to zero, the objective reduces to the true environmental reward, unlike methods that bake heuristics into the reward function permanently
Note: The provided text description ends before the experimental results section, so specific numerical performance gains over baselines (DQN, SARSA) are not available for extraction.
Prerequisites: Basic control theory (state space, heuristic search)
Key Terms
MDP: Markov Decision Process—a mathematical framework for modeling decision making where outcomes are partly random and partly under the control of a decision maker
DQN: Deep Q-Network—a reinforcement learning algorithm that uses a neural network to estimate the value of taking specific actions in specific states
SARSA: State-Action-Reward-State-Action—an on-policy reinforcement learning algorithm used to learn a Markov Decision Process policy
Tile Coding: A discretization technique used to convert continuous state spaces into binary feature vectors (grids) for classical RL algorithms
Heuristic: A rule-of-thumb strategy (like 'move closer to the goal') used to guide the algorithm, which may not be perfect but speeds up finding a solution
Vanishing Bias: The paper's proposed technique where the influence of the heuristic on the learning process is gradually reduced to zero over time
Robbins-Monro criterion: A condition on the learning-rate schedule (the rates must sum to infinity while their squares sum to a finite value, satisfied e.g. by α_t = 1/t) that guarantees a stochastic approximation algorithm eventually converges
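Tile coding, listed in the terms above, can be sketched for a single state variable. The tile counts, tiling counts, and value ranges here are illustrative choices, not from the paper:

```python
def tile_indices(value, low, high, n_tiles=8, n_tilings=4):
    """Return one active tile index per tiling for a scalar value.

    Each tiling is the same grid shifted by a fraction of a tile width,
    so nearby values share many (but not all) active features.
    """
    tile_width = (high - low) / n_tiles
    active = []
    for k in range(n_tilings):
        offset = (k / n_tilings) * tile_width
        idx = int((value - low + offset) / tile_width)
        idx = min(max(idx, 0), n_tiles)  # shifted grid has one extra tile
        active.append(k * (n_tiles + 1) + idx)
    return active

def binary_features(value, low, high, n_tiles=8, n_tilings=4):
    """Binary feature vector with exactly n_tilings ones, suitable as
    input to a linear method such as SARSA with tile coding."""
    vec = [0] * (n_tilings * (n_tiles + 1))
    for i in tile_indices(value, low, high, n_tiles, n_tilings):
        vec[i] = 1
    return vec
```

For the full 8-dimensional Lunar Lander state, one such coder per variable (or per pair of variables) would be concatenated into a single sparse feature vector.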