A reinforcement learning framework that accelerates early training using heuristic guidance (distance and angle to goal) which decays over time to prevent permanent human bias.
Core Problem
Reinforcement learning agents struggle with sparse rewards during early training because random exploration rarely discovers positive states, while permanent heuristics introduce suboptimal human bias.
Why it matters:
Robots learning from scratch (tabula rasa) waste significant time in 'zigzag' exploration before encountering any positive signal
Environments like Lunar Lander only provide substantial rewards upon successful termination, making early random exploration inefficient
Existing heuristic methods often introduce permanent bias, preventing the agent from learning the true optimal policy once it has collected enough data
Concrete Example: In the Lunar Lander game, a randomly acting agent will almost never land safely on the pad, receiving mostly crash penalties (-100). If a fixed heuristic reward is added for staying upright, the agent may learn to hover indefinitely to maximize that heuristic rather than actually landing.
Key Novelty
Vanishing Bias Heuristic RL
Introduces a 'vanishing bias' mechanism where heuristic rewards (e.g., distance to goal) are added to the objective only during the early stages of training to guide exploration
Applies a geometric decay factor to the heuristic weight over time, ensuring that human-defined bias gradually disappears and the agent eventually optimizes the true environmental reward
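The vanishing-bias idea can be sketched in a few lines. The paper defines the heuristic constants (k1, k2) and the decay rate (p) only symbolically, so the values and the exact form of the heuristic below are illustrative assumptions:

```python
import math

# Hypothetical constants: the paper leaves k1, k2 and p symbolic.
K1, K2 = 1.0, 0.5   # weights on the distance and angle terms (assumed)
P = 0.999           # geometric decay factor per step, 0 < P < 1 (assumed)

def heuristic(state):
    """Heuristic bonus: closer to the pad and more upright is better.

    Assumes the Lunar Lander state layout (x, y, vx, vy, angle, ...)
    with the landing pad at the origin.
    """
    x, y, _, _, angle = state[:5]
    distance = math.hypot(x, y)
    return -(K1 * distance + K2 * abs(angle))

def shaped_reward(env_reward, state, t):
    """Environmental reward plus a heuristic term whose weight vanishes.

    The geometric factor P**t drives the human-designed bias toward
    zero, so late in training the agent optimizes the true reward alone.
    """
    return env_reward + (P ** t) * heuristic(state)
```

At t = 0 the heuristic contributes at full strength; after enough steps P**t is negligible and `shaped_reward` reduces to the environmental reward, which is what distinguishes this scheme from permanent reward shaping.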
Architecture
The logic flow for Heuristic DQN, showing how the heuristic term is integrated into the target calculation
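A minimal sketch of how the heuristic term might enter the DQN target calculation. The snippet does not show the paper's exact update, so adding the decayed heuristic to the immediate reward before bootstrapping is an assumption:

```python
GAMMA = 0.99  # discount factor; a standard DQN choice (value assumed)

def td_target(q_next, reward, heuristic_bonus, weight, done):
    """Heuristic DQN target with a decaying heuristic term.

    Sketch of y = r + w_t * h(x') + gamma * max_a Q(x', a), where w_t
    (e.g. p**t) is the current heuristic weight that vanishes over time.

    q_next : list of Q-values over actions at the next state
    """
    bootstrap = 0.0 if done else GAMMA * max(q_next)
    return reward + weight * heuristic_bonus + bootstrap
```

As `weight` decays to zero, the target reduces to the standard DQN target y = r + gamma * max_a Q(x', a).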
⚙️ Technical Details
Problem Definition
Setting: Markov Decision Process (MDP) with unknown motion model and reward function
Inputs: State vector x containing 8 variables (position, velocity, angle, angular velocity, leg contacts)
Outputs: Discrete action u from set {0, 1, 2, 3} (fire engines or do nothing)
State-space bounds: Velocity in [-10^4, 10^4], Position in [-1, 1]
Compute: Not reported in the paper
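The state and action spaces above can be written out explicitly. The index order follows the standard Gym Lunar Lander layout, which the snippet does not confirm, so treat it as an assumption:

```python
# Assumed index layout of the 8-variable state vector x
STATE_FIELDS = [
    "x_position", "y_position",
    "x_velocity", "y_velocity",
    "angle", "angular_velocity",
    "left_leg_contact", "right_leg_contact",
]

# Discrete actions u in {0, 1, 2, 3} (standard Gym meanings, assumed)
ACTIONS = {
    0: "do nothing",
    1: "fire left orientation engine",
    2: "fire main engine",
    3: "fire right orientation engine",
}
```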
Comparison to Prior Work
vs. DQN/Double DQN: Proposed method adds a heuristic term to the reward function that decays over time.
vs. UAV Heuristics [10]: Proposed method decays the heuristic ('vanishing bias') whereas [10] uses it throughout training, potentially leading to suboptimal convergence.
Limitations
Reliance on manually designed heuristic functions requires domain knowledge (e.g., knowing landing pad location)
Provided text does not include quantitative experimental results to verify claims
Heuristic definition is specific to the Lunar Lander task and may not generalize automatically to other environments
Reproducibility
The paper uses the standard Lunar Lander environment (OpenAI Gym) and a public DQN implementation referenced in the text. However, the heuristic constants (k1, k2) and the decay rate (p) are defined only symbolically; the exact numerical values used in the experiments are not provided in the snippet.
📊 Experiments & Results
Evaluation Setup
Lunar Lander V2 game environment from OpenAI Gym
Benchmarks:
Lunar Lander V2 (Continuous Control / Navigation)
Metrics:
Reward (implied from problem formulation)
Landing success (implied)
Statistical methodology: Not explicitly reported in the paper
Main Takeaways
The paper proposes that heuristic guidance is critical for the 'early stage' of training where sparse rewards make random exploration inefficient
The 'vanishing bias' mechanism is argued theoretically to let the agent eventually learn the true optimal policy: once the heuristic influence decays to zero, the objective reduces to the true environmental reward, unlike methods that bake heuristics into the reward function permanently
Note: The provided text description ends before the experimental results section, so specific numerical performance gains over baselines (DQN, SARSA) are not available for extraction.
Prerequisites: Basic control theory (state space, heuristic search)
Key Terms
MDP: Markov Decision Process—a mathematical framework for modeling decision making where outcomes are partly random and partly under the control of a decision maker
DQN: Deep Q-Network—a reinforcement learning algorithm that uses a neural network to estimate the value of taking specific actions in specific states
SARSA: State-Action-Reward-State-Action—an on-policy reinforcement learning algorithm used to learn a Markov Decision Process policy
Tile Coding: A discretization technique used to convert continuous state spaces into binary feature vectors (grids) for classical RL algorithms
Heuristic: A rule-of-thumb strategy (like 'move closer to the goal') used to guide the algorithm, which may not be perfect but speeds up finding a solution
Vanishing Bias: The paper's proposed technique where the influence of the heuristic on the learning process is gradually reduced to zero over time
Robbins-Monro criterion: A condition on the learning-rate schedule (the rates must sum to infinity while their squares sum to a finite value, satisfied e.g. by α_t = 1/t) that guarantees a stochastic approximation algorithm eventually converges
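Tile coding, listed in the terms above, can be sketched for a single state variable. The tile counts, tiling counts, and value ranges here are illustrative choices, not from the paper:

```python
def tile_indices(value, low, high, n_tiles=8, n_tilings=4):
    """Return one active tile index per tiling for a scalar value.

    Each tiling is the same grid shifted by a fraction of a tile width,
    so nearby values share many (but not all) active features.
    """
    tile_width = (high - low) / n_tiles
    active = []
    for k in range(n_tilings):
        offset = (k / n_tilings) * tile_width
        idx = int((value - low + offset) / tile_width)
        idx = min(max(idx, 0), n_tiles)  # shifted grid has one extra tile
        active.append(k * (n_tiles + 1) + idx)
    return active

def binary_features(value, low, high, n_tiles=8, n_tilings=4):
    """Binary feature vector with exactly n_tilings ones, suitable as
    input to a linear method such as SARSA with tile coding."""
    vec = [0] * (n_tilings * (n_tiles + 1))
    for i in tile_indices(value, low, high, n_tiles, n_tilings):
        vec[i] = 1
    return vec
```

For the full 8-dimensional Lunar Lander state, one such coder per variable (or per pair of variables) would be concatenated into a single sparse feature vector.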