
Discovering Temporally-Aware Reinforcement Learning Algorithms

Matthew Jackson, Chris Lu, Louis Kirsch, Robert T. Lange, Shimon Whiteson, Jakob Foerster
University of Oxford, The Swiss AI Lab IDSIA, Technical University Berlin
International Conference on Learning Representations (2024)
RL Benchmark

📝 Paper Summary

Meta-Reinforcement Learning · Automated Algorithm Discovery
The paper introduces meta-learned reinforcement learning objective functions that explicitly condition on the agent's remaining training time, enabling the discovery of dynamic update rules whose behavior, such as the exploration-exploitation trade-off, shifts over the agent's lifetime.
Core Problem
Existing meta-learned RL objective functions are static and myopic; they use the same update rule regardless of whether training is just starting or nearly finished, ignoring the optimization time horizon.
Why it matters:
  • Human learners and handcrafted algorithms (e.g., learning rate decay) heavily rely on schedules relative to the training horizon to maximize performance
  • Ignoring the time horizon restricts the expressivity of discovered algorithms, preventing them from learning behaviors like annealing exploration or 'end-game' risk aversion
Concrete Example: A student approaching an exam deadline changes their study strategy compared to the start of the semester. Similarly, an RL agent should explore highly uncertain actions early in training but exploit known rewards as the training budget runs out. Current meta-learned objectives like LPG treat step 1 and step 1,000,000 identically.
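The intuition above can be made concrete with a hand-crafted analogue of what a temporally-aware objective might discover: an exploration rate that anneals with the agent's relative lifetime t/T. This is a minimal sketch; the function name, the linear schedule, and the endpoint values are illustrative assumptions, not the paper's learned rule.

```python
# Illustrative sketch (not the paper's method): exploration annealed
# as a function of the agent's relative lifetime t / T.

def exploration_rate(step: int, total_steps: int,
                     eps_start: float = 1.0, eps_end: float = 0.05) -> float:
    """Linearly anneal epsilon from eps_start to eps_end over the lifetime."""
    lifetime = step / total_steps  # relative lifetime in [0, 1]
    return eps_start + (eps_end - eps_start) * lifetime

# Early in training: explore aggressively; near the budget's end: exploit.
early = exploration_rate(step=1, total_steps=1_000_000)
late = exploration_rate(step=999_999, total_steps=1_000_000)
```

A static meta-learned objective cannot express even this simple schedule, because its update rule never sees where the agent is in its lifetime.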
Key Novelty
Temporally-Adaptive Meta-RL Objectives (TA-LPG / TA-LPO)
  • Augment the input of the meta-learned loss function (e.g., the LSTM in LPG) with the agent's relative lifetime (current step / total steps) and total horizon
  • Use Evolution Strategies (ES) instead of truncated meta-gradients to optimize these functions, ensuring the meta-learner captures long-term dependencies across the entire agent lifetime rather than just a short unroll
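The two ingredients above can be sketched together in a toy form: temporal features appended to the meta-objective's input, and a vanilla antithetic evolution-strategies update on the meta-parameters. The linear "meta-objective", the feature shapes, and the toy fitness function are illustrative assumptions, not the paper's LPG/LPO architecture, and a real ES evaluation would train a full agent per perturbed sample.

```python
import numpy as np

# Toy sketch (assumed shapes and objective, not the paper's implementation).
rng = np.random.default_rng(0)
FEAT, N_TEMPORAL = 4, 2                     # base features + (t/T, log10 T)
theta = rng.normal(size=FEAT + N_TEMPORAL)  # meta-parameters

def augment(features, step, total_steps):
    """Append relative lifetime and a horizon feature to the input."""
    temporal = np.array([step / total_steps, np.log10(total_steps)])
    return np.concatenate([features, temporal])

def meta_objective(theta, features, step, total_steps):
    """Toy linear stand-in for the meta-learned loss network."""
    return float(theta @ augment(features, step, total_steps))

def fitness(theta):
    """Toy lifetime return; the real fitness is an agent's full training run."""
    x = rng.normal(size=FEAT)
    return -sum(meta_objective(theta, x, t, 10) ** 2 for t in range(10))

# One antithetic ES step: gradient-free, so credit assignment spans the
# whole lifetime instead of a short truncated unroll.
sigma, lr, pop = 0.1, 0.01, 32
eps = rng.normal(size=(pop, theta.size))
scores = np.array([fitness(theta + sigma * e) - fitness(theta - sigma * e)
                   for e in eps])
theta = theta + lr / (2 * sigma * pop) * scores @ eps
```

The key design point is that ES evaluates each candidate objective by the return of an entire agent lifetime, which is exactly what lets the meta-learner reward schedules whose payoff only appears at the end of training.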
Evaluation Highlights
  • TA-LPG achieves maximum performance on 'sparse' Grid-World tasks in 1/8th of the training steps required by the original LPG baseline
  • TA-LPO generalizes to out-of-distribution Brax environments (continuous control) despite being meta-trained only on discrete MinAtar SpaceInvaders
  • Analysis reveals the discovered algorithms spontaneously learn dynamic schedules, such as switching from optimism (entropy maximization) early in training to pessimism (entropy minimization) at the end
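The optimism-to-pessimism switch described above can be caricatured as an entropy coefficient that crosses from positive (reward entropy, encouraging exploration) to negative (penalize entropy, locking in known behavior) across the lifetime. The linear crossover and the endpoint magnitudes are illustrative assumptions; the discovered schedule is learned, not hand-set.

```python
# Illustrative caricature of the discovered schedule: an entropy-bonus
# coefficient that flips sign over the agent's relative lifetime.

def entropy_coef(step: int, total_steps: int,
                 start: float = 0.01, end: float = -0.01) -> float:
    """Positive early (entropy maximization), negative late (minimization)."""
    lifetime = step / total_steps
    return start + (end - start) * lifetime
```

Such a sign flip is impossible for a static objective, which must commit to one entropy treatment for the whole run.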
Breakthrough Assessment
7/10
Significant conceptual advance in making meta-learned algorithms dynamic rather than static. Demonstrates that simple temporal inputs + gradient-free optimization enable sophisticated emergent schedules.