
Discovering Temporally-Aware Reinforcement Learning Algorithms

Matthew Jackson, Chris Lu, Louis Kirsch, Robert T. Lange, Shimon Whiteson, Jakob Foerster
University of Oxford, The Swiss AI Lab IDSIA, Technical University Berlin
International Conference on Learning Representations (2024)
RL Benchmark

📝 Paper Summary

Meta-Reinforcement Learning · Automated Algorithm Discovery
The paper introduces meta-learned reinforcement learning objective functions that explicitly condition on the agent's remaining training time, enabling the discovery of dynamic update rules whose behavior, such as the exploration-exploitation trade-off, shifts over the agent's lifetime.
Core Problem
Existing meta-learned RL objective functions are static and myopic; they use the same update rule regardless of whether training is just starting or nearly finished, ignoring the optimization time horizon.
Why it matters:
  • Human learners and handcrafted algorithms (e.g., learning rate decay) heavily rely on schedules relative to the training horizon to maximize performance
  • Ignoring the time horizon restricts the expressivity of discovered algorithms, preventing them from learning behaviors like annealing exploration or 'end-game' risk aversion
Concrete Example: A student approaching an exam deadline changes their study strategy compared to the start of the semester. Similarly, an RL agent should explore highly uncertain actions early in training but exploit known rewards as the training budget runs out. Current meta-learned objectives like LPG treat step 1 and step 1,000,000 identically.
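The intuition above can be made concrete with a hand-crafted analogue of what a temporally-aware objective might discover: an exploration rate that anneals with the agent's relative lifetime t/T. This is a minimal sketch; the function name, the linear schedule, and the endpoint values are illustrative assumptions, not the paper's learned rule.

```python
# Illustrative sketch (not the paper's method): exploration annealed
# as a function of the agent's relative lifetime t / T.

def exploration_rate(step: int, total_steps: int,
                     eps_start: float = 1.0, eps_end: float = 0.05) -> float:
    """Linearly anneal epsilon from eps_start to eps_end over the lifetime."""
    lifetime = step / total_steps  # relative lifetime in [0, 1]
    return eps_start + (eps_end - eps_start) * lifetime

# Early in training: explore aggressively; near the budget's end: exploit.
early = exploration_rate(step=1, total_steps=1_000_000)
late = exploration_rate(step=999_999, total_steps=1_000_000)
```

A static meta-learned objective cannot express even this simple schedule, because its update rule never sees where the agent is in its lifetime.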
Key Novelty
Temporally-Adaptive Meta-RL Objectives (TA-LPG / TA-LPO)
  • Augment the input of the meta-learned loss function (e.g., the LSTM in LPG) with the agent's relative lifetime (current step / total steps) and total horizon
  • Use Evolution Strategies (ES) instead of truncated meta-gradients to optimize these functions, ensuring the meta-learner captures long-term dependencies across the entire agent lifetime rather than just a short unroll
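The two ingredients above can be sketched together in a toy form: temporal features appended to the meta-objective's input, and a vanilla antithetic evolution-strategies update on the meta-parameters. The linear "meta-objective", the feature shapes, and the toy fitness function are illustrative assumptions, not the paper's LPG/LPO architecture, and a real ES evaluation would train a full agent per perturbed sample.

```python
import numpy as np

# Toy sketch (assumed shapes and objective, not the paper's implementation).
rng = np.random.default_rng(0)
FEAT, N_TEMPORAL = 4, 2                     # base features + (t/T, log10 T)
theta = rng.normal(size=FEAT + N_TEMPORAL)  # meta-parameters

def augment(features, step, total_steps):
    """Append relative lifetime and a horizon feature to the input."""
    temporal = np.array([step / total_steps, np.log10(total_steps)])
    return np.concatenate([features, temporal])

def meta_objective(theta, features, step, total_steps):
    """Toy linear stand-in for the meta-learned loss network."""
    return float(theta @ augment(features, step, total_steps))

def fitness(theta):
    """Toy lifetime return; the real fitness is an agent's full training run."""
    x = rng.normal(size=FEAT)
    return -sum(meta_objective(theta, x, t, 10) ** 2 for t in range(10))

# One antithetic ES step: gradient-free, so credit assignment spans the
# whole lifetime instead of a short truncated unroll.
sigma, lr, pop = 0.1, 0.01, 32
eps = rng.normal(size=(pop, theta.size))
scores = np.array([fitness(theta + sigma * e) - fitness(theta - sigma * e)
                   for e in eps])
theta = theta + lr / (2 * sigma * pop) * scores @ eps
```

The key design point is that ES evaluates each candidate objective by the return of an entire agent lifetime, which is exactly what lets the meta-learner reward schedules whose payoff only appears at the end of training.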
Evaluation Highlights
  • TA-LPG achieves maximum performance on 'sparse' Grid-World tasks in 1/8th of the training steps required by the original LPG baseline
  • TA-LPO generalizes to out-of-distribution Brax environments (continuous control) despite being meta-trained only on discrete MinAtar SpaceInvaders
  • Analysis reveals the discovered algorithms spontaneously learn dynamic schedules, such as switching from optimism (entropy maximization) early in training to pessimism (entropy minimization) at the end
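The optimism-to-pessimism switch described above can be caricatured as an entropy coefficient that crosses from positive (reward entropy, encouraging exploration) to negative (penalize entropy, locking in known behavior) across the lifetime. The linear crossover and the endpoint magnitudes are illustrative assumptions; the discovered schedule is learned, not hand-set.

```python
# Illustrative caricature of the discovered schedule: an entropy-bonus
# coefficient that flips sign over the agent's relative lifetime.

def entropy_coef(step: int, total_steps: int,
                 start: float = 0.01, end: float = -0.01) -> float:
    """Positive early (entropy maximization), negative late (minimization)."""
    lifetime = step / total_steps
    return start + (end - start) * lifetime
```

Such a sign flip is impossible for a static objective, which must commit to one entropy treatment for the whole run.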
Breakthrough Assessment
7/10
Significant conceptual advance in making meta-learned algorithms dynamic rather than static. Demonstrates that simple temporal inputs + gradient-free optimization enable sophisticated emergent schedules.