Optimal Goal-Reaching Reinforcement Learning via Quasimetric Learning

📝 Paper Summary

Goal-Conditioned Reinforcement Learning · Offline Reinforcement Learning

QRL learns optimal goal-reaching value functions by constraining the search to quasimetric models and optimizing a novel objective that locally enforces transition costs while globally maximizing state separation.

Core Problem

Standard RL algorithms like Q-learning struggle to learn optimal value functions in goal-reaching settings because they do not exploit the inherent quasimetric geometry (triangle inequality and asymmetry) of the optimal value function.

Why it matters:

Goal-reaching is a fundamental problem in robotics and planning, requiring accurate cost-to-go estimates for arbitrary start-goal pairs.
Existing methods either fail to capture the optimal value (learning on-policy values instead) or suffer from slow convergence and instability when quasimetric models are simply plugged into traditional Bellman updates.
Symmetric metric approaches fail in environments with irreversible dynamics (e.g., gravity, cliffs), while standard Q-learning is inefficient in multi-goal settings.

Concrete Example: In a discretized MountainCar environment, standard Q-learning fails to learn the correct distance structure for reaching the hilltop from various states. Contrastive RL learns a symmetric or on-policy value that doesn't reflect optimal behavior. QRL accurately recovers the 'kidney bean' shape of the true optimal value function, reflecting that gaining momentum is necessary.

Key Novelty

Quasimetric Reinforcement Learning (QRL)

Models the value function as a quasimetric (a distance metric allowing asymmetry), which is the mathematically exact structure of optimal goal-conditioned value functions.
Uses a 'rubber band' intuition: enforces local consistency with transition costs (don't overestimate local steps) while pushing all other state pairs as far apart as possible (maximizing the metric).
Provably recovers the optimal value function because the largest quasimetric that satisfies the local transition constraints (the rubber band pulled taut) corresponds exactly to the shortest-path costs.
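
The rubber-band argument can be checked by hand on a tiny deterministic graph. The sketch below is illustrative only: the four-state graph, the unit step costs, and the helper name `shortest_path_costs` are assumptions made for this example, not taken from the paper. It computes all-pairs shortest-path costs with Floyd-Warshall and confirms that they respect the local constraints and the triangle inequality while being asymmetric.

```python
import math


def shortest_path_costs(n, edges):
    """All-pairs shortest-path costs via Floyd-Warshall.

    edges maps (u, v) -> non-negative cost of the one-step transition u -> v.
    """
    d = [[0.0 if i == j else math.inf for j in range(n)] for i in range(n)]
    for (u, v), cost in edges.items():
        d[u][v] = min(d[u][v], cost)
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    return d


# Hypothetical 4-state environment: a forward chain 0 -> 1 -> 2 -> 3,
# a one-step drop from 3 back to 0, and a step back from 1 to 0.
edges = {(0, 1): 1.0, (1, 2): 1.0, (2, 3): 1.0, (3, 0): 1.0, (1, 0): 1.0}
d_star = shortest_path_costs(4, edges)

# Local constraints: no observed transition is overestimated.
local_ok = all(d_star[u][v] <= c for (u, v), c in edges.items())
# Triangle inequality over every triple of states.
triangle_ok = all(
    d_star[i][j] <= d_star[i][k] + d_star[k][j]
    for i in range(4) for j in range(4) for k in range(4)
)
print(d_star[0][3], d_star[3][0], local_ok, triangle_ok)  # 3.0 1.0 True True
```

Any quasimetric that satisfies the local constraints is bounded above by the summed step costs along every path (by repeated use of the triangle inequality), so it can never exceed `d_star`; pushing distances as high as the constraints allow therefore lands on `d_star` itself.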

Architecture

Conceptual illustration of the QRL objective using a 'rubber band' analogy.

Evaluation Highlights

Outperforms Contrastive RL and standard Q-learning by large margins on discretized MountainCar, achieving >95% success vs <20% for Diffuser in multi-goal settings.
+37% improvement over the best baseline (CQL) and +46% over handcoded controllers on offline Maze2D tasks.
Up to 4.9x improved sample efficiency in online goal-reaching benchmarks (state-based and image-based) compared to baselines like standard Q-learning with quasimetric models.

Breakthrough Assessment

8/10

Strong theoretical grounding linking value functions to quasimetrics, combined with a novel objective that departs from standard Bellman updates. Empirical results show significant gains in sample efficiency and optimality.

⚙️ Technical Details

Problem Definition

Setting: Goal-reaching Markov Decision Processes (MDPs) with deterministic dynamics

Inputs: Current state s and goal state g (can be state vectors or images)

Outputs: Optimal value V*(s, g) representing the negative cost-to-go (shortest path distance)

Pipeline Flow

Input (s, g) -> Encoder -> Latent Space
Latent Space -> Quasimetric Head -> Distance Estimate (Value)
Optimization -> Maximize global distances subject to local transition constraints

System Modules

Encoder

Maps raw states (vectors or images) to a latent space Z

Model or implementation: Deep Neural Network (architecture depends on input type)

Quasimetric Head

Computes the asymmetric distance between two latent vectors

Model or implementation: Interval Quasimetric Embedding (IQE)
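
To make the quasimetric head concrete, here is a minimal pure-Python sketch of an IQE-style distance. It assumes the sum-reduced variant, in which the latent is split into fixed-size components and each component contributes the length of the union of the intervals [x_i, max(x_i, y_i)]. The function names and the `component_size` parameter are hypothetical, and a real implementation would be batched and differentiable.

```python
def interval_union_length(intervals):
    """Total length covered by closed intervals (lo, hi) with lo <= hi."""
    total = 0.0
    cur_lo = cur_hi = None
    for lo, hi in sorted(intervals):
        if cur_hi is None or lo > cur_hi:
            # Disjoint from the running interval: bank it and start a new one.
            if cur_hi is not None:
                total += cur_hi - cur_lo
            cur_lo, cur_hi = lo, hi
        else:
            # Overlapping: extend the running interval.
            cur_hi = max(cur_hi, hi)
    if cur_hi is not None:
        total += cur_hi - cur_lo
    return total


def iqe_distance(x, y, component_size):
    """Sum-reduced IQE-style distance from latent x to latent y (a sketch)."""
    assert len(x) == len(y) and len(x) % component_size == 0
    total = 0.0
    for start in range(0, len(x), component_size):
        xs = x[start:start + component_size]
        ys = y[start:start + component_size]
        total += interval_union_length([(a, max(a, b)) for a, b in zip(xs, ys)])
    return total


x = (0.0, 0.0, 1.0, 1.0)
y = (1.0, 2.0, 0.0, 3.0)
print(iqe_distance(x, y, 2), iqe_distance(y, x, 2))  # 4.0 1.0
```

The two directions differ because only coordinates where the target exceeds the source contribute length, which is what lets the head represent irreversible dynamics.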

Transition Model

Predicts next latent state to estimate Q-values from V-values

Model or implementation: Learned latent transition function T
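
One way to read "Q-values from V-values" is sketched below, assuming a constant per-step cost so that Q(s, a, g) is the negative of the step cost plus the distance from the predicted next latent to the goal. The scalar latent, the toy `transition` and `distance` functions, and the `step_cost` default are all hypothetical stand-ins for the learned components.

```python
def q_value(z, action, z_goal, transition, distance, step_cost=1.0):
    """Estimate Q(s, a, g) as -(step cost + distance(predicted next latent, goal))."""
    z_next = transition(z, action)
    return -(step_cost + distance(z_next, z_goal))


def greedy_action(z, z_goal, actions, transition, distance):
    """Pick the action whose predicted next latent is closest to the goal."""
    return max(actions, key=lambda a: q_value(z, a, z_goal, transition, distance))


# Toy stand-ins: a 1-D latent, additive dynamics, and an asymmetric distance
# in which moving "down" (target below source) is three times as costly.
def transition(z, action):
    return z + action


def distance(z_from, z_to):
    diff = z_to - z_from
    return diff if diff >= 0 else -3.0 * diff


actions = (-1.0, 0.0, 1.0)
print([q_value(0.0, a, 2.0, transition, distance) for a in actions])  # [-4.0, -3.0, -2.0]
print(greedy_action(0.0, 2.0, actions, transition, distance))         # 1.0
```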

Novel Architectural Elements

Integration of Interval Quasimetric Embeddings (IQE) directly as the value function estimator within an RL loop.
Dual optimization objective that maximizes global distances while constraining local transitions.

Modeling

Base Model: Interval Quasimetric Embeddings (IQE) for value function; MLP or CNN encoders depending on task

Training Method: Constrained Optimization via Dual Gradient Descent

Objective Functions:

Purpose: Maximize the estimated distance between random state-goal pairs (pushing states apart).
Formally: max_theta E[d_theta(s, g)].

Purpose: Constrain local distances to be consistent with transition costs (triangle inequality anchor).
Formally: d_theta(s, s') <= -reward(s, s').

Purpose: Learn latent transitions for Q-value estimation.
Formally: Minimize L2 distance between predicted next latent and actual next latent in the quasimetric space.
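
The first two objectives can be combined into a single Lagrangian. The sketch below is one plausible reading rather than the paper's exact loss: the squared-hinge penalty on constraint violations, the `eps ** 2` slack, and the function name `qrl_losses` are assumptions made for illustration.

```python
def qrl_losses(d_random, d_local, rewards, lam, eps):
    """Return (model_loss, multiplier_loss) for one batch.

    d_random: distances d_theta(s, g) for random state-goal pairs
    d_local:  distances d_theta(s, s') for observed transitions
    rewards:  rewards for those transitions (step cost = -reward)
    lam:      non-negative Lagrange multiplier
    eps:      constraint relaxation
    """
    push_apart = sum(d_random) / len(d_random)
    # Squared hinge on violations d(s, s') + reward > 0 (assumed penalty form).
    violation = sum(max(d + r, 0.0) ** 2 for d, r in zip(d_local, rewards)) / len(d_local)
    slack = violation - eps ** 2
    model_loss = -push_apart + lam * slack  # minimized with respect to theta
    multiplier_loss = -lam * slack          # minimized with respect to lam (dual ascent)
    return model_loss, multiplier_loss


model_loss, multiplier_loss = qrl_losses(
    d_random=[2.0, 4.0], d_local=[1.5, 0.5], rewards=[-1.0, -1.0], lam=2.0, eps=0.5
)
print(model_loss, multiplier_loss)  # -3.25 0.25
```

In this example only the first transition overestimates its unit cost (1.5 > 1.0); the mean violation of 0.125 is still below `eps ** 2`, so the constraint term is slack.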

Training Data:

Offline datasets (e.g., D4RL Maze2D)
Online replay buffers (e.g., Fetch, Hand manipulation)

Key Hyperparameters:

constraint_relaxation_epsilon: Small positive value (e.g., 1e-4 implied by theory, tuned in practice)
lambda: Lagrange multiplier for the constraint term (learned)
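
How a learned multiplier reacts to the constraint can be sketched as a single dual-ascent step. The softplus parameterization that keeps lambda positive, the learning rate, and the function names here are assumptions for illustration; the source only states that lambda is learned.

```python
import math


def softplus(x):
    """Smooth positive map used here to keep the multiplier above zero."""
    return math.log1p(math.exp(x))


def dual_ascent_step(raw_lam, violation, eps, lr):
    """One gradient-ascent step on lam = softplus(raw_lam) for lam * (violation - eps**2)."""
    slack = violation - eps ** 2
    # d softplus(x) / dx is the logistic sigmoid of x.
    grad = slack / (1.0 + math.exp(-raw_lam))
    return raw_lam + lr * grad


raw = 0.0
raised = dual_ascent_step(raw, violation=0.5, eps=0.5, lr=0.1)   # constraint violated
lowered = dual_ascent_step(raw, violation=0.0, eps=0.5, lr=0.1)  # constraint satisfied
print(softplus(lowered) < softplus(raw) < softplus(raised))      # True
```

The multiplier grows while the average violation exceeds the relaxation and shrinks once the constraint is met, which is the balancing act that keeps the maximization term from running away.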

Compute: Not reported in the paper

Comparison to Prior Work

vs. Contrastive RL: QRL learns optimal V* (shortest path) via maximization + constraints, whereas Contrastive RL approximates on-policy V_pi.
vs. Quasimetric Q-Learning: QRL uses a global maximization objective rather than bootstrapping/Bellman updates, avoiding instability in iterative updates.
vs. Diffuser: QRL provides a value function that can guide planning or policy learning directly, often with better sample efficiency.
vs. Generalized Dual Models [not cited in paper]: QRL explicitly uses quasimetric structural constraints (triangle inequality) in the model architecture rather than just the objective.

Limitations

Assumes deterministic environment dynamics for the theoretical derivation.
Requires a valid quasimetric model family (like IQE) which may add architectural complexity.
Maximization objective requires careful balancing with the constraint term (via Lagrange multiplier) to avoid divergence.

Reproducibility

Code: https://github.com/quasimetric-learning/quasimetric-rl

The code is publicly available at github.com/quasimetric-learning/quasimetric-rl. The paper provides theoretical proofs in the appendix. Experiments cover both offline and online settings on standard benchmarks.

📊 Experiments & Results

Evaluation Setup

Goal-reaching tasks in offline and online RL settings.

Benchmarks:

Discretized MountainCar (Classic Control (Navigation)) [New]
D4RL Maze2D (Offline Navigation)
Fetch & Shadow Hand (Robotic Manipulation (Online))

Metrics:

Success Rate
Normalized Return
Sample Efficiency (Steps to convergence)
Statistical methodology: Means and standard deviations over 5 seeds.

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Control results on Discretized MountainCar showing QRL's superiority in multi-goal settings.
MountainCar (Reach 9 States)	Normalized Return	53.75	85.55	+31.80
MountainCar (Multi-Goal)	Normalized Return	73.27	85.55	+12.28
Offline RL performance on D4RL Maze2D datasets.
Online RL sample efficiency on robotic tasks.

Experiment Figures

Visualizations of learned value functions on Discretized MountainCar for various methods compared to Ground Truth.

Online learning curves (Success Rate vs Environment Steps) for Fetch and Hand manipulation tasks.

Main Takeaways

QRL consistently learns the optimal value function structure (visualized as kidney bean shape in MountainCar) where other methods fail or learn on-policy approximations.
The combination of quasimetric architecture AND the specific maximization-constrained objective is crucial; neither component works as well in isolation (e.g., Q-Learning + Quasimetric is slower and less accurate).
QRL generalizes effectively to high-dimensional image-based observations (e.g., Fetch image tasks), maintaining performance advantages over baselines.
In offline settings, QRL-learned values serve as excellent heuristics for trajectory planners (MPPI), significantly boosting performance over standalone planners or standard offline RL.

📚 Prerequisite Knowledge

Prerequisites

Markov Decision Processes (MDPs)
Metric and Quasimetric spaces
Q-Learning and Bellman updates
Contrastive Learning

Key Terms

Quasimetric: A distance function d(x, y) that satisfies the triangle inequality and d(x, x) = 0, but is not necessarily symmetric (d(x, y) may differ from d(y, x)).

Triangle Inequality: The property that the distance from A to C is never greater than the distance from A to B plus the distance from B to C.

Optimal Value Function V*: The maximum expected return (or minimum cost) achievable by any policy from a state to a goal.

IQE: Interval Quasimetric Embeddings—a specific neural network architecture designed to output valid quasimetrics.

HER: Hindsight Experience Replay—an RL technique where past experiences are replayed with different goals to learn more efficiently.

CQL: Conservative Q-Learning—an offline RL algorithm that regularizes Q-values to prevent overestimation on unseen actions.

MSG: Model Standard-deviation Gradients—an ensemble-based offline RL method.

Diffuser: A trajectory-based diffusion model for planning in RL.
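
The quasimetric and triangle-inequality definitions above can be checked numerically on small distance tables. This is a minimal sketch with made-up matrices; the helper names `is_quasimetric` and `is_symmetric` and the uphill/downhill costs are hypothetical.

```python
def is_quasimetric(d, tol=1e-9):
    """Check d(x, x) = 0, non-negativity, and the triangle inequality on a square table."""
    n = len(d)
    identity = all(abs(d[i][i]) <= tol for i in range(n))
    nonneg = all(d[i][j] >= -tol for i in range(n) for j in range(n))
    triangle = all(
        d[i][k] <= d[i][j] + d[j][k] + tol
        for i in range(n) for j in range(n) for k in range(n)
    )
    return identity and nonneg and triangle


def is_symmetric(d, tol=1e-9):
    """Check whether d(x, y) = d(y, x) for every pair."""
    n = len(d)
    return all(abs(d[i][j] - d[j][i]) <= tol for i in range(n) for j in range(n))


# Three points on a slope: each step uphill costs 3, each step downhill costs 1.
slope = [[3.0 * (j - i) if j > i else float(i - j) for j in range(3)] for i in range(3)]
# Symmetric table that breaks the triangle inequality (5 > 1 + 1).
broken = [[0.0, 1.0, 5.0], [1.0, 0.0, 1.0], [5.0, 1.0, 0.0]]

print(is_quasimetric(slope), is_symmetric(slope))    # True False
print(is_quasimetric(broken), is_symmetric(broken))  # False True
```

The slope table is a valid quasimetric but not a metric, which is the kind of structure irreversible dynamics produce; the second table shows that symmetry alone does not give a valid distance.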