Augmenting Offline RL with Unlabeled Data

📝 Paper Summary

Offline Reinforcement Learning, Teacher-Student Learning

Ludor is a teacher-student framework for offline RL that augments a student policy trained on limited labeled data with knowledge from a teacher policy trained on unlabeled data, using a policy similarity measure to mitigate extrapolation errors.

Core Problem

Offline RL methods often fail when encountering states or actions not present in the labeled dataset (Out-of-Distribution/OOD), and conservative approaches that stick strictly to the data perform poorly when crucial transitions are missing.

Why it matters:

Collecting comprehensive labeled data with rewards for every transition is often prohibitively expensive or impossible in real-world scenarios like robotics or navigation.
Existing conservative methods (like behavior regularization) assume the dataset covers all necessary actions, failing when the optimal path requires actions excluded from the labeled set.
Ignoring unlabeled data wastes potentially valuable information about environment dynamics and valid behaviors that could bridge gaps in the labeled dataset.

Concrete Example: In a navigation task, if labeled offline data only covers major city roads, a standard offline RL agent will never learn to use shorter side streets. Ludor allows the agent to learn these side streets from unlabeled driving logs (without rewards) to find better paths.

Key Novelty

Ludor (Teacher-Student Framework with Policy Discrepancy)

Trains a 'Teacher' network via Behavior Cloning on a secondary unlabeled dataset (e.g., expert- or medium-quality data without rewards) to capture general domain knowledge.
Trains a 'Student' network using standard offline RL on the labeled dataset, while simultaneously pulling its weights toward the Teacher's weights via Exponential Moving Average (EMA).
Introduces a 'Policy Discrepancy Measure' (cosine similarity) to weight the critic's loss, reducing the influence of OOD actions where the student deviates significantly from the teacher's known valid behavior.

Architecture

Overview of the Ludor framework showing the interaction between the Teacher network (trained on Unlabeled Data) and the Student network (trained on Offline RL Data).

Evaluation Highlights

Restores performance in data-scarce scenarios: When 60% of data is removed from the Walker2d task, Ludor maintains high scores while standard TD3BC drops from 93.21 to 2.68.
Outperforms baselines on D4RL benchmarks: Achieves higher normalized scores than TD3BC, IQL, and ORIL across various MuJoCo tasks (Hopper, Walker2d, HalfCheetah) in medium and expert settings.
Successfully integrates with multiple backbones: Demonstrates improvements when applied on top of both TD3BC and IQL algorithms.

Breakthrough Assessment

7/10

Offers a practical solution for the realistic setting where labeled data is scarce but unlabeled data is abundant. The proposed discrepancy measure cleverly balances learning from data vs. teacher priors.

⚙️ Technical Details

Problem Definition

Setting: Offline RL with two datasets: a labeled dataset D with rewards and an unlabeled dataset D_d without rewards.

Inputs: State s

Outputs: Action a

Pipeline Flow

Teacher Pretraining (Step 1) -> Teacher Training (Step 2) -> Knowledge Transfer (Step 3) -> Discrepancy Calculation (Step 4) -> Student Training (Step 5)

System Modules

Teacher Network

Learns general behavior priors from unlabeled data to guide the student

Model or implementation: Neural Network (Policy), same architecture as Student
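
As a rough illustration of how a teacher can be fit by Behavior Cloning on reward-free data, the sketch below trains a toy one-dimensional linear policy with gradient descent on a mean squared error. The linear form, the learning rate, the step count, and the names `bc_mse_loss` and `fit_linear_teacher` are assumptions made for this example; the summary does not give the teacher's actual architecture or optimizer settings.

```python
def bc_mse_loss(policy, states, actions):
    """Mean squared error between policy(s) and the logged action a."""
    errors = [(policy(s) - a) ** 2 for s, a in zip(states, actions)]
    return sum(errors) / len(errors)


def fit_linear_teacher(states, actions, lr=0.1, steps=500):
    """Fit a toy teacher a = w * s + b by gradient descent on the BC loss."""
    w, b = 0.0, 0.0
    n = len(states)
    for _ in range(steps):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * s + b - a) * s for s, a in zip(states, actions)) / n
        grad_b = sum(2 * (w * s + b - a) for s, a in zip(states, actions)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Unlabeled data: (state, action) pairs only, with no reward column.
states = [-1.0, -0.5, 0.0, 0.5, 1.0]
actions = [2.0 * s + 0.5 for s in states]  # toy behavior policy (assumption)

w, b = fit_linear_teacher(states, actions)
final_loss = bc_mse_loss(lambda s: w * s + b, states, actions)
```

No reward appears anywhere in the fitting loop, which is what lets the teacher use data that the student's RL objective cannot.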

Student Network (Actor)

Learns the optimal policy from labeled data while staying close to the teacher's knowledge

Model or implementation: Neural Network (Policy)

Policy Discrepancy Calculator

Computes similarity weight to modulate critic loss based on agreement between Student and Teacher

Model or implementation: Cosine Similarity Function
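
A minimal sketch of the similarity computation, using plain Python lists in place of action tensors. The function name, the `eps` guard against zero-length vectors, and the example action values are assumptions for illustration, not details taken from the paper.

```python
import math


def cosine_similarity(u, v, eps=1e-8):
    """Cosine of the angle between two action vectors, in [-1, 1]."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    # eps avoids division by zero when either vector is all zeros.
    return dot / max(norm_u * norm_v, eps)


# Toy actions for one state (hypothetical values).
student_action = [0.5, -0.2, 0.1]
teacher_action = [0.4, -0.1, 0.2]
kappa = cosine_similarity(student_action, teacher_action)
```

Cosine similarity compares direction only, so two actions that point the same way but differ in magnitude still score close to 1. It can also be negative; how negative values are handled when used as a loss weight is not specified in this summary.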

Novel Architectural Elements

Teacher-Student weight transfer via EMA in an Offline RL context where Teacher sees unlabeled data and Student sees labeled data
Integration of a non-probabilistic Policy Discrepancy Measure (cosine similarity) directly into the Critic's loss function to handle OOD data
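
The EMA weight transfer can be sketched as an element-wise blend of two flat weight lists. This is only an illustration under stated assumptions: `alpha = 0.9` is a hypothetical value (the summary notes that the EMA rate is not reported), the weights are toy numbers, and the teacher is held fixed here even though in the framework it is trained as well.

```python
def ema_toward_teacher(student_weights, teacher_weights, alpha):
    """Return alpha * student + (1 - alpha) * teacher, element-wise."""
    if len(student_weights) != len(teacher_weights):
        raise ValueError("student and teacher must share an architecture")
    return [alpha * p + (1.0 - alpha) * q
            for p, q in zip(student_weights, teacher_weights)]


phi = [0.0, 1.0, -2.0]   # student weights (toy values)
sigma = [1.0, 1.0, 2.0]  # teacher weights (toy values)
alpha = 0.9              # hypothetical EMA rate, not from the paper

for _ in range(3):
    phi = ema_toward_teacher(phi, sigma, alpha)
```

With `alpha` close to 1 the student moves toward the teacher slowly; each call shrinks the remaining gap by a factor of `alpha`, so after three calls the gap is 0.9^3 = 0.729 of its starting size.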

Modeling

Base Model: MLP (standard for D4RL benchmarks), specific sizes not detailed but typically 2-3 layers of 256 units for these tasks

Training Method: Ludor (Teacher-Student Framework applied to TD3BC/IQL)

Objective Functions:

Purpose: Train Teacher to mimic unlabeled data.

Formally: BC loss minimizing the MSE between the Teacher's output and the actions in the unlabeled data.

Purpose: Transfer knowledge from Teacher to Student.

Formally: EMA update of the Student weights, phi^t <- alpha * phi^t + (1 - alpha) * sigma^t, where phi denotes the Student weights and sigma the Teacher weights.

Purpose: Train Student Critic with discrepancy weighting.

Formally: Weighted TD error, where the weight is kappa = cosine_similarity(a, teacher(s)).

Purpose: Train Student Actor.

Formally: Standard actor loss (maximizing Q) plus optional BC regularization depending on base algorithm.
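
One way to read the discrepancy-weighted critic objective is as a per-sample weight on the squared TD error. The sketch below takes the weights as precomputed inputs; the exact form of the weighting in the paper's equations is not reproduced in this summary, so the multiplicative form, the function names, and the discount value are assumptions.

```python
def td_target(reward, next_q, done, gamma=0.99):
    """One-step bootstrap target: r + gamma * Q(s', a'), or just r at episode end."""
    return reward + gamma * (0.0 if done else next_q)


def weighted_td_loss(q_values, td_targets, kappas):
    """Mean over the batch of kappa_i * (Q_i - y_i)^2."""
    if not (len(q_values) == len(td_targets) == len(kappas)):
        raise ValueError("batch lengths must match")
    terms = [k * (q - y) ** 2 for q, y, k in zip(q_values, td_targets, kappas)]
    return sum(terms) / len(terms)


# Toy batch: the second sample has low student-teacher agreement (small kappa),
# so its large TD error contributes less to the loss.
q_values = [1.0, 2.0]
targets = [2.0, 0.0]
kappas = [1.0, 0.25]  # hypothetical similarity weights

weighted = weighted_td_loss(q_values, targets, kappas)
unweighted = weighted_td_loss(q_values, targets, [1.0, 1.0])
```

In this toy batch the weighted loss is 1.0 against an unweighted 2.5, which shows how low-agreement samples are down-weighted rather than discarded.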

Training Data:

Labeled Data: Standard D4RL datasets (e.g., walker2d-medium-v2)
Unlabeled Data: D4RL datasets with rewards removed (e.g., walker2d-medium-replay-v2 or walker2d-expert-v2)
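
To make the labeled/unlabeled split concrete, the sketch below turns a reward-labeled dataset into an unlabeled one by dropping its reward column. The dictionary layout and key names are assumptions chosen to resemble a typical offline RL transition buffer; they are not taken from the paper.

```python
def strip_rewards(dataset):
    """Return a copy of the dataset without its reward column."""
    return {key: list(values) for key, values in dataset.items() if key != "rewards"}


# Toy labeled dataset with three transitions (hypothetical values).
labeled = {
    "observations": [[0.0, 0.1], [0.1, 0.2], [0.2, 0.3]],
    "actions": [[0.5], [0.4], [0.3]],
    "rewards": [1.0, 0.8, 0.9],
    "terminals": [False, False, True],
}

unlabeled = strip_rewards(labeled)
```

The unlabeled copy still supports Behavior Cloning, since that needs only observations and actions, but it cannot be used for TD learning without some reward signal.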

Key Hyperparameters:

learning_rate: Not reported in the paper
batch_size: Not reported in the paper
EMA_alpha: Not reported in the paper

Compute: Not reported in the paper

Comparison to Prior Work

vs. TD3BC/IQL: Ludor uses an external unlabeled dataset and a teacher network to guide the policy, whereas baselines use only the labeled dataset (or ignore the unlabeled portion).
vs. ORIL: ORIL tries to label the rewards of the unlabeled data (which is hard/noisy), while Ludor uses the unlabeled data solely for behavior priors (BC) and weight transfer, avoiding reward modeling errors.
vs. Lu et al.: Ludor extracts structural knowledge (dynamics/actions) from unlabeled data rather than treating it as low-value (zero reward) transitions.

Limitations

Depends on the quality/relevance of the unlabeled dataset; if the teacher learns bad behaviors, it might mislead the student.
Introduces additional complexity (training two networks, EMA tuning) compared to standard single-policy offline RL.
The paper does not report standard error or confidence intervals for the main results.
Key hyperparameters (EMA rate, network sizes) are missing from the text.

Reproducibility

No code provided. Hyperparameters (learning rate, batch size, EMA alpha) are not explicitly listed in the main text. Implementation would require standard D4RL baselines (TD3BC, IQL) and adding the teacher-student logic described in equations 3-8.

📊 Experiments & Results

Evaluation Setup

MuJoCo continuous control tasks from the D4RL benchmark.

Benchmarks:

D4RL MuJoCo: continuous control (locomotion)

Metrics:

Normalized Score (D4RL convention: 0 corresponds to a random policy's return and 100 to an expert policy's return)
Statistical methodology: Not explicitly reported in the paper
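
For reference, the D4RL-style normalization rescales a raw episode return linearly between a random-policy return and an expert-policy return. The sketch below shows the arithmetic; the two reference returns are hypothetical round numbers chosen for readability, not the benchmark's actual constants.

```python
def normalized_score(raw_return, random_return, expert_return):
    """Map a raw episode return so that random -> 0 and expert -> 100."""
    if expert_return == random_return:
        raise ValueError("expert and random returns must differ")
    return 100.0 * (raw_return - random_return) / (expert_return - random_return)


# Hypothetical reference returns, chosen only to make the arithmetic easy to follow.
RANDOM_RETURN = 10.0
EXPERT_RETURN = 4010.0

half_way = normalized_score(2010.0, RANDOM_RETURN, EXPERT_RETURN)
above_expert = normalized_score(4410.0, RANDOM_RETURN, EXPERT_RETURN)
```

Because the mapping is linear and unclipped, a policy that beats the expert reference scores above 100, and one that does worse than random scores below 0.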

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Performance in data-scarce settings, where 60% of the data within a specific range is removed from the offline dataset.
Walker2d (60% removed)	Normalized Score	2.68 (TD3BC; 93.21 on the full dataset)	Not given in this summary	Not given in this summary
Main results on D4RL datasets comparing Ludor against baselines.
Walker2d-medium-v2	Normalized Score	82.5	Not reported in the paper	Not reported in the paper
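
The data-scarce setting can be approximated by dropping a fixed fraction of the transitions whose value falls in a chosen range. This summary does not say which quantity defines the range or how the removed samples are picked, so the sketch assumes one scalar per transition and uniform random removal; the function name, range bounds, and seed are hypothetical.

```python
import random


def remove_fraction_in_range(values, low, high, fraction, seed=0):
    """Return the indices kept after dropping `fraction` of items with low <= value <= high."""
    in_range = [i for i, v in enumerate(values) if low <= v <= high]
    n_drop = round(fraction * len(in_range))
    rng = random.Random(seed)  # seeded so the split is reproducible
    dropped = set(rng.sample(in_range, n_drop))
    return [i for i in range(len(values)) if i not in dropped]


# Ten toy transitions with one scalar feature each (hypothetical values).
feature = [i / 10 for i in range(10)]
kept = remove_fraction_in_range(feature, low=0.0, high=0.45, fraction=0.6)
```

Here five transitions fall in the range and three of them are dropped, leaving seven; everything outside the range is untouched, which creates a coverage gap without shrinking the rest of the dataset.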

Main Takeaways

Traditional offline RL methods (TD3BC, IQL) are highly sensitive to missing data coverage, with performance dropping to near zero when specific state-action ranges are removed.
Ludor leverages unlabeled data to mitigate this OOD issue, using a teacher policy to provide 'practical domain knowledge' that fills gaps in the labeled dataset.
The method is algorithm-agnostic and can be applied on top of existing actor-critic methods like TD3BC and IQL.
Policy discrepancy measures (cosine similarity) help prevent the critic from overestimating values for actions that deviate from the teacher's trusted priors.

📚 Prerequisite Knowledge

Prerequisites

Offline Reinforcement Learning (Actor-Critic methods)
Behavior Cloning (BC)
Teacher-Student Architectures / Knowledge Distillation

Key Terms

OOD: Out-of-Distribution—states or actions not represented in the training dataset

TD3BC: Twin Delayed DDPG with Behavior Cloning—an offline RL algorithm that adds a behavior cloning regularization term to the policy update

IQL: Implicit Q-Learning—an offline RL algorithm that avoids querying OOD actions by treating value estimation as a regression problem

EMA: Exponential Moving Average—a technique where model weights are updated as a weighted average of current and past weights, often used for stability

Behavior Cloning (BC): Supervised learning where a policy is trained to minimize the difference between its output actions and the actions in a dataset

Policy Discrepancy Measure: A metric (cosine similarity) introduced in this paper to quantify how different the student's action is from the teacher's action

Actor-Critic: An RL architecture where one network (Actor) learns the policy and another (Critic) estimates the value of states/actions