RLHF: Reinforcement Learning from Human Feedback—training agents using reward signals derived from human feedback rather than a pre-defined reward function
Bradley-Terry model: A statistical model used to predict the probability that one item is preferred over another, commonly used to train reward models from comparison data
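The Bradley-Terry preference probability can be sketched in a few lines; here the "items" are two trajectory segments scored by summed rewards (the variable names are illustrative, not from the source):

```python
import math

def bt_preference_prob(r_a: float, r_b: float) -> float:
    """Bradley-Terry probability that item A is preferred over item B,
    given scalar scores r_a and r_b (e.g. summed segment rewards from
    a learned reward model). Equals sigmoid(r_a - r_b)."""
    return 1.0 / (1.0 + math.exp(-(r_a - r_b)))
```

Training a reward model then amounts to maximizing the log-likelihood of the observed comparison labels under this probability.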
IQL: Implicit Q-Learning—an offline RL algorithm that avoids querying out-of-distribution actions by estimating the value function with expectile regression over in-dataset actions, casting value estimation as a supervised learning problem
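A minimal sketch of the asymmetric (expectile) loss at the heart of IQL, for a single temporal-difference residual; the function name and default tau are illustrative:

```python
def expectile_loss(diff: float, tau: float = 0.7) -> float:
    """IQL's expectile regression loss on diff = Q(s, a) - V(s).
    Positive residuals are weighted by tau, negative ones by 1 - tau,
    so for tau > 0.5 the value estimate skews toward an upper expectile
    of Q over dataset actions, without ever evaluating unseen actions."""
    weight = tau if diff > 0 else 1.0 - tau
    return weight * diff ** 2
```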
CQL: Conservative Q-Learning—an offline RL algorithm that learns a conservative lower bound on the value function to prevent overestimation
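The conservative term in CQL can be sketched as a log-sum-exp penalty over a discrete action set (a simplification of the full objective, with illustrative names):

```python
import math

def cql_penalty(q_all_actions: list[float], q_data_action: float) -> float:
    """Simplified CQL regularizer for discrete actions:
    logsumexp over Q-values of all actions minus the Q-value of the
    action observed in the dataset. Minimizing this pushes Q-values
    down on unseen actions and up on in-dataset actions, yielding a
    conservative (lower-bound) value estimate."""
    lse = math.log(sum(math.exp(q) for q in q_all_actions))
    return lse - q_data_action
```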
TD3BC: TD3 with Behavior Cloning—a minimalist offline RL algorithm that adds a behavior cloning regularization term to the standard TD3 objective
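A per-sample sketch of the TD3+BC actor objective; in the original formulation the normalizer uses the mean absolute Q-value over a batch, so the single-sample version below is an illustrative simplification:

```python
def td3bc_actor_loss(q_value: float,
                     policy_action: list[float],
                     data_action: list[float],
                     alpha: float = 2.5) -> float:
    """TD3+BC actor loss (to minimize): -lambda * Q(s, pi(s)) + MSE(pi(s), a),
    where lambda = alpha / |Q| balances the RL term against the behavior
    cloning term. alpha = 2.5 is the value used in the TD3+BC paper."""
    lam = alpha / (abs(q_value) + 1e-8)
    mse = sum((p - a) ** 2
              for p, a in zip(policy_action, data_action)) / len(policy_action)
    return -lam * q_value + mse
```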
Oracle: A model trained using the ground-truth, hand-engineered reward function provided by the environment simulator
ST: Scripted Teacher—synthetic feedback generated by a programmed labeler that assigns preferences exactly according to the ground-truth reward function
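A scripted teacher can be sketched as a labeler that compares ground-truth segment returns; the function name and tie convention here are illustrative assumptions:

```python
def scripted_label(rewards_a: list[float], rewards_b: list[float]) -> float:
    """Scripted-teacher preference label: 1 if segment A's ground-truth
    return exceeds B's, 0 if B's is higher, 0.5 for a tie. Unlike human
    annotators, this labeler is noise-free by construction."""
    ret_a, ret_b = sum(rewards_a), sum(rewards_b)
    if ret_a > ret_b:
        return 1.0
    if ret_b > ret_a:
        return 0.0
    return 0.5
```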
CS: Crowd-Sourced—feedback labels collected from real human workers via the Uni-RLHF platform
ex-ante filters: Quality control mechanisms applied *during* or *before* data collection (like qualifying exams or real-time validation) to filter out bad annotators
saliency map: A heatmap representation highlighting which parts of an image observation are most important for decision making, used in visual feedback