UNA: Unifying Alignments of RLHF/PPO, DPO and KTO by a Generalized Implicit Reward Function

📝 Paper Summary

LLM Alignment, Reinforcement Learning from Human Feedback (RLHF), Direct Preference Optimization (DPO)

UNA unifies RLHF, DPO, and KTO by proving the optimal policy is induced by a generalized implicit reward function, allowing alignment via stable supervised regression between implicit and explicit rewards.

Core Problem

Existing alignment methods are fragmented: RLHF is unstable and memory-intensive; DPO is limited to pairwise data and ignores reward magnitude; KTO handles binary signals but lacks a unified framework for scalar scores.

Why it matters:

RLHF's PPO (Proximal Policy Optimization) stage is notoriously unstable and requires managing four separate models in memory
DPO improves stability but cannot utilize the rich, granular scalar information provided by reward models or fine-grained human scores
Current methods cannot seamlessly switch between data types (pairwise, binary, scalar) or training modes (offline vs. online) within a single mathematical framework

Concrete Example: In RLHF, an explicit reward model might assign a score of 0.9 to a high-quality response and 0.2 to a low-quality one. DPO ignores these specific values, only caring that 0.9 > 0.2. UNA utilizes the actual score differences to regress the policy, capturing the magnitude of preference.
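The contrast in this example can be sketched numerically. The snippet below is an illustration only: the implicit reward values and the second labeling (0.6 vs 0.5) are hypothetical, and the squared-error form is one possible regression loss, not necessarily the exact loss used in the paper. It shows that a DPO-style pairwise loss never sees the explicit scores, while a score-regression loss changes when the gap between them changes.

```python
import math

def dpo_pair_loss(implicit_chosen, implicit_rejected):
    """DPO-style loss for one pair: -log sigmoid(margin). Explicit scores are not an input."""
    margin = implicit_chosen - implicit_rejected
    return math.log1p(math.exp(-margin))

def mse_pair_loss(implicit_chosen, implicit_rejected, score_chosen, score_rejected):
    """Score regression: pull each implicit reward toward its explicit score."""
    return 0.5 * ((implicit_chosen - score_chosen) ** 2
                  + (implicit_rejected - score_rejected) ** 2)

# Hypothetical implicit rewards for the two responses (illustrative values only).
r_chosen, r_rejected = 0.5, 0.3

# Two labelings with the same ordering but different magnitudes.
wide_gap = (0.9, 0.2)    # scores from the example above
narrow_gap = (0.6, 0.5)  # hypothetical alternative labeling

dpo_loss = dpo_pair_loss(r_chosen, r_rejected)             # same under both labelings
mse_wide = mse_pair_loss(r_chosen, r_rejected, *wide_gap)
mse_narrow = mse_pair_loss(r_chosen, r_rejected, *narrow_gap)

print(f"DPO loss (either labeling): {dpo_loss:.4f}")
print(f"MSE loss, scores 0.9/0.2:   {mse_wide:.4f}")
print(f"MSE loss, scores 0.6/0.5:   {mse_narrow:.4f}")
```

Because `dpo_pair_loss` takes no score arguments, any two labelings with the same winner produce the same training signal; the regression loss distinguishes them.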

Key Novelty

Generalized Implicit Reward Function

Mathematically proves that the optimal policy in the RLHF objective is induced by a specific logarithmic ratio of the policy and reference model
Transforms the alignment problem into a supervised learning task (e.g., MSE) that minimizes the difference between this 'implicit' reward (from the policy) and any 'explicit' reward (human or AI)
Unifies data ingestion: treats pairwise, binary, and scalar feedback as variations of the same regression problem
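A minimal sketch of how the three feedback types might share one implicit reward, assuming MSE for scalar scores and BCE for binary and pairwise signals. The function names are hypothetical and the specific pairing of loss to feedback type is an assumption made for illustration; the summary only states that MSE and BCE are the discrepancy functions involved.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mse(target, pred):
    return (target - pred) ** 2

def bce(target, prob, eps=1e-12):
    prob = min(max(prob, eps), 1.0 - eps)  # clamp to keep log() finite
    return -(target * math.log(prob) + (1.0 - target) * math.log(1.0 - prob))

def scalar_loss(explicit_score, implicit_reward):
    """Scalar feedback: regress the implicit reward onto the explicit score."""
    return mse(explicit_score, implicit_reward)

def binary_loss(thumbs_up, implicit_reward):
    """Binary feedback: read sigmoid(implicit reward) as P(thumbs up)."""
    return bce(1.0 if thumbs_up else 0.0, sigmoid(implicit_reward))

def pairwise_loss(implicit_chosen, implicit_rejected):
    """Pairwise feedback: Bradley-Terry probability that chosen beats rejected, target 1."""
    return bce(1.0, sigmoid(implicit_chosen - implicit_rejected))

print(scalar_loss(0.9, 0.5))
print(binary_loss(True, 2.0), binary_loss(True, -2.0))
print(pairwise_loss(0.5, 0.3))
```

Under these assumptions the pairwise case reduces to the familiar DPO loss, -log sigmoid of the implicit reward margin, which is consistent with the claim that DPO is one instance of the framework.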

Architecture

A conceptual comparison of UNA against RLHF, DPO, and KTO workflows, highlighting data flow and model components.

Evaluation Highlights

+2.39 average score improvement on the new Open LLM Leaderboard using UNA-score (MSE) compared to DPO (30.92 vs 28.53)
Reduces training time for online alignment by ~19% (6.5 hours vs 8 hours for RLHF) while removing the need for a Value model
Achieves 6.78 on MT-Bench with UNA-binary, outperforming KTO (5.99) and DPO (6.1)

Breakthrough Assessment

8/10

Significantly simplifies the alignment landscape by unifying major paradigms (RLHF, DPO, KTO) under one mathematical derivation. The ability to use scalar rewards effectively in a DPO-like framework is a strong practical contribution.

⚙️ Technical Details

Problem Definition

Setting: Aligning Large Language Models (LLMs) to human preferences using various feedback signals (pairwise, binary, or scalar)

Inputs: Prompt x, Response y, Explicit Reward r_phi (derived from pairs, thumbs up/down, or scores)

Outputs: Aligned Policy pi_theta

Pipeline Flow

Prompt Input
Policy Generation (LLM)
Implicit Reward Calculation
Explicit Reward Integration (Offline or Online)
Loss Minimization
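The pipeline steps above can be sketched as a toy loop. This is not the paper's training code: the "policy" is a softmax over three candidate responses rather than an LLM, the explicit rewards are made-up numbers, and `BETA = 1.0` is chosen for readability rather than taken from the reported hyperparameters. The gradient is the analytic derivative of the MSE between implicit and explicit rewards with respect to the logits.

```python
import math

BETA = 1.0                             # toy value, not a setting from the paper
EXPLICIT = [0.9, 0.5, 0.2]             # hypothetical explicit rewards, one per response
REF_LOGP = [math.log(1.0 / 3.0)] * 3   # uniform reference policy over the candidates

def log_softmax(logits):
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]

def implicit_rewards(logits):
    # r_theta = beta * (log pi_theta - log pi_ref)
    logp = log_softmax(logits)
    return [BETA * (lp - ref) for lp, ref in zip(logp, REF_LOGP)]

def loss_and_grad(logits):
    probs = [math.exp(lp) for lp in log_softmax(logits)]
    diffs = [r - e for r, e in zip(implicit_rewards(logits), EXPLICIT)]
    n = len(logits)
    loss = sum(d * d for d in diffs) / n
    total = sum(diffs)
    # d(loss)/d(logit_k) = (2 * beta / n) * (diff_k - pi_k * sum(diffs))
    grad = [2.0 * BETA / n * (d - p * total) for d, p in zip(diffs, probs)]
    return loss, grad

def train(steps=500, lr=0.2):
    logits = [0.0, 0.0, 0.0]           # start at the reference policy
    initial_loss, _ = loss_and_grad(logits)
    for _ in range(steps):
        _, grad = loss_and_grad(logits)
        logits = [v - lr * g for v, g in zip(logits, grad)]
    final_loss, _ = loss_and_grad(logits)
    probs = [math.exp(lp) for lp in log_softmax(logits)]
    return initial_loss, final_loss, probs

initial_loss, final_loss, probs = train()
print(f"loss: {initial_loss:.4f} -> {final_loss:.4f}")
print("policy probabilities:", [round(p, 3) for p in probs])
```

The loss need not reach zero in this toy: the probabilities must sum to one, so not every reward vector is exactly representable. In an online variant, the `EXPLICIT` lookup would be replaced by a call to a reward model on freshly sampled responses.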

System Modules

Policy Model

Generates responses and computes token probabilities for implicit reward calculation

Model or implementation: Mistral-7B-v0.1 (or Mistral-INST)
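As a rough illustration of how token probabilities turn into an implicit reward: the log-probability of a response is the sum of its per-token log-probabilities, so the log ratio against the reference model is a difference of two sums. The token log-probabilities below are invented for the example, and the helper names are hypothetical.

```python
def sequence_logprob(token_logprobs):
    """log pi(y|x) is the sum of per-token log-probabilities of the response."""
    return sum(token_logprobs)

def implicit_reward(policy_token_logprobs, ref_token_logprobs, beta):
    """beta * log(pi_theta(y|x) / pi_ref(y|x)), computed in log space."""
    return beta * (sequence_logprob(policy_token_logprobs)
                   - sequence_logprob(ref_token_logprobs))

# Made-up log-probabilities for a three-token response.
policy_lp = [-0.2, -0.5, -0.1]
ref_lp = [-0.4, -0.6, -0.3]

reward = implicit_reward(policy_lp, ref_lp, beta=0.03)
print(f"implicit reward: {reward:.4f}")
```

A positive value means the policy assigns the response more probability than the reference does; working in log space avoids underflow on long responses.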

Reward Aggregator

Calculates the difference between the implicit reward (from policy) and explicit reward (from dataset or RM)

Model or implementation: Mathematical Function (MSE/BCE)

Novel Architectural Elements

Replacement of the PPO RL loop (actor plus critic/value model) with a direct supervised regression loop between implicit and explicit rewards

Modeling

Base Model: mistralai/Mistral-7B-v0.1

Training Method: Unified Alignment (UNA) - Supervised Regression

Objective Functions:

Purpose: Minimize difference between implicit and explicit rewards.

Formally: L(pi_theta) = E_{(x,y)~D} [g(r_phi(x,y), r_theta(x,y))], where g is a discrepancy measure such as MSE or BCE

Purpose: Define implicit reward via policy ratio.

Formally: r_theta(x, y) = beta * log(pi_theta(y|x) / pi_ref(y|x))
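For context, the log-ratio form follows from the standard KL-regularized RLHF objective. The block below reproduces the well-known derivation popularized by DPO, written with this summary's symbols; Z(x) is the usual partition function and is not notation taken from the summary. How UNA treats the prompt-only term is not spelled out here, so the last line should be read as the generic result rather than the paper's exact statement.

```latex
\begin{align}
\max_{\pi_\theta}\;
  & \mathbb{E}_{x \sim D}\Big[
      \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\big[ r_\phi(x, y) \big]
      - \beta\, \mathrm{KL}\big( \pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big)
    \Big] \\
\pi^{*}(y \mid x)
  &= \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x)\,
     \exp\!\Big( \tfrac{1}{\beta}\, r_\phi(x, y) \Big),
  \qquad
  Z(x) = \sum_{y'} \pi_{\mathrm{ref}}(y' \mid x)\,
         \exp\!\Big( \tfrac{1}{\beta}\, r_\phi(x, y') \Big) \\
r_\phi(x, y)
  &= \beta \log \frac{\pi^{*}(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x)
\end{align}
```

The implicit reward r_theta listed above corresponds to the first term of the last line; the second term depends only on the prompt x.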

Adaptation: LoRA (r=32)

Trainable Parameters: LoRA adapters on Policy Model
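To give a sense of scale for rank-32 adapters, the sketch below counts the parameters LoRA adds to a single square projection. The 4096 hidden size is Mistral-7B's; which layers actually receive adapters is not stated in this summary, so the single-matrix setup is a hypothetical example.

```python
def lora_param_count(d_in, d_out, rank):
    """LoRA adds two matrices, A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)

hidden = 4096                    # Mistral-7B hidden size
full = hidden * hidden           # parameters in one frozen square projection
adapter = lora_param_count(hidden, hidden, rank=32)
fraction = adapter / full

print(f"full matrix: {full:,} params")
print(f"LoRA r=32:   {adapter:,} params ({fraction:.2%} of the matrix)")
```

For this one matrix the adapter is about 1.6% of the frozen weights, which is why only the adapters need gradients and optimizer state.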

Training Data:

HelpSteer2 dataset (prompts, chosen/rejected responses, scalar scores)
Attributes: helpfulness, correctness, coherence, complexity, verbosity

Key Hyperparameters:

beta (binary): 0.01
beta (others): 0.03
learning_rate (score): 3e-5
learning_rate (others): 5e-6
batch_size: Not explicitly reported in the paper (presumably the same as the baselines)
online_beta: 30
online_learning_rate: 3e-6

Compute: 8x A100 GPUs (80 GB each). Online UNA training time: 6.5 hours (vs 8 hours for RLHF).
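The reported wall-clock figures can be turned into a relative saving and a speedup factor with simple arithmetic. The snippet uses only the two durations quoted above; the variable names are illustrative.

```python
rlhf_hours = 8.0
una_hours = 6.5

hours_saved = rlhf_hours - una_hours
relative_reduction = hours_saved / rlhf_hours   # fraction of RLHF time saved
speedup = rlhf_hours / una_hours                # how many times faster

print(f"saved {hours_saved} h, {relative_reduction:.2%} reduction, {speedup:.2f}x speedup")
```

This works out to 1.5 hours saved, an 18.75% reduction, or roughly a 1.23x speedup on the stated hardware.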

Comparison to Prior Work

vs. RLHF: UNA removes the Value model and PPO instability, replacing it with stable supervised regression
vs. DPO: UNA incorporates scalar reward magnitudes (score-based) rather than just binary preference order
vs. KTO: UNA generalizes to continuous scores, whereas KTO is designed for binary signals
vs. IPO [not cited in paper]: IPO also adds a regression-like term to DPO but focuses on preventing overfitting; UNA focuses on unifying feedback types via explicit reward matching

Limitations

Alignment tax (performance drop on some base capabilities) still exists for smaller models (1B-2B range)
Requires a separate Reward Model for the online setting (unlike DPO), which adds memory burden compared to pure DPO, though less than PPO
Experiments limited to 7B scale models; scaling laws for 'alignment tax' reduction not verified
Relies on the quality of the explicit reward (human or RM); noisy explicit rewards could propagate errors directly

Reproducibility

Code: publicly available at https://github.com/ZhichaoWang970201/UNA-UFT/

Uses the HelpSteer2 dataset. Model weights for the policy (Mistral) and the reward model (Ray2333/GRM-Llama3.2-3B-rewardmodel-ft) are public.

📊 Experiments & Results

Evaluation Setup

Fine-tuning Mistral-7B on HelpSteer2 and evaluating on standard benchmarks

Benchmarks:

Open LLM Leaderboard (New): general capabilities (BBH, GPQA, MMLU-Pro, etc.)
Open LLM Leaderboard (Old): general capabilities (GSM8K, TruthfulQA, ARC, etc.)
MT-Bench (Multi-turn conversation quality)
AlpacaEval (Instruction following)

Metrics:

Average Score (Leaderboard)
MT-Bench Score (1-10)
AlpacaEval Win Rate (%)
Statistical methodology: Not explicitly reported in the paper

Key Results

Offline experiments comparing UNA against DPO and KTO on standard benchmarks using different feedback types.

Benchmark	Metric	Baseline	This Paper	Δ
Open LLM Leaderboard (New)	Average Score	28.53	30.92	+2.39
MT-Bench	Score	5.99	6.78	+0.79
AlpacaEval	Win Rate (%)	3.67	8.78	+5.11

Online experiments comparing UNA against RLHF (PPO) using a reward model.

Benchmark	Metric	Baseline	This Paper	Δ
Open LLM Leaderboard (New)	Average Score	29.12	29.15	+0.03
MT-Bench	Score	6.60	6.71	+0.11
AlpacaEval	Win Rate (%)	10.15	10.54	+0.39

Experiment Figures

The specific application pipelines of UNA for different scenarios: Offline (DPO equivalent, KTO improvement, Distillation) and Online (RLHF simplification).

Main Takeaways

UNA outperforms DPO and KTO on the reported offline benchmarks, particularly when leveraging scalar score data (UNA-score), which DPO cannot naturally use.
In online settings, UNA matches or slightly exceeds RLHF performance while eliminating the need for a Value model and improving training stability.
The framework effectively unifies different feedback granularities (binary vs. score) under a single loss formulation (difference minimization).
Training speed is improved compared to RLHF (6.5h vs 8h) due to simplified architecture.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning from Human Feedback (RLHF)
Bradley-Terry model for preference modeling
KL Divergence (Kullback-Leibler)

Key Terms

RLHF: Reinforcement Learning from Human Feedback—a method to fine-tune models using a reward model trained on human preferences

PPO: Proximal Policy Optimization—an RL algorithm used in RLHF to update the policy while preventing drastic deviations

DPO: Direct Preference Optimization—an algorithm that optimizes the policy directly from preference data without an explicit reward model

KTO: Kahneman-Tversky Optimization—an alignment method dealing with unpaired binary feedback (thumbs up/down) using prospect theory

Implicit Reward: A reward value derived mathematically from the ratio of the current policy's probability to the reference policy's probability

LoRA: Low-Rank Adaptation, a parameter-efficient fine-tuning technique that trains small low-rank adapter matrices while keeping the original model weights frozen

MSE: Mean Squared Error—a loss function measuring the average squared difference between estimated and actual values