Guan Wang, Sijie Cheng, Xianyuan Zhan, Xiangang Li, Sen Song, Yang Liu
Department of Computer Science and Technology, Tsinghua University,
Shanghai Artificial Intelligence Laboratory
International Conference on Learning Representations
(2024)
RL · Benchmark
📝 Paper Summary
LLM Fine-tuning · Alignment with Mixed-Quality Data
OpenChat aligns LLMs using mixed-quality data without preference labels by treating data sources as coarse rewards and training a class-conditioned policy via single-stage supervised learning.
Core Problem
Standard supervised fine-tuning treats all mixed-quality data equally, degrading performance, while RLHF requires expensive, high-quality human preference labels that are difficult to obtain.
Why it matters:
Open-source SFT datasets (like ShareGPT) contain large amounts of sub-optimal data (e.g., from GPT-3.5) mixed with sparse expert data (GPT-4), hurting model quality if trained indiscriminately
Collecting pairwise preference data for RLHF is costly and labor-intensive, creating a barrier for open-source model development
Existing RLFT methods are complex and unstable, often requiring multiple stages and separate reward models
Concrete Example: A dataset might mix high-quality GPT-4 responses with lower-quality GPT-3.5 responses. Standard SFT trains on both equally, causing the model to learn sub-optimal behaviors. OpenChat conditions the model on the data source, allowing it to distinguish between sources and generate only 'expert'-quality outputs at inference.
Key Novelty
Conditioned-RLFT (C-RLFT)
Treats different data sources (e.g., GPT-4 vs. GPT-3.5) as coarse-grained reward classes rather than using fine-grained individual preference labels
Optimizes the policy by conditioning the LLM on these class labels during training (learning to mimic specific sources) and regularizing against a class-conditioned reference policy
Simplifies the RL problem into a single-stage, supervised reward-weighted regression, avoiding the need for separate reward models or PPO training loops
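In code terms, C-RLFT's reward-weighted regression reduces to a class-weighted next-token loss. A minimal sketch, assuming illustrative class weights and function names (the exact reward values are not taken from the paper):

```python
import math

# Illustrative coarse class rewards: expert (e.g. GPT-4) data gets full
# weight, sub-optimal (e.g. GPT-3.5) data a smaller one. The paper uses
# coarse per-source rewards; these particular numbers are assumptions.
CLASS_WEIGHTS = {"expert": 1.0, "sub_optimal": 0.1}

def c_rlft_loss(token_logprobs, source_class):
    """Class-weighted negative log-likelihood for one response.

    token_logprobs: log-probabilities the class-conditioned policy
        assigns to each target token of the response.
    source_class: coarse data-source label acting as the reward class.
    """
    nll = -sum(token_logprobs) / len(token_logprobs)
    return CLASS_WEIGHTS[source_class] * nll
```

Because the reward enters only as a per-example weight, the whole objective stays a single-stage supervised loss: no reward model, no PPO loop.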
Architecture
Conceptual framework of OpenChat showing the C-RLFT process. It illustrates separating mixed SFT data into expert and sub-optimal sets, training a class-conditioned policy, and performing inference with the expert condition.
Evaluation Highlights
openchat-13b achieves the highest average performance among all 13b open-source language models on Alpaca-Eval, MT-bench, and Vicuna-bench
openchat-13b surpasses gpt-3.5-turbo on Alpaca-Eval, MT-bench, and Vicuna-bench despite using mixed-quality training data
Achieves top-1 average accuracy among all 13b open-source models on AGIEval, demonstrating generalization beyond instruction following
Breakthrough Assessment
9/10
Proposes a theoretically grounded yet extremely simple method (conditional SFT) to solve the complex RLHF problem for mixed-quality data, achieving SOTA results for its size class without preference labels.
⚙️ Technical Details
Problem Definition
Setting: Fine-tuning a pre-trained LLM using a dataset containing a mix of expert and sub-optimal conversations without preference labels.
Inputs: Instruction x and mixed-quality response dataset D (containing subsets D_exp and D_sub)
Outputs: Fine-tuned policy pi_theta(y|x) that generates high-quality responses
Pipeline Flow
Input Processing: Data Augmentation with Class Condition
Generation: Class-Conditioned LLM Inference
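The two pipeline steps can be sketched as follows. The record fields and condition tokens below are hypothetical placeholders; the paper provides its exact conversation templates:

```python
# Step 1 (input processing): tag each conversation with a coarse quality
# class from its source model and prepend a class-specific template.
# Field names and condition tokens here are illustrative assumptions.
def augment(record):
    source = "expert" if record["model"] == "gpt-4" else "sub_optimal"
    tag = "GPT4" if source == "expert" else "GPT3"
    text = f"{tag} User: {record['prompt']} {tag} Assistant: {record['response']}"
    return {"text": text, "class": source}

# Step 2 (generation): at inference, always condition on the expert
# class so the policy imitates its highest-quality source.
def inference_prompt(instruction):
    return f"GPT4 User: {instruction} GPT4 Assistant:"
```

The quality condition thus lives entirely in the prompt text, so any standard decoder-only LLM can be trained and served this way without architectural changes.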
System Modules
Prompt Formatter
Augments input instructions with class-specific templates to denote data source quality
Model or implementation: Rule-based string formatting
Conditional Generator
Generates response based on instruction and quality condition
Model or implementation: llama-2-13b
Novel Architectural Elements
Class-conditioned policy architecture: The model is explicitly trained to accept a 'quality token' (via prompt templates) that acts as a control variable for response quality
Modeling
Base Model: llama-2-13b
Training Method: Conditioned-RLFT (C-RLFT)
Objective Functions:
Purpose: Maximize likelihood of the data given the instruction and the class condition.
Adaptation: Full fine-tuning (implied, as standard for 13B models in this context)
Training Data:
ShareGPT dataset containing mixed GPT-4 (expert) and GPT-3.5 (sub-optimal) conversations
Data split into classes based on source models
Compute: Not reported in the paper
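In symbols, the training objective above can be sketched as follows (a reconstruction consistent with this summary's description; notation may differ from the paper's): a KL-regularized RL objective over coarse class rewards that reduces to a single-stage, reward-weighted supervised loss.

```latex
% KL-regularized RL with coarse class rewards r_c and a
% class-conditioned reference policy \pi_c:
\max_{\pi_\theta}\;
  \mathbb{E}_{(x,y)\sim\pi_\theta}\big[r_c(x,y)\big]
  \;-\; \beta\, D_{\mathrm{KL}}\!\big(\pi_\theta \,\|\, \pi_c\big)
% which admits the single-stage reward-weighted regression form:
J(\theta) \;=\; \mathbb{E}_{(x,y,c)\sim \mathcal{D}}
  \big[\, r_c(x,y)\, \log \pi_\theta(y \mid x, c)\,\big]
```

Here r_c assigns a higher coarse reward to D_exp than to D_sub, so maximizing J(θ) is exactly maximum likelihood on the class-conditioned data, weighted by source quality.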
Comparison to Prior Work
vs. Vicuna: OpenChat conditions on data sources to filter low-quality noise, whereas Vicuna treats all ShareGPT data equally
vs. RLHF: OpenChat uses coarse-grained source labels (expert vs sub-optimal) instead of fine-grained pairwise preferences and avoids unstable PPO training
vs. DPO: OpenChat does not require pairwise preference data (x, y_w, y_l), only single samples with source labels [not cited in paper]
Limitations
Relies on the assumption that data sources (classes) have distinct and consistent quality levels (e.g., GPT-4 is always better than GPT-3.5)
Does not utilize fine-grained preference information within a class (e.g., distinguishing better vs worse GPT-4 responses)
Requires distinct data sources; effectiveness on datasets without clear source distinctions is unclear
Code, data, and models are publicly available at https://github.com/imoneoi/openchat and on Hugging Face. The specific conversation templates are provided in the paper.
📊 Experiments & Results
Evaluation Setup
Instruction following and standard NLP benchmarks
Benchmarks:
Alpaca-Eval (instruction-following evaluation: win rate against text-davinci-003)
MT-bench (multi-turn conversation evaluation, graded by GPT-4)
Vicuna-bench (chatbot-arena-style evaluation)
AGIEval (standardized exams: GRE, SAT, etc.)
Metrics:
Win rate
Average score (MT-bench)
Accuracy (AGIEval)
Statistical methodology: Not explicitly reported in the paper
Key Results
Benchmark: MT-bench
Metric: Average score
Baseline, This Paper, and Δ: Not reported in the paper
Main Takeaways
C-RLFT effectively leverages mixed-quality data (ShareGPT) to outperform baselines that treat all data equally (like Vicuna), without needing preference labels.
The method achieves state-of-the-art performance for 13B open-source models across multiple benchmarks (Alpaca-Eval, MT-bench, Vicuna-bench), reportedly surpassing GPT-3.5-turbo.
The approach is lightweight and avoids the complexities of RLHF (like training separate reward models or PPO), effectively reducing alignment to a single-stage supervised learning problem.
📚 Prerequisite Knowledge
Prerequisites
Supervised Fine-tuning (SFT) for LLMs
Reinforcement Learning from Human Feedback (RLHF)
KL-regularized Reinforcement Learning
Key Terms
C-RLFT: Conditioned Reinforcement Learning Fine-tuning—the proposed method that uses data sources as coarse reward signals to train a class-conditioned policy
SFT: Supervised Fine-Tuning—training a model to predict the next token on a dataset of instruction-response pairs
RLHF: Reinforcement Learning from Human Feedback—aligning models using rewards derived from human preferences
ShareGPT: A dataset of user-shared conversations with ChatGPT (containing both GPT-3.5 and GPT-4 data)
KL divergence: A statistical distance measure used to prevent the fine-tuned model from deviating too far from the reference model
PPO: Proximal Policy Optimization—a common reinforcement learning algorithm used in RLHF
AGIEval: A benchmark evaluating foundation models on human-centric standardized exams (e.g., GRE, SAT)
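As a quick illustration of the KL-divergence term above, a minimal sketch for discrete distributions (e.g., next-token probabilities over a small vocabulary):

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) = sum_i p_i * log(p_i / q_i), in nats.

    Measures how far distribution p (e.g. the fine-tuned policy's
    next-token distribution) drifts from q (the reference policy);
    it is zero exactly when the two distributions are identical.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```

In KL-regularized fine-tuning this quantity is penalized so the updated policy stays close to the (here, class-conditioned) reference policy.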