Guan Wang, Sijie Cheng, Xianyuan Zhan, Xiangang Li, Sen Song, Yang Liu
Department of Computer Science and Technology, Tsinghua University,
Shanghai Artificial Intelligence Laboratory
International Conference on Learning Representations
(2024)
RL · Benchmark
📝 Paper Summary
LLM Fine-tuning · Alignment with Mixed-Quality Data
OpenChat aligns LLMs using mixed-quality data without preference labels by treating data sources as coarse rewards and training a class-conditioned policy via single-stage supervised learning.
Core Problem
Standard supervised fine-tuning treats all mixed-quality data equally, degrading performance, while RLHF requires expensive, high-quality human preference labels that are difficult to obtain.
Why it matters:
Open-source SFT datasets (like ShareGPT) contain large amounts of sub-optimal data (e.g., from GPT-3.5) mixed with sparse expert data (GPT-4), hurting model quality if trained indiscriminately
Collecting pairwise preference data for RLHF is costly and labor-intensive, creating a barrier for open-source model development
Existing RLFT methods are complex and unstable, often requiring multiple stages and separate reward models
Concrete Example: A dataset might mix high-quality GPT-4 responses with lower-quality GPT-3.5 responses. Standard SFT trains on both equally, causing the model to learn sub-optimal behaviors. OpenChat conditions the model on the data source, allowing it to distinguish between sources and generate only 'expert'-quality outputs at inference.
Key Novelty
Conditioned-RLFT (C-RLFT)
Treats different data sources (e.g., GPT-4 vs. GPT-3.5) as coarse-grained reward classes rather than using fine-grained individual preference labels
Optimizes the policy by conditioning the LLM on these class labels during training (learning to mimic specific sources) and regularizing against a class-conditioned reference policy
Simplifies the RL problem into a single-stage, supervised reward-weighted regression, avoiding the need for separate reward models or PPO training loops
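In code terms, C-RLFT's reward-weighted regression reduces to a class-weighted next-token loss. A minimal sketch, assuming illustrative class weights and function names (the exact reward values are not taken from the paper):

```python
import math

# Illustrative coarse class rewards: expert (e.g. GPT-4) data gets full
# weight, sub-optimal (e.g. GPT-3.5) data a smaller one. The paper uses
# coarse per-source rewards; these particular numbers are assumptions.
CLASS_WEIGHTS = {"expert": 1.0, "sub_optimal": 0.1}

def c_rlft_loss(token_logprobs, source_class):
    """Class-weighted negative log-likelihood for one response.

    token_logprobs: log-probabilities the class-conditioned policy
        assigns to each target token of the response.
    source_class: coarse data-source label acting as the reward class.
    """
    nll = -sum(token_logprobs) / len(token_logprobs)
    return CLASS_WEIGHTS[source_class] * nll
```

Because the reward enters only as a per-example weight, the whole objective stays a single-stage supervised loss: no reward model, no PPO loop.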
Architecture
Conceptual framework of OpenChat showing the C-RLFT process. It illustrates separating mixed SFT data into expert and sub-optimal sets, training a class-conditioned policy, and performing inference with the expert condition.
Evaluation Highlights
openchat-13b achieves the highest average performance among all 13b open-source language models on Alpaca-Eval, MT-bench, and Vicuna-bench
openchat-13b surpasses gpt-3.5-turbo on Alpaca-Eval, MT-bench, and Vicuna-bench despite using mixed-quality training data
Achieves top-1 average accuracy among all 13b open-source models on AGIEval, demonstrating generalization beyond instruction following
Breakthrough Assessment
9/10
Proposes a theoretically grounded yet extremely simple method (conditional SFT) to solve the complex RLHF problem for mixed-quality data, achieving SOTA results for its size class without preference labels.
⚙️ Technical Details
Problem Definition
Setting: Fine-tuning a pre-trained LLM using a dataset containing a mix of expert and sub-optimal conversations without preference labels.
Inputs: Instruction x and mixed-quality response dataset D (containing subsets D_exp and D_sub)
Outputs: Fine-tuned policy pi_theta(y|x) that generates high-quality responses
Pipeline Flow
Input Processing: Data Augmentation with Class Condition
Generation: Class-Conditioned LLM Inference
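The two pipeline steps can be sketched as follows. The record fields and condition tokens below are hypothetical placeholders; the paper provides its exact conversation templates:

```python
# Step 1 (input processing): tag each conversation with a coarse quality
# class from its source model and prepend a class-specific template.
# Field names and condition tokens here are illustrative assumptions.
def augment(record):
    source = "expert" if record["model"] == "gpt-4" else "sub_optimal"
    tag = "GPT4" if source == "expert" else "GPT3"
    text = f"{tag} User: {record['prompt']} {tag} Assistant: {record['response']}"
    return {"text": text, "class": source}

# Step 2 (generation): at inference, always condition on the expert
# class so the policy imitates its highest-quality source.
def inference_prompt(instruction):
    return f"GPT4 User: {instruction} GPT4 Assistant:"
```

The quality condition thus lives entirely in the prompt text, so any standard decoder-only LLM can be trained and served this way without architectural changes.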
System Modules
Prompt Formatter
Augments input instructions with class-specific templates to denote data source quality
Model or implementation: Rule-based string formatting
Conditional Generator
Generates response based on instruction and quality condition
Model or implementation: llama-2-13b
Novel Architectural Elements
Class-conditioned policy architecture: The model is explicitly trained to accept a 'quality token' (via prompt templates) that acts as a control variable for response quality
Modeling
Base Model: llama-2-13b
Training Method: Conditioned-RLFT (C-RLFT)
Objective Functions:
Purpose: Maximize likelihood of the data given the instruction and the class condition.
Adaptation: Full fine-tuning (implied, as standard for 13B models in this context)
Training Data:
ShareGPT dataset containing mixed GPT-4 (expert) and GPT-3.5 (sub-optimal) conversations
Data split into classes based on source models
Compute: Not reported in the paper
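In symbols, the training objective above can be sketched as follows (a reconstruction consistent with this summary's description; notation may differ from the paper's): a KL-regularized RL objective over coarse class rewards that reduces to a single-stage, reward-weighted supervised loss.

```latex
% KL-regularized RL with coarse class rewards r_c and a
% class-conditioned reference policy \pi_c:
\max_{\pi_\theta}\;
  \mathbb{E}_{(x,y)\sim\pi_\theta}\big[r_c(x,y)\big]
  \;-\; \beta\, D_{\mathrm{KL}}\!\big(\pi_\theta \,\|\, \pi_c\big)
% which admits the single-stage reward-weighted regression form:
J(\theta) \;=\; \mathbb{E}_{(x,y,c)\sim \mathcal{D}}
  \big[\, r_c(x,y)\, \log \pi_\theta(y \mid x, c)\,\big]
```

Here r_c assigns a higher coarse reward to D_exp than to D_sub, so maximizing J(θ) is exactly maximum likelihood on the class-conditioned data, weighted by source quality.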
Comparison to Prior Work
vs. Vicuna: OpenChat conditions on data sources to filter low-quality noise, whereas Vicuna treats all ShareGPT data equally
vs. RLHF: OpenChat uses coarse-grained source labels (expert vs sub-optimal) instead of fine-grained pairwise preferences and avoids unstable PPO training
vs. DPO: OpenChat does not require pairwise preference data (x, y_w, y_l), only single samples with source labels [not cited in paper]
Limitations
Relies on the assumption that data sources (classes) have distinct and consistent quality levels (e.g., GPT-4 is always better than GPT-3.5)
Does not utilize fine-grained preference information within a class (e.g., distinguishing better vs worse GPT-4 responses)
Requires distinct data sources; effectiveness on datasets without clear source distinctions is unclear
Code, data, and models are publicly available at https://github.com/imoneoi/openchat and on Hugging Face. The specific conversation templates are provided in the paper.
📊 Experiments & Results
Evaluation Setup
Instruction following and standard NLP benchmarks
Benchmarks:
Alpaca-Eval (instruction-following evaluation: win rate against text-davinci-003)
MT-bench (multi-turn conversation evaluation, graded by GPT-4)
Vicuna-bench (chatbot-arena-style evaluation)
AGIEval (standardized exams: GRE, SAT, etc.)
Metrics:
Win rate
Average score (MT-bench)
Accuracy (AGIEval)
Statistical methodology: Not explicitly reported in the paper
Key Results
Benchmark: MT-bench
Metric: Average score
Baseline, This Paper, and Δ: Not reported in the paper
Main Takeaways
C-RLFT effectively leverages mixed-quality data (ShareGPT) to outperform baselines that treat all data equally (like Vicuna), without needing preference labels.
The method achieves state-of-the-art performance for 13B open-source models across multiple benchmarks (Alpaca-Eval, MT-bench, Vicuna-bench), reportedly surpassing GPT-3.5-turbo.
The approach is lightweight and avoids the complexities of RLHF (like training separate reward models or PPO), effectively reducing alignment to a single-stage supervised learning problem.
📚 Prerequisite Knowledge
Prerequisites
Supervised Fine-tuning (SFT) for LLMs
Reinforcement Learning from Human Feedback (RLHF)
KL-regularized Reinforcement Learning
Key Terms
C-RLFT: Conditioned Reinforcement Learning Fine-tuning—the proposed method that uses data sources as coarse reward signals to train a class-conditioned policy
SFT: Supervised Fine-Tuning—training a model to predict the next token on a dataset of instruction-response pairs
RLHF: Reinforcement Learning from Human Feedback—aligning models using rewards derived from human preferences
ShareGPT: A dataset of user-shared conversations with ChatGPT (containing both GPT-3.5 and GPT-4 data)
KL divergence: A statistical distance measure used to prevent the fine-tuned model from deviating too far from the reference model
PPO: Proximal Policy Optimization—a common reinforcement learning algorithm used in RLHF
AGIEval: A benchmark evaluating foundation models on human-centric standardized exams (e.g., GRE, SAT)
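As a quick illustration of the KL-divergence term above, a minimal sketch for discrete distributions (e.g., next-token probabilities over a small vocabulary):

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) = sum_i p_i * log(p_i / q_i), in nats.

    Measures how far distribution p (e.g. the fine-tuned policy's
    next-token distribution) drifts from q (the reference policy);
    it is zero exactly when the two distributions are identical.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```

In KL-regularized fine-tuning this quantity is penalized so the updated policy stays close to the (here, class-conditioned) reference policy.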