Unsupervised Elicitation of Language Models

📝 Paper Summary

Unsupervised Alignment, Reward Modeling

Internal Coherence Maximization (ICM) fine-tunes a pretrained model on labels it generates itself: it searches for the label assignment that the model finds most mutually predictable and logically consistent, without external human supervision.

Core Problem

Post-training typically relies on human supervision, which becomes unreliable or impossible to obtain for superhuman capabilities or complex tasks where humans make mistakes.

Why it matters:

Obtaining high-quality human labels for frontier models is expensive and increasingly difficult as models surpass human performance
Current methods like RLHF are limited by human biases and errors, potentially capping model performance at human levels
Models already possess latent concepts (truthfulness, helpfulness) from pretraining that are not fully utilized by standard prompting

Concrete Example: In an author gender prediction task, humans often rely on superficial stereotypes, achieving only 60% accuracy. The pretrained model contains deeper linguistic patterns but needs a way to surface them without being trained on the low-quality human labels.

Key Novelty

Internal Coherence Maximization (ICM)

Defines a scoring function that combines 'mutual predictability' (how well the model can predict each label when conditioned on all the others) with 'logical consistency' (e.g., if A > B is true, then B > A must be false)
Uses a simulated annealing-inspired search algorithm to find the specific label assignment for a dataset that maximizes this internal coherence score
Fine-tunes the model on these self-generated, coherent labels instead of external ground truth

Architecture

Illustration of the Mutual Predictability concept and the ICM iterative search process

Evaluation Highlights

Matches golden label performance on GSM8K and TruthfulQA using Llama-3-70B, despite using zero external labels
Achieves ~80% accuracy on author gender prediction (a superhuman task), significantly outperforming human annotators (60%)
An unsupervised Claude 4 Sonnet assistant trained via ICM matches a human-supervised counterpart on average, with higher scores on chat and safety metrics

Breakthrough Assessment

9/10

Demonstrates alignment of a frontier model (Claude 4 Sonnet) without any human supervision, matching its human-supervised counterpart on average in a realistic setting. A significant step toward overseeing models whose capabilities exceed human level.

⚙️ Technical Details

Problem Definition

Setting: Unsupervised fine-tuning / Label elicitation for classification tasks

Inputs: A set of unlabeled inputs {x_i}

Outputs: Inferred labels {y_i} used to fine-tune the model

Pipeline Flow

Initialization (Start with K randomly labeled examples)
Iterative Search (Sample example -> Predict Label -> Fix Inconsistencies -> Accept/Reject)
Fine-tuning (Train on final label set)
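
The search stage of this pipeline can be sketched as a small loop. This is a minimal illustration under stated assumptions, not the authors' implementation: `icm_search`, `score_fn`, and `propose_fn` are hypothetical names, the two callables stand in for language-model calls, the consistency-fixing step is left out for brevity, and the temperature constants are illustrative values.

```python
import math
import random

def icm_search(inputs, score_fn, propose_fn, k=8, steps=300,
               t0=10.0, t_min=0.01, beta=0.99, seed=0):
    """Search for a coherent labeling; returns {example index: label}."""
    rng = random.Random(seed)
    n = len(inputs)
    # Initialization: K randomly chosen examples get random binary labels.
    labels = {i: rng.randint(0, 1) for i in rng.sample(range(n), min(k, n))}
    for step in range(1, steps + 1):
        # Assumed cooling schedule: temperature decays with log(step).
        temperature = max(t_min, t0 / (1.0 + beta * math.log(step)))
        i = rng.randrange(n)                          # sample an example
        candidate = dict(labels)
        candidate[i] = propose_fn(inputs, labels, i)  # predict its label
        delta = score_fn(inputs, candidate) - score_fn(inputs, labels)
        # Always accept improvements; accept a worse labeling with a
        # probability that shrinks as the temperature drops.
        if delta > 0 or rng.random() < math.exp(delta / temperature):
            labels = candidate
    return labels

# Toy stand-ins for the model: the hidden "concept" is integer parity.
def toy_score(inputs, labels):
    """Count labeled pairs whose label agreement matches parity agreement."""
    idx = list(labels)
    return sum(
        (inputs[a] % 2 == inputs[b] % 2) == (labels[a] == labels[b])
        for p, a in enumerate(idx) for b in idx[p + 1:]
    )

def toy_propose(inputs, labels, i):
    """Pick the label the toy scorer finds most coherent with the rest."""
    return max((0, 1), key=lambda y: toy_score(inputs, {**labels, i: y}))

final_labels = icm_search(list(range(20)), toy_score, toy_propose)
```

In a real run, `score_fn` and `propose_fn` would query the pretrained model, and the returned label set would become the fine-tuning data.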

System Modules

Label Searcher (Label Generation)

Iteratively assigns labels to the dataset to maximize the scoring function

Model or implementation: Pretrained Model (e.g., Llama-3-70B, Claude 4 Sonnet)

Consistency Enforcer (Label Generation)

Resolves logical contradictions when a new label is proposed

Model or implementation: Algorithmic logic check
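
As a rough picture of such a logic check, the sketch below handles one kind of constraint: two claims that assert opposite orderings cannot both be labeled true. The helper names `find_conflicts` and `resolve_conflict`, the claim encoding, and the rule of keeping the best-scoring consistent option are assumptions made for illustration.

```python
from itertools import combinations

def find_conflicts(claims, labels):
    """Return index pairs that assert opposite orderings yet are both labeled true.

    `claims[i]` is a (winner, loser) tuple meaning "winner > loser";
    `labels[i]` is 1 if claim i is currently labeled true, else 0.
    """
    conflicts = []
    for i, j in combinations(sorted(labels), 2):
        reversed_pair = claims[i] == claims[j][::-1]
        if reversed_pair and labels[i] == 1 and labels[j] == 1:
            conflicts.append((i, j))
    return conflicts

def resolve_conflict(labels, pair, score_fn):
    """Relabel one conflicting pair with the consistent option that scores best."""
    i, j = pair
    # Assumption: only the both-true assignment counts as a contradiction.
    options = [(1, 0), (0, 1), (0, 0)]
    best = max(options, key=lambda o: score_fn({**labels, i: o[0], j: o[1]}))
    return {**labels, i: best[0], j: best[1]}

claims = [("A", "B"), ("B", "A"), ("C", "D")]
labels = {0: 1, 1: 1, 2: 1}
conflicts = find_conflicts(claims, labels)

# Hypothetical scorer that happens to favor claim 0 being true.
fixed = resolve_conflict(labels, conflicts[0], lambda d: d[0] - d[1])
```

Here claims 0 and 1 contradict each other, so the check keeps claim 0 as true and flips claim 1 to false, leaving claim 2 untouched.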

Fine-Tuner

Updates the model weights using the generated labels

Model or implementation: Same as pretrained base

Novel Architectural Elements

Inference-time label optimization loop: Uses the model's own in-context predictions as a scoring metric to globally optimize labels before training
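
A rough sketch of how one such in-context prediction might be set up: every other labeled example goes into the prompt as a demonstration, and the model is asked to complete the held-out label. The prompt template and the `build_prompt` helper are assumptions for illustration; the summary does not specify a format.

```python
def build_prompt(examples, labels, target):
    """Format all labeled examples except `target` as demonstrations,
    then append the target input with its label left blank."""
    lines = []
    for i, text in enumerate(examples):
        if i == target or i not in labels:
            continue
        verdict = "True" if labels[i] == 1 else "False"
        lines.append(f"Claim: {text}\nI think this claim is {verdict}")
    lines.append(f"Claim: {examples[target]}\nI think this claim is")
    return "\n\n".join(lines)

examples = ["2 + 2 = 4", "2 + 2 = 5", "3 * 3 = 9"]
labels = {0: 1, 1: 0, 2: 1}
prompt = build_prompt(examples, labels, target=2)
print(prompt)
```

The log-probability the model assigns to the held-out label as the continuation of this prompt would supply one term of the mutual predictability sum.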

Modeling

Base Model: Llama-3-8B, Llama-3-70B, Claude 4 Sonnet

Training Method: Supervised Fine-Tuning (SFT) or Reward Modeling followed by RL

Objective Functions:

Purpose: Measure quality of a label set during search.

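Formally: U(D) = α · P_θ(D) − I(D), where P_θ(D) = Σ_i log P_θ(y_i | x_i, D_{-i}) is the mutual predictability of the label set, I(D) = Σ_{i,j} c(x_i, y_i, x_j, y_j) counts label pairs that violate a logical constraint, and α balances the two terms.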
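
The sketch below computes a score of this kind, treating the consistency term as a count of contradictory label pairs that is subtracted from the weighted mutual predictability sum. The function names, the toy log-probability model, the toy constraint, and the default `alpha` are all assumptions for illustration.

```python
import math

def coherence_score(labels, logprob_fn, inconsistent_fn, alpha=1.0):
    """alpha * (sum of leave-one-out label log-probs) - (inconsistent pairs)."""
    ids = sorted(labels)
    predictability = 0.0
    for i in ids:
        rest = {j: y for j, y in labels.items() if j != i}
        predictability += logprob_fn(i, labels[i], rest)
    inconsistencies = sum(
        inconsistent_fn(a, labels[a], b, labels[b])
        for p, a in enumerate(ids) for b in ids[p + 1:]
    )
    return alpha * predictability - inconsistencies

# Toy stand-in for the model: examples with the same feature are expected
# to share a label (probability 0.9 if they do, 0.1 if they do not).
FEATURES = [0, 0, 1, 1]

def toy_logprob(i, y, rest):
    votes = [label for j, label in rest.items() if FEATURES[j] == FEATURES[i]]
    if not votes:
        return math.log(0.5)
    majority = int(sum(votes) * 2 >= len(votes))
    return math.log(0.9 if y == majority else 0.1)

# Toy constraint: examples 2 and 3 are rival answers, so both cannot be 1.
def toy_inconsistent(a, ya, b, yb):
    return int({a, b} == {2, 3} and ya == 1 and yb == 1)

coherent = coherence_score({0: 1, 1: 1, 2: 0, 3: 0}, toy_logprob, toy_inconsistent)
clashing = coherence_score({0: 1, 1: 1, 2: 1, 3: 1}, toy_logprob, toy_inconsistent)
scrambled = coherence_score({0: 1, 1: 0, 2: 0, 3: 1}, toy_logprob, toy_inconsistent)
```

In this toy setup the coherent labeling scores highest, the labeling that violates the constraint loses one point, and the scrambled labeling is penalized most because none of its labels can be predicted from the others.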

Training Data:

Generated labels via ICM search process
5,000 pairwise preference examples generated for the Claude 4 Sonnet experiment

Key Hyperparameters:

K (initial examples): 8
sampling_weight_factor: 100 (for examples with consistency relations)
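
One plausible reading of the weight factor is as a multiplier on how likely an example is to be drawn during search. The sketch below follows that reading; the helper `sampling_weights` and the example indices are hypothetical.

```python
import random

def sampling_weights(num_examples, related_ids, weight_factor=100):
    """Weight 1 for ordinary examples, `weight_factor` for those that share
    a consistency relation with an already-labeled example."""
    return [weight_factor if i in related_ids else 1 for i in range(num_examples)]

weights = sampling_weights(5, related_ids={1, 3})

# Draw with replacement and measure how often a related example is picked.
rng = random.Random(0)
draws = rng.choices(range(5), weights=weights, k=1000)
share_related = sum(d in {1, 3} for d in draws) / len(draws)
```

With two related examples out of five, the related ones carry 200 of the 203 total weight units, so roughly 98.5% of draws are expected to land on them.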

Compute: Not reported in the paper

Comparison to Prior Work

vs. Zero-shot: ICM optimizes labels globally for consistency and mutual predictability rather than taking single-pass predictions
vs. Weak-to-Strong: Does not require a weak supervisor; elicits latent knowledge directly
vs. RLHF: Removes the need for human preference labels entirely

Limitations

Cannot elicit concepts that are not 'salient' to the pretrained model (e.g., an arbitrary personal preference for text that mentions the word 'sun')
Computationally expensive search process compared to simple prompting
Relies on simple logical constraints which may not capture all task nuances
Risk of data contamination in pretraining corpus is difficult to rule out completely

Reproducibility

Code is not provided. The datasets used (TruthfulQA, GSM8K, Alpaca, Blog Authorship) are public. The Claude 4 Sonnet/Opus models are proprietary. Exact hyperparameters for reward model training and RL training are not fully detailed.

📊 Experiments & Results

Evaluation Setup

Classification of correctness (GSM8K, TruthfulQA) or preference (Alpaca, Gender Prediction)

Benchmarks:

TruthfulQA (Truthfulness Classification)
GSM8K-verification (Mathematical Correctness Classification)
Alpaca (Helpfulness/Harmlessness Preference)
Blog Authorship Corpus (Author Gender Prediction)

Metrics:

Accuracy
Statistical methodology: Not explicitly reported in the paper

Key Results

Comparison of ICM against baselines on standard NLP tasks using Llama-3-70B:

Benchmark	Metric	Baseline	This Paper	Δ
TruthfulQA	Accuracy	0.58	0.58	0.00
GSM8K-verification	Accuracy	0.85	0.85	0.00
Alpaca	Accuracy	0.65	0.74	+0.09

Results on the superhuman task of author gender prediction:

Benchmark	Metric	Baseline	This Paper	Δ
Blog Authorship Corpus	Accuracy	0.60	0.80	+0.20

Experiment Figures

Bar charts comparing ICM accuracy against Zero-shot, Human Label, and Golden Label baselines on TruthfulQA, GSM8K, and Alpaca.

Radar chart comparing Claude 4 Sonnet assistants trained via RL with ICM vs. Human Labels.

Main Takeaways

ICM consistently matches the performance of training on golden labels for tasks where the concept is salient in pretraining
On subjective tasks like Alpaca, ICM outperforms training on crowdsourced human labels
For superhuman tasks (gender prediction), ICM exposes capabilities that human supervision suppresses
RL training with an unsupervised ICM reward model is faster (2.5x speedup) than with a human-supervised RM

📚 Prerequisite Knowledge

Prerequisites

Self-supervised learning
Reinforcement Learning from Human Feedback (RLHF)
Simulated Annealing
In-context learning

Key Terms

ICM: Internal Coherence Maximization—the proposed unsupervised algorithm that labels data by maximizing mutual predictability and logical consistency

Mutual Predictability: A measure of how confidently a model can predict the label of one example when conditioned on the labels of other examples in the dataset

Logical Consistency: Constraints applied to labels to prevent contradictions (e.g., if answer A is correct, answer B cannot also be correct for the same question)

Simulated Annealing: A probabilistic optimization algorithm used here to search for the best set of labels by iteratively accepting or rejecting changes based on a scoring function

In-context learning: The ability of a model to perform a task by conditioning on examples provided within the prompt, used here to estimate mutual predictability

Reward Model (RM): A model trained to predict a scalar score indicating the quality or preference of a response, usually used to guide reinforcement learning

RLHF: Reinforcement Learning from Human Feedback—a standard method for aligning language models using human preference labels

Golden Labels: Ground truth labels provided by the dataset creators or experts, used as a ceiling for performance comparisons

Constitutional AI: A method for aligning AI systems using a set of principles (a constitution) and AI feedback rather than direct human labels