Dancing in Chains: Reconciling Instruction Following and Faithfulness in Language Models

📝 Paper Summary

Alignment training, Objective conflict in fine-tuning

ReSet reconciles the trade-off between instruction following and faithfulness by using rejection sampling to curate high-quality fine-tuning data that aligns both objectives.

Core Problem

Training language models on instruction-following data degrades their faithfulness to context, while training on context-dependent data degrades their ability to follow open-ended instructions.

Why it matters:

Modern LMs must be both helpful (follow instructions) and reliable (faithful to context), but current alignment methods often sacrifice one for the other
Mixing disparate datasets (creative writing vs. extractive QA) creates objective conflicts: the two data sources reward different behaviors, so their training signals interfere during optimization
Existing solutions like simple Multi-Task Learning (MTL) are sub-optimal and fail to fully resolve the interference between these competing goals

Concrete Example: When a model fine-tuned for faithfulness (context-dependent QA) is further trained on Alpaca (instruction following), its faithfulness score drops from 0.82 to 0.55 on abstractive tasks because it learns to hallucinate rather than ground its answers.

Key Novelty

Rejection Sampling for Continued Self-instruction Tuning (ReSet)

Instead of just mixing datasets, the method uses the model itself to generate multiple candidate responses for training prompts, varying decoding parameters like temperature
External judges (like GPT-4) score these generations on both instruction following and faithfulness, and only the highest-rated responses are kept
The model is then fine-tuned on this small, high-quality filtered dataset, effectively aligning it to the intersection of both objectives without massive retraining

Architecture

The ReSet pipeline: sampling generations from an MTL checkpoint, scoring them with judges, and filtering for continued fine-tuning.

Evaluation Highlights

+18.8% relative faithfulness improvement over the Multi-Task Learning (MTL) baseline on unseen datasets using ReSet
+31.3% relative faithfulness improvement over the MTL baseline using ReSet-S (higher quality, 3x less data)
Maintains high instruction-following scores (0.75+) while significantly recovering the faithfulness lost during standard instruction tuning

Breakthrough Assessment

7/10

Provides clear empirical evidence of the trade-off between key alignment objectives and proposes a practical, data-efficient solution (ReSet) that outperforms standard MTL.

⚙️ Technical Details

Problem Definition

Setting: Two-stage fine-tuning where models must maximize two potentially conflicting scores: Instruction Following (IF) and Faithfulness (F)

Inputs: Instruction I and optional Context C

Outputs: Response R that follows I and is grounded in C

Pipeline Flow

Input Prompt (Instruction + Context) -> Generator Model (Candidate Sampling)
Candidates -> External Judges (Scoring)
Scored Candidates -> Rejection Sampling (Filtering)
Selected Candidates -> Fine-tuning (Optimization)
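
The four stages above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `generate` and `judge` are hypothetical callables standing in for the generator model and the external judge, the temperature grid and the `min_score` threshold are invented for the example, and the rule "keep the top candidate only if both scores clear the threshold" is one plausible reading of filtering for the intersection of both objectives.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Candidate:
    prompt: str
    response: str
    instr_score: float  # judge score for instruction following, assumed in [0, 1]
    faith_score: float  # judge score for faithfulness, assumed in [0, 1]


def reset_filter(
    prompts: List[str],
    generate: Callable[[str, float], str],             # hypothetical generator wrapper
    judge: Callable[[str, str], Tuple[float, float]],  # hypothetical judge wrapper
    temperatures: Tuple[float, ...] = (0.3, 0.7, 1.0),  # assumed values
    min_score: float = 0.5,                              # assumed threshold
) -> List[Candidate]:
    """Sample candidates per prompt, score them, and keep the best one
    only when it is acceptable on BOTH objectives."""
    kept: List[Candidate] = []
    for prompt in prompts:
        candidates = []
        for temperature in temperatures:
            response = generate(prompt, temperature)
            instr, faith = judge(prompt, response)
            candidates.append(Candidate(prompt, response, instr, faith))
        # Rank by the sum of both judge scores (an assumed ranking rule).
        best = max(candidates, key=lambda c: c.instr_score + c.faith_score)
        # Reject the prompt entirely if even the best candidate fails one objective.
        if min(best.instr_score, best.faith_score) >= min_score:
            kept.append(best)
    return kept


def toy_generate(prompt: str, temperature: float) -> str:
    """Toy stand-in for a language model: tags the response with its temperature."""
    return f"{prompt} [t={temperature}]"


def toy_judge(prompt: str, response: str) -> Tuple[float, float]:
    """Toy stand-in for a judge: higher temperature scores higher on
    instruction following and lower on faithfulness."""
    t = float(response.rsplit("t=", 1)[1].rstrip("]"))
    return 0.4 + 0.5 * t, 1.0 - 0.6 * t


if __name__ == "__main__":
    for c in reset_filter(["Summarize the context."], toy_generate, toy_judge):
        print(c.response, c.instr_score, c.faith_score)
```

The retained candidates would then form the small dataset used for the continued fine-tuning stage.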

System Modules

Generator

Generate multiple candidate responses for a given input using varied decoding parameters

Model or implementation: Vicuna-7B (fine-tuned via MTL first)
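
To make "varied decoding parameters" concrete, the sketch below shows how temperature reshapes a next-token distribution before sampling. It is a generic illustration in pure Python, not code from the paper: the function names and the example logits are assumptions, and a real generator would apply the same scaling inside its decoding loop.

```python
import math
import random
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Convert logits to probabilities after dividing by the temperature.
    Low temperature sharpens the distribution; high temperature flattens it."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits: List[float], temperature: float, rng: random.Random) -> int:
    """Draw one token index from the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


if __name__ == "__main__":
    logits = [2.0, 1.0, 0.0]  # made-up logits for three tokens
    rng = random.Random(0)
    for temperature in (0.5, 1.0, 2.0):
        probs = softmax_with_temperature(logits, temperature)
        print(temperature, [round(p, 3) for p in probs], sample_token(logits, temperature, rng))
```

Sampling the same prompt at several temperatures is one simple way to obtain a diverse candidate pool for the judges to score.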

Judge

Score generations on faithfulness, instruction following, and task performance

Model or implementation: GPT-4 (or ChatGPT for ReSet-S)

Fine-tuner

Update model weights using the curated 'Gold' dataset

Model or implementation: LLaMA-7B / Vicuna-7B

Novel Architectural Elements

Use of rejection sampling specifically to reconcile conflicting objectives (faithfulness vs. instruction following) by filtering for the intersection of both
Data-centric pipeline where the model self-corrects via judge-filtered self-instruction rather than architectural modification

Modeling

Base Model: LLaMA-7B and Vicuna-7B

Training Method: Supervised Fine-Tuning (SFT) on curated data

Objective Functions:

Purpose: Rank and select best generation.

Formally: Score = w_task * S_task + w_instr * S_instr * I_instr + w_faith * S_faith * I_faith

Purpose: Standard language modeling loss for fine-tuning.

Formally: Minimize negative log-likelihood of selected tokens
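
The ranking score above can be written as a small function. This sketch rests on assumptions the summary does not confirm: I_instr and I_faith are read here as 0/1 indicators for whether an example carries an instruction-following or a faithfulness requirement, the default weights of 1.0 are placeholders, and the names `JudgeScores`, `combined_score`, and `select_best` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class JudgeScores:
    task: float   # S_task
    instr: float  # S_instr
    faith: float  # S_faith


def combined_score(
    s: JudgeScores,
    has_instruction: bool,  # assumed meaning of I_instr
    has_context: bool,      # assumed meaning of I_faith
    w_task: float = 1.0,    # placeholder weight
    w_instr: float = 1.0,   # placeholder weight
    w_faith: float = 1.0,   # placeholder weight
) -> float:
    """Score = w_task*S_task + w_instr*S_instr*I_instr + w_faith*S_faith*I_faith."""
    i_instr = 1.0 if has_instruction else 0.0
    i_faith = 1.0 if has_context else 0.0
    return w_task * s.task + w_instr * s.instr * i_instr + w_faith * s.faith * i_faith


def select_best(
    candidates: List[Tuple[str, JudgeScores]],
    has_instruction: bool,
    has_context: bool,
) -> str:
    """Return the response text with the highest combined score."""
    best = max(candidates, key=lambda c: combined_score(c[1], has_instruction, has_context))
    return best[0]


if __name__ == "__main__":
    pool = [
        ("fluent but ungrounded", JudgeScores(task=0.6, instr=0.9, faith=0.2)),
        ("grounded and on-task", JudgeScores(task=0.6, instr=0.7, faith=0.9)),
    ]
    print(select_best(pool, has_instruction=True, has_context=True))
```

Under this reading, the indicator terms switch off a component for examples where that objective does not apply, for instance an open-ended instruction with no context.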

Adaptation: Full fine-tuning

Trainable Parameters: All parameters (7B model)

Training Data:

Instruction datasets: Dolly-15K, ShareGPT, Self-Instruct, OASST-1
Context-dependent datasets: NQ, MS MARCO, CNN/DM (RobustQA benchmark)

Key Hyperparameters:

learning_rate: 2e-5
batch_size: 128
epochs: 1 (for ReSet stage)
max_seq_length: 2048
ReSet_dataset_size: 8,000 examples (standard) or 2,000 (ReSet-S)

Compute: Not reported in the paper

Comparison to Prior Work

vs. Vanilla MTL: ReSet filters data for quality (meeting both objectives) rather than just mixing conflicting data points
vs. Standard Instruction Tuning: ReSet explicitly incorporates faithfulness constraints into the data selection process
vs. RAG-tuning: ReSet maintains open-ended instruction following capabilities while improving grounding

Limitations

Relies on GPT-4 as an external judge, which may be costly or have its own biases
Experiments limited to 7B parameter models (LLaMA-1 era)
Does not explore iterative rounds of ReSet (only one iteration performed)
Evaluation relies heavily on automated metrics (LLM-as-a-judge, SummaC) which are proxies for human judgment

Reproducibility

Code: https://github.com/frankaging/dancing-in-chains

📊 Experiments & Results

Evaluation Setup

Evaluated on held-out instruction following sets and context-dependent sets

Benchmarks:

Vicuna-eval (Instruction Following)
Alpaca-eval (Instruction Following)
BioASQ, SearchQA, WikiSum (Context-dependent QA/Summarization, unseen)

Metrics:

Faithfulness Score (Span Coverage / SummaC-ZS)
Instruction Following Score (LLM-as-a-Judge)
Task Performance (EM / ROUGE-L)
Statistical methodology: Not explicitly reported in the paper
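
For readers unfamiliar with the task-performance metrics, here is a simplified sketch of Exact Match and ROUGE-L. It is illustrative only: the normalization (lowercasing and whitespace tokenization) and the plain F1 form of ROUGE-L are assumptions, and the paper's evaluation scripts may differ. The faithfulness metrics (Span Coverage, SummaC-ZS) are not sketched because the summary does not define them precisely.

```python
from typing import List


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the strings match after trimming and lowercasing, else 0.0."""
    return float(prediction.strip().lower() == reference.strip().lower())


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence, via row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(prediction: str, reference: str) -> float:
    """Simplified ROUGE-L: F1 of LCS-based precision and recall over tokens."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    lcs = lcs_length(pred_tokens, ref_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred_tokens)
    recall = lcs / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(exact_match("Paris ", "paris"))
    print(rouge_l_f1("the cat sat", "the cat sat on the mat"))
```

EM suits short extractive answers, while ROUGE-L tolerates wording differences in longer outputs such as summaries.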

Key Results

Trade-off analysis showing that optimizing for one objective hurts the other.

Benchmark	Metric	Before further tuning	After further tuning	Δ
Abstractive QA (Faithfulness)	Faithfulness Score	0.82	0.55	-0.27
Instruction Following	Instruction Following Score	0.79	0.49	-0.30

Main results comparing ReSet to baselines on unseen datasets.

Benchmark	Metric	Baseline (MTL)	This Paper	Δ
Unseen Datasets (Average)	Faithfulness Score	0.32	0.38 (ReSet)	+0.06
Unseen Datasets (Average)	Faithfulness Score	0.32	0.42 (ReSet-S)	+0.10
Unseen Datasets (Average)	Instruction Following Score	0.75	0.75	0.00
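
The percentage gains quoted under Evaluation Highlights can be checked against the absolute scores in the main results. The snippet below takes 0.32 as the MTL baseline and 0.38 / 0.42 as the ReSet / ReSet-S faithfulness scores, a reading inferred from matching the reported percentages; the helper name `relative_gain` is hypothetical.

```python
def relative_gain(baseline: float, new: float) -> float:
    """Relative improvement over the baseline, in percent."""
    return (new - baseline) / baseline * 100.0


if __name__ == "__main__":
    mtl, reset, reset_s = 0.32, 0.38, 0.42  # faithfulness scores from the main results
    print(f"ReSet   vs MTL: {relative_gain(mtl, reset):.2f}%")
    print(f"ReSet-S vs MTL: {relative_gain(mtl, reset_s):.2f}%")
```

The two values come out at roughly 18.75% and 31.25%, consistent with the +18.8% and +31.3% figures once rounded, which indicates those highlights are relative rather than absolute improvements.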

Experiment Figures

Radar charts demonstrating the trade-off: LLaMA-7B fine-tuned for Context loses Faithfulness when tuned for Instructions, and Vicuna-7B loses Instruction Following when tuned for Context.

Faithfulness drop broken down by generation length.

Main Takeaways

There is a stark trade-off: training for open-ended instruction following makes models hallucinate more (less faithful), and training for strict grounding makes them worse at following instructions.
Less is more: ReSet-S trains on 2,000 examples versus 8,000 for ReSet (reported as three-fold less data), yet achieves better faithfulness by filtering for higher quality.
Multi-Task Learning (MTL) is a strong baseline but suboptimal because it includes conflicting gradient signals; Rejection Sampling (ReSet) resolves this by selecting data points where objectives align.
Longer generations are more prone to faithfulness degradation when fine-tuning on instruction data.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Instruction Tuning (e.g., Alpaca, Vicuna)
Knowledge of RAG (Retrieval-Augmented Generation) and context-dependent QA
Familiarity with Rejection Sampling and Self-Instruct paradigms

Key Terms

ReSet: Rejection Sampling for Continued Self-instruction Tuning—the proposed method of filtering model generations to create a high-quality fine-tuning dataset

Faithfulness: The degree to which a model's response is grounded in and supported by the provided source context, rather than hallucinated

Instruction Following: The ability of a model to adhere to open-ended user requests, style constraints, and formatting rules

MTL: Multi-Task Learning—training a model simultaneously on mixed datasets (here, both instruction-following and context-dependent data)

LLM-as-a-Judge: Using a strong LLM (like GPT-4) to evaluate the quality of outputs from a smaller model

Rejection Sampling: A technique where multiple samples are generated, evaluated against a criterion, and only valid/high-quality samples are retained for training

Supercharge: In this paper, a variant of ReSet (ReSet-S) that uses more aggressive sampling and filtering to create a smaller but higher-quality dataset

SummaC-ZS: A zero-shot metric for checking if a summary or answer is entailed by the source text, used here to measure faithfulness