
Making Large Language Models Better Reasoners with Alignment

Peiyi Wang, Lei Li, Liang Chen, Feifan Song, Binghuai Lin, Yunbo Cao, Tianyu Liu, Zhifang Sui
National Key Laboratory for Multimedia Information Processing, Peking University, Tencent Cloud AI, The University of Hong Kong
arXiv (2023)
Reasoning RL

📝 Paper Summary

Mathematical Reasoning · Chain-of-Thought (CoT) Fine-tuning · Alignment / Preference Optimization
Alignment Fine-Tuning (AFT) improves LLM reasoning by calibrating the model's scoring of generated Chain-of-Thought paths using a constraint loss that prevents the degradation of valid but non-optimal reasoning paths.
Core Problem
Vanilla Fine-Tuning (VFT) suffers from 'Assessment Misalignment,' where models assign higher probabilities (scores) to incorrect reasoning paths than to correct non-reference paths because VFT only optimizes the single reference solution.
Why it matters:
  • Standard fine-tuned models cannot accurately assess the quality of their own generated reasoning chains, limiting self-consistency and reranking capabilities.
  • Existing alignment methods like RRHF and PRO degrade reasoning performance because they aggressively down-weight valid but lower-ranked responses without constraints.
Concrete Example: On a math word problem, a VFT model can assign lower perplexity (i.e., a better score) to a candidate answer that gets the arithmetic wrong (e.g., concluding 50 * 0.2 = 100) than to a correct alternative reasoning path that merely differs from the training reference.
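The misalignment above is measured by scoring each candidate path with its (length-normalized) log-likelihood under the model. A minimal sketch of such a scoring function, assuming we already have the model's logits over the candidate's tokens (function name and shapes are illustrative, not from the paper):

```python
import torch
import torch.nn.functional as F

def cot_score(logits: torch.Tensor, target_ids: torch.Tensor) -> float:
    """Score a candidate CoT path as its mean token log-probability.

    Higher score = the model "prefers" this path; a well-calibrated
    model should score correct paths above incorrect ones.
    logits:     [seq_len, vocab_size] model outputs for the candidate
    target_ids: [seq_len] the candidate's token ids
    """
    log_probs = F.log_softmax(logits, dim=-1)
    # Pick out the log-probability of each actual token.
    token_logp = log_probs.gather(-1, target_ids.unsqueeze(-1)).squeeze(-1)
    # Length-normalize so long reasoning chains are not penalized.
    return token_logp.mean().item()
```

A VFT model exhibits assessment misalignment when `cot_score` ranks an incorrect path above a correct non-reference one.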
Key Novelty
Alignment Fine-Tuning (AFT) with Constraint Alignment (CA) Loss
  • Refines a fine-tuned model by generating multiple Chain-of-Thought (CoT) samples and categorizing them as positive (correct answer) or negative.
  • Optimizes the model so that positive CoTs score higher than negatives, but crucially applies a 'constraint' (via gradient detaching or a soft boundary) to prevent the model from crushing the scores of reasonable negative samples.
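The constraint idea in the bullets above can be sketched as a ranking loss with a detached boundary. This is an illustrative simplification under assumed tensor shapes, not the paper's exact CA loss: negatives are only pushed down until they fall a margin `beta` below the lowest positive score, after which they stop receiving gradient:

```python
import torch

def constrained_alignment_loss(
    pos_scores: torch.Tensor,  # [P] scores of correct-answer CoTs
    neg_scores: torch.Tensor,  # [N] scores of incorrect CoTs
    beta: float = 1.0,         # margin below which negatives are left alone
) -> torch.Tensor:
    """Hinge-style ranking loss with a soft lower boundary.

    Positives should outscore negatives, but a negative already more
    than `beta` below the lowest positive is not penalized further --
    this is the 'constraint' that keeps plausible-but-unchosen
    reasoning from being crushed (hypothetical sketch).
    """
    # Boundary is detached so gradients never flow through it.
    boundary = pos_scores.min().detach() - beta
    # Penalize only negatives still above the boundary.
    neg_penalty = torch.clamp(neg_scores - boundary, min=0.0).mean()
    # Every positive should beat every negative (pairwise hinge).
    rank = torch.clamp(
        neg_scores.unsqueeze(0) - pos_scores.unsqueeze(1), min=0.0
    ).mean()
    return rank + neg_penalty
```

Unconstrained ranking losses like RRHF correspond to dropping the boundary, which keeps pushing valid-but-lower-ranked paths to arbitrarily low scores.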
Evaluation Highlights
  • AFT outperforms Vanilla Fine-Tuning (VFT) by +2.57% accuracy on GSM8K using Llama2-7B.
  • In ranking scenarios, AFT achieves 26.08% accuracy on GSM8K-RANK (Llama-7B), whereas unconstrained alignment (RRHF) collapses performance to 7.51%.
  • Generalizes to out-of-domain tasks: AFT improves zero-shot MMLU performance by +1.73% over VFT (Llama-7B).
Breakthrough Assessment
8/10
Identifies a critical flaw in applying standard alignment methods (RLHF/DPO styles) to reasoning: they destroy model capabilities by penalizing 'good enough' reasoning too harshly. The proposed constraint solution is simple and effective.