LogiCoT: Logical Chain-of-Thought Instruction-Tuning

📝 Paper Summary

Chain-of-Thought (CoT) Reasoning, Instruction Tuning, Knowledge Distillation

LogiCoT enhances the logical reasoning capabilities of smaller open-source language models by fine-tuning them on high-quality chain-of-thought rationales distilled from GPT-4 across diverse logical tasks.

Core Problem

General instruction-tuned models (like Alpaca) improve general proficiency but struggle significantly with complex, multi-step logical reasoning tasks compared to proprietary models like GPT-4.

Why it matters:

Current open-source community models lack robust logical deduction skills, inhibiting their use in complex real-world reasoning scenarios.
Developing proprietary reasoning models requires massive undisclosed data/engineering; distilling this capability into smaller models is a cost-effective alternative.

Concrete Example: Given premises 'Jessica plays if and only if it is cloudy' and 'It is late implies Jessica plays', a standard model might fail to deduce 'It is late implies it is cloudy'. LogiCoT teaches the model to explicitly output 'via Biconditional Elimination... via Hypothetical Syllogism' to reach the correct conclusion.
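The inference in this example can be checked mechanically. The sketch below is not part of LogiCoT; it is a brute-force truth-table check, with the helper names `implies` and `entails` chosen here for illustration, confirming that the two premises do entail the stated conclusion.

```python
from itertools import product


def implies(a: bool, b: bool) -> bool:
    """Material implication: a -> b."""
    return (not a) or b


def entails(premises, conclusion, n_vars: int) -> bool:
    """True if every assignment satisfying all premises also satisfies the conclusion."""
    for values in product([False, True], repeat=n_vars):
        if all(p(*values) for p in premises) and not conclusion(*values):
            return False
    return True


# Propositional variables, in order: plays, cloudy, late
premises = [
    lambda plays, cloudy, late: plays == cloudy,       # Jessica plays <-> it is cloudy
    lambda plays, cloudy, late: implies(late, plays),  # it is late -> Jessica plays
]
conclusion = lambda plays, cloudy, late: implies(late, cloudy)  # late -> cloudy

print(entails(premises, conclusion, 3))  # True
```

The converse, 'it is cloudy implies it is late', is not entailed by the same premises, so a model cannot reach the right answer by pairing up propositions arbitrarily.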

Key Novelty

Logical Chain-of-Thought Distillation

Constructs a dataset by repurposing existing logical benchmarks (symbolic and narrative) and prompting GPT-4 to act as a 'teaching assistant' that generates step-by-step rationales.
Introduces specific instruction types (e.g., Language-to-Logic, Inference Chains, Argument Strengthening) to force the model to learn structured logical transitions.

Architecture

The data construction and instruction tuning pipeline.

Evaluation Highlights

+32.2 percentage-point accuracy improvement on LogiQA 2.0 (logical reading comprehension) over the LLaMA-7b-base model.
Outperforms the larger LLaMA-30b-supercot model on 6 out of 8 logical reasoning datasets, despite having fewer parameters.
Achieves parity with ChatGPT on English logical reasoning tasks like ReClor (57.60% vs 57.38%) and LogiQA OOD (38.79% vs 38.44%).

Breakthrough Assessment

7/10

Significant performance jump on specific logic tasks for a small 7B model. Demonstrates high utility of domain-specific CoT distillation, though it still lags behind GPT-4.

⚙️ Technical Details

Problem Definition

Setting: Instruction tuning a Large Language Model (LLM) on a dataset of (Instruction, Input, CoT Rationale, Output) tuples.

Inputs: Natural language instruction $I$ and context/question $X$.

Outputs: Chain-of-thought rationale $R$ followed by the final answer $Y$.
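As a minimal sketch, one training instance can be represented as a small record whose target text is the rationale followed by the answer. The class name `LogicExample`, its field names, and the `"Answer:"` separator are assumptions made here for illustration; the summary does not specify the exact serialization, and the rationale text below is illustrative rather than taken from the dataset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogicExample:
    """One (Instruction, Input, CoT Rationale, Output) tuple."""

    instruction: str  # I: natural language instruction
    context: str      # X: context / question
    rationale: str    # R: chain-of-thought rationale
    answer: str       # Y: final answer

    def target(self) -> str:
        """Text the model is trained to produce: rationale first, then the answer.

        The "Answer:" separator is a hypothetical choice, not the paper's format.
        """
        return f"{self.rationale}\nAnswer: {self.answer}"


ex = LogicExample(
    instruction="What can be inferred from the premises?",  # illustrative wording
    context="Jessica plays if and only if it is cloudy. It is late implies Jessica plays.",
    rationale=(
        "Jessica plays implies it is cloudy, via Biconditional Elimination. "
        "It is late implies it is cloudy, via Hypothetical Syllogism."
    ),
    answer="It is late implies it is cloudy.",
)
print(ex.target())
```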

Pipeline Flow

Instruction Formatter (Templates raw data into prompt)
LLaMA-7b-logicot (Processes input)
Output Generator (Produces Reasoning Chain + Answer)

System Modules

Instruction Formatter

Wrap input data with task-specific instructions (e.g., 'Identify the Necessary Claim')

Model or implementation: Rule-based templates
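A rule-based formatter of this kind might look like the sketch below. It assumes the Stanford Alpaca prompt layout, since the training scripts are said to follow that implementation. Whether LogiCoT uses this exact wording is not confirmed here. The task key `necessary_claim` and the function name `format_prompt` are hypothetical. Only the instruction string 'Identify the Necessary Claim' comes from the text above.

```python
# Maps a task identifier to its instruction text. Only this one instruction is
# quoted in the summary; other task types would be registered the same way.
TASK_INSTRUCTIONS = {
    "necessary_claim": "Identify the Necessary Claim",
}

# Alpaca-style prompt layout (assumed, not confirmed for LogiCoT).
PROMPT_TEMPLATE = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. Write a response that appropriately "
    "completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n"
)


def format_prompt(task: str, raw_input: str) -> str:
    """Wrap raw input data with the task-specific instruction."""
    try:
        instruction = TASK_INSTRUCTIONS[task]
    except KeyError:
        raise ValueError(f"no template registered for task {task!r}") from None
    return PROMPT_TEMPLATE.format(instruction=instruction, input=raw_input.strip())


print(format_prompt("necessary_claim", "Passage and answer options go here."))
```

Keeping the templates in a plain dictionary makes it easy to mix several instruction types into one training set without touching the formatting logic.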

LLaMA-7b-logicot

Generate logical reasoning steps and final conclusion

Model or implementation: LLaMA-7b (Fine-tuned)

Novel Architectural Elements

Integration of varied logical instruction types (Symbolic translation, Inference chaining, MRC) into a single instruction-tuning mix to enable cross-task logical generalization.

Modeling

Base Model: LLaMA-7b

Training Method: Supervised Fine-Tuning (Instruction Tuning)

Objective Functions:

Purpose: Maximize the likelihood of the target tokens (rationale + answer) given the instruction and input.

Formally: Standard Cross-Entropy Loss.
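Written out with the symbols from the Problem Definition, the token-level cross-entropy takes roughly the following form. The notation $\theta$ (model parameters) and $Z$ (the target sequence, i.e. $R$ followed by $Y$) is introduced here. Restricting the sum to target tokens, rather than also scoring the prompt tokens, is an assumption in line with common instruction-tuning practice, and any length normalization is left out.

```latex
\mathcal{L}(\theta) = -\sum_{t=1}^{|Z|} \log p_\theta\left(z_t \mid I, X, z_{<t}\right),
\qquad Z = (R, Y) = (z_1, \dots, z_{|Z|})
```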

Adaptation: Full fine-tuning

Training Data:

Total size: 68,983 instances.
Sources: Logic Inference, EntailmentBank, and FOLIO (sequence-to-sequence tasks); LogiQA and ReClor (MRC tasks).
Augmentation: GPT-4 used to generate CoT rationales for the MRC tasks and Logic Inference tasks where needed.

Key Hyperparameters:

learning_rate: 2e-5
batch_size: 4
epochs: 2

Compute: 2x A100 GPUs, 4 days training time. DeepSpeed library used.
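To get a feel for the training length these numbers imply, the sketch below counts optimizer steps. The summary does not say whether the batch size of 4 is global or per GPU, nor whether gradient accumulation was used, so both readings are shown and no accumulation is assumed. The function name `optimizer_steps` is illustrative.

```python
import math


def optimizer_steps(n_examples: int, batch_size: int, epochs: int, n_devices: int = 1) -> int:
    """Number of optimizer steps, keeping the final partial batch of each epoch.

    Assumes no gradient accumulation (not stated in the summary).
    """
    effective_batch = batch_size * n_devices
    return math.ceil(n_examples / effective_batch) * epochs


# Reading 1: batch_size = 4 is the global batch size.
print(optimizer_steps(68_983, batch_size=4, epochs=2))               # 34492

# Reading 2: batch_size = 4 per GPU on 2 GPUs (effective batch of 8).
print(optimizer_steps(68_983, batch_size=4, epochs=2, n_devices=2))  # 17246
```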

Comparison to Prior Work

vs. LLaMA-30b-supercot: LogiCoT uses logic-specific instructions and rationales rather than general CoT data.
vs. Alpaca: LogiCoT focuses on rigorous multi-step deduction (symbolic and text) rather than general conversational ability.
vs. PINTO [not cited in paper]: PINTO focuses on counterfactual regularization for faithfulness, while LogiCoT focuses on scaling logical instruction diversity.

Limitations

Weaker performance on Chinese language tasks (LogiQA-zh) compared to English.
Still trails behind GPT-4 significantly (approx. 20% gap on LogiEval overall).
Human evaluation indicates 'Faithfulness' of generated chains (4.5/5) lags slightly behind Relevance (4.9/5).
Evaluated primarily on multi-choice and short answer; open-ended generation less explored.

Reproducibility

Dataset: https://huggingface.co/datasets/csitfun/LogiCoT

Dataset and model weights are publicly available on HuggingFace. Training scripts follow the Stanford Alpaca implementation. GPT-4 was used via API for data generation (costs not specified).

📊 Experiments & Results

Evaluation Setup

Zero-shot evaluation on logical reasoning and general knowledge benchmarks.

Benchmarks:

LogiEval: logical reasoning suite (LogiQA, ReClor, AR-LSAT, etc.)
MMLU: general knowledge (Massive Multitask Language Understanding)

Metrics:

Accuracy (%)
Exact Match (for instruction following)
Statistical methodology: Not explicitly reported in the paper
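The two metrics above can be sketched as follows. This is a generic illustration, not the paper's evaluation script. In particular, the whitespace and case normalization in `normalize` is an assumption, since the summary does not describe how answers are matched.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (assumed normalization, not from the paper)."""
    return " ".join(text.lower().split())


def accuracy(predictions: list[str], golds: list[str]) -> float:
    """Percentage of predictions that equal the gold label."""
    if len(predictions) != len(golds):
        raise ValueError("predictions and golds must have the same length")
    if not golds:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(p == g for p, g in zip(predictions, golds))
    return 100.0 * correct / len(golds)


def exact_match(predictions: list[str], references: list[str]) -> float:
    """Percentage of predictions matching the reference after normalization."""
    return accuracy(
        [normalize(p) for p in predictions],
        [normalize(r) for r in references],
    )


print(accuracy(["A", "B", "C", "D"], ["A", "B", "D", "D"]))  # 75.0
print(exact_match(["  It is Cloudy "], ["it is cloudy"]))    # 100.0
```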

Key Results

LogiCoT significantly outperforms the base model and larger instruction-tuned baselines on logical reasoning benchmarks (Δ in percentage points).

Benchmark	Metric	Baseline	This Paper	Δ
LogiQA 2.0	Accuracy	18.04	50.25	+32.21
ReClor	Accuracy	15.83	57.60	+41.77
MMLU (Average)	Accuracy	31.8	43.3	+11.5
LogiEval Overall	Accuracy	24.78	40.69	+15.91

Ablation studies show that removing specific reasoning instruction types degrades performance (each row removes one instruction type from the full LogiCoT model).

Benchmark	Metric	Full Model	Ablated	Δ
LogiEval Overall	Accuracy	40.7	30.8	-9.9
LogiEval Overall	Accuracy	40.7	32.4	-8.3
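The Δ values are absolute differences in accuracy (percentage points), not relative changes. The short check below re-derives them from the reported baseline and LogiCoT accuracies and prints the relative change alongside for contrast. The variable and function names are illustrative.

```python
# (benchmark, baseline accuracy, LogiCoT accuracy), as reported above
rows = [
    ("LogiQA 2.0", 18.04, 50.25),
    ("ReClor", 15.83, 57.60),
    ("MMLU (Average)", 31.8, 43.3),
    ("LogiEval Overall", 24.78, 40.69),
]


def delta_points(base: float, ours: float) -> float:
    """Absolute difference in percentage points, rounded to two decimals."""
    return round(ours - base, 2)


def relative_change(base: float, ours: float) -> float:
    """Relative change in percent of the baseline value."""
    return 100.0 * (ours - base) / base


for name, base, ours in rows:
    print(f"{name}: {delta_points(base, ours):+.2f} points, "
          f"{relative_change(base, ours):+.1f}% relative")
```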

Main Takeaways

Specialized instruction tuning on logical CoT allows a 7B model to outperform a larger general-purpose CoT model (LLaMA-30b-supercot) on most of the logical reasoning datasets evaluated.
Symbolic reasoning data (Language to Logic) contributes significantly to performance on natural language tasks, suggesting transferability of abstract logic skills.
The model generalizes well to general knowledge tasks (MMLU) despite being tuned specifically for logic, gaining 11.5 percentage points over the base model.
Performance on AR-LSAT (analytical reasoning) remains low for all models (including LogiCoT), highlighting a remaining challenge in handling complex constraint satisfaction problems.

📚 Prerequisite Knowledge

Prerequisites

Large Language Models (LLMs)
Instruction Tuning
Chain-of-Thought (CoT) Prompting
Symbolic Logic (Propositional and First-Order Logic)

Key Terms

CoT: Chain-of-Thought—a technique where models generate intermediate reasoning steps before the final answer.

SFT: Supervised Fine-Tuning—training a pre-trained model on a labeled dataset to follow specific instructions.

FOL: First-Order Logic—a formal system using quantifiers (forall, exists) and predicates to express logical relations.

Distillation: The process of training a smaller 'student' model to mimic the outputs or behavior of a larger 'teacher' model (here, GPT-4).

Zero-Shot-CoT: Prompting a model with 'Let's think step by step' without providing examples, to elicit reasoning.

MMLU: Massive Multitask Language Understanding—a benchmark covering 57 subjects like math, history, and law.

LogiEval: A benchmark suite specifically designed to test logical reasoning, comprising datasets like LogiQA, ReClor, and AR-LSAT.

MRC: Machine Reading Comprehension—tasks where the model answers questions based on a provided text passage.