Self-supervised analogical learning using language models

📝 Paper Summary

Reasoning Consistency Symbolic Reasoning Analogical Learning

Sal improves reasoning consistency by training models to generate abstract symbolic programs from self-generated analogous questions, transferring successful reasoning patterns from familiar to rare scenarios.

Core Problem

Large language models suffer from reasoning inconsistency, failing on unfamiliar questions even when they can solve structurally identical questions involving common entities.

Why it matters:

Inconsistency prevents deployment in mission-critical tasks like medical chatbots requiring trustworthy decision-making
Even advanced models like OpenAI o1 fail on rare cases despite knowing the relevant facts, showing that memorization doesn't equal robust reasoning capability

Concrete Example: A model correctly answers 'Is Donald Trump in New York?' (common entity) but fails on 'Is [Less Common Person] in [Location]?' even though it knows the person's location, simply because the entity combination is rare.

Key Novelty

Self-supervised Analogical Learning (Sal)

Conceptualization: Generates abstract versions of a hard question, creates many easier similar questions, solves them to get symbolic programs, and uses those programs as supervision for the original hard question.
Simplification: Decomposes complex math problems into simpler sub-questions, iteratively building a 'known conditions' set to generate high-quality symbolic programs for self-supervision.

Architecture

The Conceptualization extraction pipeline.

Evaluation Highlights

Outperforms base language models and Chain-of-Thought baselines by 2% to 20% across StrategyQA, GSM8K, and HotpotQA benchmarks.
Simplification method increases the yield of high-confidence self-supervision programs from 25.3% to 44.3% on GSM8K math questions.
Demonstrates improved generalizability and controllability due to the use of symbolic programmatic solutions.

Breakthrough Assessment

8/10

Strong conceptual contribution in using self-generated analogies for supervision. Addresses a critical LLM weakness (consistency) with significant empirical gains (up to 20%) without requiring external human labels.

⚙️ Technical Details

Problem Definition

Setting: Self-supervised learning for complex reasoning tasks (QA and Math)

Inputs: Natural language question q

Outputs: Symbolic program p that executes to the final answer

Pipeline Flow

Program Generation (generate Python program based on input q)
Program Execution (execute program using Python interpreter)
Helper Execution (handle ask_llm calls during execution via LLM query)

System Modules

Program Generator

Generate a symbolic Python program solving the question

Model or implementation: Fine-tuned Base LLM (e.g., Mixtral-8x7B)

Executor (Execution)

Run the generated Python code to obtain the final answer

Model or implementation: Python Interpreter

LLM Oracle (Execution)

Answer knowledge-intensive queries called within the program

Model or implementation: Base LLM

Novel Architectural Elements

Analogical self-supervision loop: Training data is generated by mapping hard questions to abstract templates, instantiating easy similar questions, solving those, and transferring the solution program back to the original hard question.

Modeling

Base Model: Mixtral-8x7B-Instruct

Training Method: Supervised Fine-Tuning (SFT) on self-generated data

Objective Functions:

Purpose: Standard language modeling loss on the generated programs.

Formally: Cross-entropy loss on token sequences of the target programs.

Adaptation: Full fine-tuning (implied by context of general instruction tuning)

Training Data:

Seed data: ~500 questions (StrategyQA + GSM8K) with CoT and programs used to prime the generator
Self-supervision data: Generated via Conceptualization (20 similar questions per abstract question) and Simplification pipelines

Key Hyperparameters:

N (similar questions per abstract): 20
K (CoT samples): 10
X (agreement threshold): 9
+ 1 more
paraphrase_threshold: 0.95 (drop if program query is too similar to question)

Compute: Not reported in the paper

Comparison to Prior Work

vs. CoT: Sal generates symbolic programs trained on analogous easy cases, improving consistency on rare entities.
vs. PoT: Sal uses an explicit 'conceptualization' step to mine training data from easier analogical questions, rather than just training on gold programs.
vs. Least-to-Most: Sal's 'simplification' is a data generation strategy for self-training, not just an inference-time prompting strategy.
+ 1 more
vs. Analogical Prompting [not cited in paper]: Sal performs analogical learning during training via self-supervision, whereas analogical prompting relies on retrieving analogies during inference.

Limitations

Dependency on the base model's ability to solve 'easy' similar questions; if the model fails there, no supervision is generated.
Computational cost of generating N=20 similar questions and K=10 CoT paths during the data generation phase.
Limited to tasks that can be framed as programmatic reasoning (QA and Math), might struggle with purely creative generation.

Reproducibility

Code and data will be released upon publication. Seed supervision prompt templates are described (cot(q), abs(q), sim(q), q2p(q)). Base model Mixtral-8x7B-Instruct is open weights.

📊 Experiments & Results

Evaluation Setup

Complex reasoning tasks involving facts and mathematics

Benchmarks:

StrategyQA (Multi-hop reasoning QA)
GSM8K (Grade school math word problems)
HotpotQA (Multi-hop QA)
ARC (Science QA)
CommonsenseQA (Commonsense reasoning)

Metrics:

Accuracy
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Sal consistently outperforms baselines on reasoning benchmarks, with significant gains on math and multi-hop QA tasks.
StrategyQA	Accuracy	Not explicitly reported in the paper	Not explicitly reported in the paper	Not explicitly reported in the paper
GSM8K	Accuracy	Not explicitly reported in the paper	Not explicitly reported in the paper	Not explicitly reported in the paper
GSM8K (subset)	High-confidence program yield	25.3	44.3	+19.0

Experiment Figures

Illustration of the consistency issue in LLMs using three questions with identical reasoning processes.

Main Takeaways

Sal effectively mitigates consistency issues by training models to recognize abstract reasoning patterns.
The 'Simplification' pipeline significantly increases the volume of high-quality training data for math problems by decomposing them into simpler steps.
Symbolic programs offer better interpretability and controllability than free-text CoT.

📚 Prerequisite Knowledge

Prerequisites

Chain-of-Thought (CoT) prompting
Program-of-Thought / Program-aided Language Models
Instruction Tuning
Self-Consistency (Majority Voting)

Key Terms

Sal: Self-supervised Analogical Learning—the proposed framework that trains models to transfer reasoning patterns from easy/familiar questions to hard/rare ones via symbolic programs

conceptualization: A process in Sal that finds easier, structurally identical questions to a target question, solves them, and transfers their symbolic solution back to the target

simplification: A process in Sal for math problems that decomposes complex questions into iterative sub-questions to reduce cognitive load and generate reliable symbolic solutions

symbolic solution: A Python program generated by the model that, when executed, produces the final answer (as opposed to generating free text)

ask_llm: A helper function within the symbolic programs that allows the code to query an LLM for knowledge retrieval or soft comparisons

chain-of-thought: A prompting technique where the model generates intermediate reasoning steps before the final answer

silver labels: Labels generated by a model (e.g., via majority voting or high-confidence predictions) rather than human annotation, used for training

base language model: The underlying LLM (e.g., Mixtral-8x7B-Instruct) used to generate data and then fine-tuned on that data