DisCIPL: Self-steering language models

📝 Paper Summary

Inference-time compute scaling, Constrained generation, Probabilistic programming

DISCIPL enables a Planner LM to write task-specific probabilistic programs that steer a population of smaller Follower LMs through efficient parallel inference to solve complex constrained generation tasks.

Core Problem

Language models struggle to perform complex reasoning or constrained generation using standard autoregressive decoding, and existing search methods (like Tree of Thoughts) require manual engineering for each new task.

Why it matters:

Standard LMs often fail simple constraints (e.g., specific word counts or positioning) even when they 'know' the rules abstractly.
Chain-of-Thought reasoning is serial, slow, and expensive, while hand-engineered search algorithms lack flexibility.
Effective inference scaling requires bridging the gap between flexible natural language planning and rigorous, verifiable search procedures.

Concrete Example: When asked to write a sentence with exactly 18 words where the 4th is 'Glasgow', a standard LM might write fluent text that violates the word count. DISCIPL writes a program that explicitly places 'Glasgow' as the 4th word (zero-based index 3) and tracks the word count programmatically.
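
The kind of bookkeeping described in this example can be sketched in a few lines of Python. This is a minimal illustration, not the program the Planner actually writes: the function names `extend`, `check_prefix`, and `check_complete`, and the `propose_word` callback standing in for the Follower LM, are all hypothetical.

```python
def extend(words, propose_word, pos=3, required="Glasgow"):
    """Append one word: the required word at the fixed position,
    otherwise whatever the (hypothetical) proposer returns."""
    if len(words) == pos:
        return words + [required]
    return words + [propose_word(words)]


def check_prefix(words, target_len=18, pos=3, required="Glasgow"):
    """True if a partial sentence can still satisfy the constraint."""
    if len(words) > target_len:
        return False
    if len(words) > pos and words[pos] != required:
        return False
    return True


def check_complete(words, target_len=18, pos=3, required="Glasgow"):
    """True if a finished sentence has the exact length and the required word in place."""
    return len(words) == target_len and words[pos] == required
```

Because the position and count are tracked in code, a violation is caught on the partial sentence rather than discovered after the full sentence has been generated.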

Key Novelty

DISCIPL (Distributional Constraints by Inference Programming with Language Models)

Separates reasoning into a 'Planner' (writes code) and a 'Follower' (executes code), allowing the model to design its own inference algorithm on the fly.
Uses probabilistic programming to treat the Follower LM as a likelihood function, so that constraints (e.g., token masking or hard validity checks) are enforced in code rather than left to the model's attention to the prompt.
Orchestrates parallel search (Sequential Monte Carlo) in which the program discards low-weight partial drafts and duplicates high-weight ones at each resampling step.
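
The resampling idea in the last point can be illustrated with plain multinomial resampling. This is a sketch under assumptions: the source does not say which resampling scheme the inference engine uses, and the function name `resample` is hypothetical.

```python
import math
import random


def resample(particles, log_weights, rng):
    """Multinomial resampling: draw len(particles) survivors with
    probability proportional to their weights."""
    top = max(log_weights)
    if top == float("-inf"):
        raise ValueError("all particles have zero weight")
    # Subtract the maximum before exponentiating for numerical stability.
    weights = [math.exp(lw - top) for lw in log_weights]
    survivors = rng.choices(particles, weights=weights, k=len(particles))
    # After resampling, every survivor carries the same weight (log-weight 0.0 here).
    return survivors, [0.0] * len(survivors)
```

A draft with log-weight of negative infinity (a violated hard constraint) is never selected, so its slot is reassigned to a copy of a surviving draft.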

Architecture

Conceptual comparison of DISCIPL against standard serial (CoT) and parallel (Guess-and-Check) inference methods.

Evaluation Highlights

On constrained paragraph generation, DISCIPL with a small Llama-3.2-1B model matches the performance of GPT-4o.
On sentence-level constraints, DISCIPL enables Llama-3.2-1B (0.76 Pass@1) to far outperform its base capabilities (0.07 Pass@1) and approach reasoning models like o1 (0.96 Pass@1).
SMC (Sequential Monte Carlo) inference consistently yields higher coherency than standard sampling at comparable validity rates by filtering out disfluent partial generations.

Breakthrough Assessment

8/10

A significant step for test-time compute scaling. It bridges code generation and probabilistic inference, letting small models perform far above their size on constrained tasks without fine-tuning.

⚙️ Technical Details

Problem Definition

Setting: Constrained text generation where a task description d_task requires an output x satisfying constraints, solvable via a probabilistic program π.

Inputs: Natural language task description d_task (e.g., 'Write a poem with X constraint')

Outputs: Result string x (answer) or error ε

Pipeline Flow

Planner LM (receives task -> generates probabilistic program π)
Inference Engine (executes π using Follower LM as backend)
Step Function (inside π: extends candidates, computes weights, checks constraints)
Resampler (inside Engine: culls bad candidates, duplicates good ones)
Output Selection (returns best candidate)
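
The step, resample, and select stages above can be sketched end to end on a toy problem. Everything here is an illustrative assumption rather than the paper's code: the Follower is replaced by a uniform distribution over a five-word vocabulary, the task is a made-up one (five words, 'Glasgow' at index 2 and nowhere else), and the names `follower_probs`, `step`, and `run_smc` are hypothetical.

```python
import math
import random

VOCAB = ["the", "rain", "in", "Glasgow", "falls"]  # toy vocabulary (assumption)


def follower_probs(prefix):
    """Hypothetical stand-in for the Follower LM: a uniform next-word distribution."""
    return {w: 1.0 / len(VOCAB) for w in VOCAB}


def step(particle, rng, pos=2, required="Glasgow"):
    """Step function: extend one candidate, return (candidate, log-weight increment)."""
    probs = follower_probs(particle)
    if len(particle) == pos:
        # Force the required word; weight by the probability the follower gave it.
        return particle + [required], math.log(probs[required])
    words = list(probs)
    word = rng.choices(words, weights=[probs[w] for w in words], k=1)[0]
    # Toy hard constraint: the required word may not appear anywhere else.
    increment = float("-inf") if word == required else 0.0
    return particle + [word], increment


def run_smc(n_particles=32, length=5, seed=0):
    """Inference engine: step, weight, resample, then select an output.
    Returns None if every candidate violates a constraint."""
    rng = random.Random(seed)
    particles = [[] for _ in range(n_particles)]
    log_w = [0.0] * n_particles
    for t in range(length):
        stepped = [step(p, rng) for p in particles]
        particles = [p for p, _ in stepped]
        log_w = [lw + inc for lw, (_, inc) in zip(log_w, stepped)]
        top = max(log_w)
        if top == float("-inf"):
            return None
        if t < length - 1:
            # Resampler: cull zero-weight candidates, duplicate the rest.
            weights = [math.exp(lw - top) for lw in log_w]
            particles = rng.choices(particles, weights=weights, k=n_particles)
            log_w = [0.0] * n_particles
    # Output selection: return a highest-weight candidate.
    best = max(range(n_particles), key=lambda i: log_w[i])
    return particles[best]
```

Calling `run_smc()` returns either a five-word list that satisfies the toy constraint or `None`, mirroring the "result string or error" output described under Problem Definition.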

System Modules

Planner

Translates natural language task into an executable Python inference program

Model or implementation: gpt-4o-2024-08-06

Inference Engine (Execution)

Orchestrates the execution of the program π, managing the population of particles

Model or implementation: Algorithmic Runtime (LLaMPPL)

Follower (Execution)

Generates tokens and evaluates probabilities when called by the program

Model or implementation: Llama-3.2-1B-Instruct (primary)

Novel Architectural Elements

Meta-reasoning architecture where an LM writes its own inference algorithm code rather than just a plan of text steps
Separation of 'Planner' (logic/code) and 'Follower' (probability/generation) roles
Integration of LM generation into a formal SMC loop via probabilistic programming constructs (observe, sample, factor)

Modeling

Base Model: Follower: Llama-3.2-1B-Instruct; Planner: GPT-4o

Training Method: Inference-time method only (no training)

Compute: Inference requires N parallel calls to Follower LM. Experiments use N=32 particles. Planner is queried once (plus retries on syntax error).

Comparison to Prior Work

vs. CoT: Uses code to enforce structure and SMC for parallel search instead of a single serial text stream
vs. ToT: The search structure is defined dynamically by generated code, not a fixed hard-coded tree search algorithm
vs. Best-of-N: Performs intermediate resampling to prune bad paths early, improving efficiency
vs. Manual SMC [Lew et al. 2023]: Automates the creation of the probabilistic program via the Planner LM, removing the need for human engineering

Limitations

Depends on the Planner capability; if the Planner writes buggy code, inference fails.
Strict constraints in the proposal distribution can sometimes hurt coherency if not balanced by a prior.
Requires an execution environment (Python sandbox) and an inference engine that supports detailed logit access/masking.
Rejection sampling baselines produce very few valid generations compared to SMC.

Reproducibility

Code: https://github.com/gabegrand/self-steering

Planner prompts and task datasets (COLLIE, PUZZLES) are described. Follower models are standard open-weight models (Llama-3, Qwen3).

📊 Experiments & Results

Evaluation Setup

Constrained text generation tasks requiring strict adherence to complex rules.

Benchmarks:

COLLIE-v1 (Constrained Generation (Sentence & Paragraph))
PUZZLES (Complex Logic/Reasoning (Poetry, Grant writing, Budgeting)) [New]

Metrics:

Pass@1 (Expected validity)
Coherency (LLM-as-a-judge 1-10 scale)
Statistical methodology: Unbiased Pass@k estimator using importance weights
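
One way importance weights can enter a Pass@1 ("expected validity") estimate is as a weighted average of per-sample validity. The sketch below is an assumption about the general shape of such an estimate, not the paper's exact estimator: it uses self-normalized weights, which are consistent but not strictly unbiased for finite samples, and the function name `weighted_pass_at_1` is hypothetical.

```python
import math


def weighted_pass_at_1(log_weights, valid):
    """Self-normalized, importance-weighted fraction of valid samples."""
    top = max(log_weights)
    if top == float("-inf"):
        return 0.0  # no sample carries any weight
    # Subtract the maximum before exponentiating for numerical stability.
    weights = [math.exp(lw - top) for lw in log_weights]
    total = sum(weights)
    return sum(w for w, ok in zip(weights, valid) if ok) / total
```

With equal weights this reduces to the plain fraction of valid samples; with unequal weights, a valid sample counts in proportion to its weight.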

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
COLLIE Sentence	Pass@1	0.07	0.76	+0.69
COLLIE Sentence	Pass@1	0.32	0.76	+0.44
COLLIE Paragraph	Pass@1	0.38	0.69	+0.31
PUZZLES	Pass@1	0.08	0.42	+0.34
COLLIE Sentence	Pass@1	0.93	0.93	0.00

Experiment Figures

Pass@1 performance on COLLIE sentences as a function of sample budget (N) and per-task breakdown.

Scaling trends for Follower model size (1B, 3B, 8B) and family (Llama vs Qwen).

Main Takeaways

Separating planning (code) from execution (sampling) allows small models to outperform much larger models on constrained tasks.
Sequential Monte Carlo (SMC) is more effective than Rejection Sampling or Importance Sampling because it actively reallocates compute to promising partial solutions.
The method scales effectively with Follower model size: better Followers yield better DISCIPL performance without changes to the Planner.
Autogenerated inference programs perform nearly as well as expert-written programs (DISCIPL*), showing the Planner is capable of designing valid algorithms.

📚 Prerequisite Knowledge

Prerequisites

Probabilistic Programming Languages (PPLs)
Sequential Monte Carlo (SMC) / Particle Filters
Autoregressive language modeling

Key Terms

SMC: Sequential Monte Carlo—a statistical method that maintains a population of 'particles' (candidate solutions), extending them step-by-step and resampling to focus on the most promising ones

Pass@1: The probability that a single model generation (or the top-ranked generation) solves the task correctly

Importance Sampling: A technique to estimate properties of a target distribution by sampling from a different proposal distribution and weighting samples by the ratio of their probabilities

Planner: A capable LM (e.g., GPT-4o) that writes the inference code/program

Follower: A smaller or equal-sized LM (e.g., Llama-3) that executes the inference program by generating tokens or providing probabilities

PPL: Probabilistic Programming Language—a programming framework that allows defining probabilistic models and performing inference on them automatically

Particle: In SMC, a single candidate generation sequence being evolved in parallel with others

LLaMPPL: The specific probabilistic programming library used in this paper, allowing LMs to be used as distributions within Python programs

Coherency: A measure of how fluent and natural the generated text reads, evaluated here by an LLM-as-a-judge
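
The Importance Sampling entry above can be made concrete with a small discrete example. This is a generic textbook sketch, not code from the paper; the function name `importance_estimate` and the dictionaries `p` (target) and `q` (proposal) are illustrative.

```python
def importance_estimate(samples, p, q, f):
    """Estimate the expectation of f under target p, using samples drawn
    from proposal q, by weighting each sample with p(x) / q(x)."""
    weighted = [p[x] / q[x] * f(x) for x in samples]
    return sum(weighted) / len(weighted)
```

For instance, with target `p = {"a": 0.9, "b": 0.1}`, proposal `q = {"a": 0.5, "b": 0.5}`, and samples `["a", "b"]`, the sample "a" gets weight 1.8 and "b" gets weight 0.2, which corrects for the proposal under-sampling "a" relative to the target.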