Evaluation Setup
Pre-training from scratch, continued pre-training, and fine-tuning scenarios
Benchmarks:
- BLiMP (syntactic acceptability via minimal pairs)
- SyntaxGym (syntactic generalization via surprisal constraints)
- WikiText-103 (language modeling, measured by perplexity)
- HANS (adversarial NLI)
Metrics:
- Perplexity (PPL)
- Accuracy (BLiMP)
- SG Score (SyntaxGym)
- Accuracy (HANS/MultiNLI)
- Statistical methodology: Not explicitly reported in the paper
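The two headline metrics have simple definitions: perplexity is the exponential of the mean per-token negative log-likelihood, and BLiMP accuracy is the fraction of minimal pairs where the model assigns higher log-probability to the grammatical sentence. A minimal sketch of both (the function names and the illustrative numbers are our own, not from the paper):

```python
import math

def perplexity(nll_per_token):
    """Perplexity = exp of the mean per-token negative log-likelihood."""
    return math.exp(sum(nll_per_token) / len(nll_per_token))

def blimp_accuracy(pairs):
    """Fraction of (grammatical, ungrammatical) sentence pairs where the
    model gives the grammatical sentence the higher log-probability."""
    correct = sum(1 for good_lp, bad_lp in pairs if good_lp > bad_lp)
    return correct / len(pairs)

# Illustrative values only.
ppl = perplexity([3.7, 3.8, 3.75])          # exp(3.75)
acc = blimp_accuracy([(-12.3, -15.1),       # grammatical wins
                      (-8.0, -7.5)])        # ungrammatical wins
```

SG Score (SyntaxGym) is computed differently, from surprisal inequalities over test-suite regions, and is not sketched here.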
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Question Formation (QF) | Accuracy | 42.5 | 100.0 | +57.5 |
| SyntaxGym | SG Score | 73.2 | 82.7 | +9.5 |
| WikiText-103 | Perplexity | 46.20 | 41.97 | -4.23 |
| HANS (Adversarial NLI) | Accuracy | 15.0 | 56.2 | +41.2 |
Main Takeaways
- TreeReg consistently improves syntactic generalization (BLiMP, SyntaxGym) across model scales and training regimes
- Regularizing for syntax improves out-of-distribution perplexity (WikiText-103), suggesting syntax is a robust feature for general language modeling
- The method is data-efficient: TreeReg LMs outperform standard LMs trained on 2x more data
- TreeReg mitigates catastrophic forgetting of syntax and reduces reliance on spurious heuristics (HANS) during fine-tuning