Improving Symbolic Translation of Language Models for Logical Reasoning

📝 Paper Summary

Neurosymbolic AI, Deductive Reasoning, Autoformalization

The paper improves the reliability of smaller language models in logical reasoning by decomposing the translation process into predicate generation and formula construction, augmented by a verifier.

Core Problem

Small language models struggle to translate natural language into well-formatted First-Order Logic (FOL) because errors cascade under greedy decoding and because they lack the self-correction capabilities of larger models.

Why it matters:

Small models are more accessible and cost-effective than large proprietary models but currently lack the reliability needed for symbolic reasoning tasks
Existing self-correction methods rely on interpreting complex solver feedback, a capability small models lack
Single-step generation in small models often leads to repetitive loops or syntax errors that break external solvers

Concrete Example: When translating a statement under standard inference, a model might get stuck in a repetitive loop generating 'IsFavorite(x, y)' indefinitely due to a single token-level error cascading through the generation, as shown in the paper's Figure 1.

Key Novelty

Incremental Inference with Predicate Verification

Decomposes the NL-to-FOL translation into two distinct stages: first generating a list of predicates, then generating the full logic formulas
Introduces an intermediate verification step that corrects predicate arity (number of arguments) before the final translation, preventing syntax errors from propagating
Uses a tool-based synthetic data generation pipeline to create high-quality training data for smaller models, overcoming data scarcity

Architecture

Contrast between Standard Inference (prone to loops) and Incremental Inference

Breakthrough Assessment

6/10

A practical architectural improvement for neurosymbolic reasoning with small models. While not a fundamental paradigm shift, it addresses specific reliability bottlenecks (formatting, arity) that hinder the deployment of smaller LMs.

⚙️ Technical Details

Problem Definition

Setting: Translating a set of natural language premises and a conclusion into First-Order Logic (FOL) formulas

Inputs: Natural language premises NL_P and conclusion NL_C

Outputs: A set of predicates P and FOL formulas FOL(S); an external solver run on these formulas labels the conclusion True, False, or Uncertain

Pipeline Flow

Input Processing: Natural Language (NL) Premises + Conclusion
Stage 1: Predicate Generation (NL -> Predicate List)
Verification: Arity Check & Correction
Stage 2: FOL Translation (NL + Predicates -> Logic Formulas)
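The flow above can be sketched as a small orchestration function. This is a minimal illustration, not the paper's implementation: the three callables (generate_predicates, verify_predicates, translate_to_fol) and the toy stand-ins are hypothetical names chosen here so the sketch runs without a model.

```python
from typing import Callable, List


def incremental_inference(
    premises: List[str],
    conclusion: str,
    generate_predicates: Callable[[str], List[str]],
    verify_predicates: Callable[[List[str]], List[str]],
    translate_to_fol: Callable[[str, List[str]], List[str]],
) -> List[str]:
    """Run the two-stage flow: predicates -> verification -> FOL formulas."""
    nl_input = "\n".join(premises + [conclusion])
    predicates = generate_predicates(nl_input)    # Stage 1: NL -> predicate list
    verified = verify_predicates(predicates)      # Arity check and correction
    return translate_to_fol(nl_input, verified)   # Stage 2: NL + predicates -> FOL


# Toy stand-ins (hypothetical); a real system would call the fine-tuned LM.
def toy_generate(nl: str) -> List[str]:
    return ["Cat(x)", "Likes(x, y)"]


def toy_verify(preds: List[str]) -> List[str]:
    return preds


def toy_translate(nl: str, preds: List[str]) -> List[str]:
    return ["all x (Cat(x) -> Likes(x, fish))"]


formulas = incremental_inference(
    ["All cats like fish."],
    "Tom likes fish.",
    toy_generate,
    toy_verify,
    toy_translate,
)
```

Passing the stages in as callables keeps the verifier swappable, which matches the summary's point that a lightweight checker sits between the two generation passes.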

System Modules

Predicate Generator (Inference)

Extracts semantic relations from natural language inputs into a structured list of predicates

Model or implementation: Fine-tuned small LM (specific architecture not named in text)

Verifier

Checks the generated predicates for validity and consistency, specifically targeting arity errors

Model or implementation: Logic-based checker (Algorithm V)
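The summary does not spell out Algorithm V, so the following is only one plausible arity check, written under stated assumptions: predicates arrive as strings such as "Likes(x, y)", malformed entries are dropped, and when a name appears with conflicting arities the most frequent one wins. The function names and the majority-vote rule are assumptions made for illustration.

```python
import re
from collections import Counter, defaultdict
from typing import Dict, List, Optional, Tuple

# Matches "Name(arg, arg, ...)" with no nested parentheses.
PRED_RE = re.compile(r"^\s*([A-Za-z_]\w*)\s*\(([^()]*)\)\s*$")


def parse_predicate(text: str) -> Optional[Tuple[str, int]]:
    """Return (name, arity) or None if the string is not a flat predicate."""
    match = PRED_RE.match(text)
    if match is None:
        return None
    args = [a.strip() for a in match.group(2).split(",") if a.strip()]
    return match.group(1), len(args)


def verify_arity(predicates: List[str]) -> List[str]:
    """Keep one declaration per predicate name, using its most common arity."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    order: List[str] = []
    for pred in predicates:
        parsed = parse_predicate(pred)
        if parsed is None:
            continue  # assumption: malformed entries are discarded
        name, arity = parsed
        if name not in counts:
            order.append(name)
        counts[name][arity] += 1

    variables = "xyzuvw"
    cleaned = []
    for name in order:
        arity = counts[name].most_common(1)[0][0]
        args = ", ".join(
            variables[i] if i < len(variables) else f"x{i}" for i in range(arity)
        )
        cleaned.append(f"{name}({args})")
    return cleaned


checked = verify_arity(["Likes(x, y)", "Likes(x)", "Likes(a, b)", "Cat(x)", "broken("])
```

Here "Likes" is seen twice with two arguments and once with one, so the two-argument form is kept, and the unparseable "broken(" entry is dropped.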

FOL Translator (Inference)

Translates the natural language into full First-Order Logic formulas, conditioned on the verified predicates

Model or implementation: Fine-tuned small LM (same model as the generator, different prompt context)

Novel Architectural Elements

Two-stage inference topology in which generation is explicitly halted after the predicate list so that intermediate verification can run
Injection of verified symbolic state (predicates) back into the context window for the second stage of generation
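One way to picture the halt-and-inject step is as plain string handling around the model call. The section markers ("Predicates:", "FOL:") and both helper names below are hypothetical; the summary does not give the actual prompt format.

```python
from typing import List

STOP_MARKER = "FOL:"  # hypothetical marker where stage-1 output is cut off


def cut_at_stop(generated: str, stop: str = STOP_MARKER) -> str:
    """Truncate stage-1 text at the stop marker, if the model ran past it."""
    idx = generated.find(stop)
    return generated if idx == -1 else generated[:idx]


def build_stage2_context(nl_text: str, verified_predicates: List[str]) -> str:
    """Rebuild the context so stage 2 continues from the verified predicates."""
    pred_block = "\n".join(verified_predicates)
    return f"{nl_text}\nPredicates:\n{pred_block}\n{STOP_MARKER}\n"


stage1_text = cut_at_stop("Predicates:\nCat(x)\nFOL:\nall x ...")
stage2_context = build_stage2_context("All cats like fish.", ["Cat(x)", "Likes(x, y)"])
```

The second-stage context ends at the marker, so the model's next tokens are the formulas, conditioned on predicates that have already passed the arity check.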

Modeling

Base Model: Not reported in the paper

Training Method: Supervised Fine-Tuning (SFT)

Objective Functions:

Purpose: Learn structured generation pattern.

Formally: Minimize negative log-likelihood of the concatenated sequence X = S || P || FOL(S)
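
Written out under the usual autoregressive factorization (an assumption here; the summary states only the concatenation), the objective would take the following form, where x_t is the t-th token of X and θ denotes the model parameters. Whether the loss is masked over the input tokens S is not stated.

```latex
\mathcal{L}(\theta) = -\sum_{t=1}^{|X|} \log p_\theta\left(x_t \mid x_{<t}\right),
\qquad X = S \,\Vert\, P \,\Vert\, \mathrm{FOL}(S)
```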

Training Data:

Synthesized data based on ProofWriter using a large LM
Filtered via multi-stage pipeline: format correction, Prover9 syntax validation, semantic filtering
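A staged filter of this kind can be sketched as a chain of functions that each either pass a record on or reject it. Everything below is illustrative: the record fields ("fol", "solver_label", "gold_label") are hypothetical, the parenthesis check is only a stand-in for real Prover9 syntax validation, and comparing a solver label with the gold label is an assumed reading of "semantic filtering".

```python
from typing import Any, Callable, Dict, List, Optional

Record = Dict[str, Any]
Stage = Callable[[Record], Optional[Record]]


def fix_format(rec: Record) -> Optional[Record]:
    """Format correction: strip whitespace and drop empty formulas."""
    formulas = [f.strip() for f in rec["fol"] if f.strip()]
    return {**rec, "fol": formulas} if formulas else None


def balanced_parens(rec: Record) -> Optional[Record]:
    """Stand-in for syntax validation (a real pipeline would call Prover9)."""
    for formula in rec["fol"]:
        depth = 0
        for ch in formula:
            if ch == "(":
                depth += 1
            elif ch == ")":
                depth -= 1
                if depth < 0:
                    return None
        if depth != 0:
            return None
    return rec


def label_matches(rec: Record) -> Optional[Record]:
    """Assumed semantic filter: keep records whose solver label equals the gold label."""
    return rec if rec.get("solver_label") == rec.get("gold_label") else None


def run_pipeline(records: List[Record], stages: List[Stage]) -> List[Record]:
    kept = []
    for rec in records:
        current: Optional[Record] = rec
        for stage in stages:
            current = stage(current)
            if current is None:
                break  # rejected by this stage
        if current is not None:
            kept.append(current)
    return kept


samples = [
    {"fol": [" Cat(tom) "], "solver_label": "True", "gold_label": "True"},
    {"fol": ["Cat(tom"], "solver_label": "True", "gold_label": "True"},
    {"fol": ["Cat(tom)"], "solver_label": "False", "gold_label": "True"},
]
kept = run_pipeline(samples, [fix_format, balanced_parens, label_matches])
```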

Compute: Not reported in the paper

Reproducibility

Code: https://github.com/RamyaKeerthy/ProofFOL.git

The text describes the data synthesis pipeline using ProofWriter and Prover9. Specific base model names and training hyperparameters are not reported in the source text.

📊 Experiments & Results

Evaluation Setup

Translation of natural language reasoning problems into Prover9-formatted FOL, followed by execution

Benchmarks:

ProofWriter: logical reasoning (synthetic)
FOLIO: first-order logic reasoning (human-annotated)

Metrics:

Coverage (Recall of predicates in final FOL)
Usage (Precision of generated predicates)
Validity (Adherence to expected output format)
Statistical methodology: Not explicitly reported in the paper
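
The two predicate metrics can be illustrated with set arithmetic. The exact definitions are not given in the summary, so this sketch assumes one reading: Coverage is the share of predicates appearing in the final FOL that were in the generated list (recall), and Usage is the share of generated predicates that appear in the final FOL (precision). The regex-based name extraction is a simplification that would also count function symbols as predicates.

```python
import re
from typing import Iterable, Set, Tuple

# An identifier immediately followed by "(" is treated as a predicate name.
NAME_BEFORE_PAREN = re.compile(r"([A-Za-z_]\w*)\(")


def predicate_names(texts: Iterable[str]) -> Set[str]:
    names: Set[str] = set()
    for text in texts:
        names.update(NAME_BEFORE_PAREN.findall(text))
    return names


def coverage_and_usage(
    generated: Iterable[str], formulas: Iterable[str]
) -> Tuple[float, float]:
    gen = predicate_names(generated)
    used = predicate_names(formulas)
    shared = gen & used
    coverage = len(shared) / len(used) if used else 0.0  # assumed: recall
    usage = len(shared) / len(gen) if gen else 0.0       # assumed: precision
    return coverage, usage


coverage, usage = coverage_and_usage(
    ["Cat(x)", "Likes(x, y)", "Dog(x)", "Bird(x)"],
    ["all x (Cat(x) -> Likes(x, fish))", "Happy(tom)"],
)
```

In this toy case two of the three predicates in the formulas were generated up front (coverage 2/3), and two of the four generated predicates are actually used (usage 1/2).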

Main Takeaways

Incremental inference substantially improves predicate coverage compared to standard inference, ensuring better semantic consistency.
Decomposing the task into predicate generation and FOL translation reduces error propagation, specifically preventing repetitive loops (e.g., infinite predicate generation).
The approach enables the integration of lightweight verifiers to fix specific syntactic errors (like arity mismatches) that smaller models struggle to self-correct.
Quantitative results (exact error rates and accuracy scores) are not reported in the source text.

📚 Prerequisite Knowledge

Prerequisites

First-Order Logic (syntax, predicates, quantifiers)
Language Model Fine-tuning (SFT)
Greedy decoding vs. constrained generation

Key Terms

FOL: First-Order Logic—a formal system using variables, quantifiers (for all, there exists), and predicates to express statements

SFT: Supervised Fine-Tuning—training a pre-trained model on a specific dataset to adapt it to a particular task

Predicate: A function in logic that returns true or false, expressing a property of an object or a relationship between objects (e.g., IsFavorite(x, y))

Arity: The number of arguments a predicate takes (e.g., 'IsRed(x)' has arity 1, 'Loves(x, y)' has arity 2)

Prover9: An automated theorem prover for first-order and equational logic, used here as an external solver and syntax validator

Incremental Inference: The paper's proposed method of generating the predicate list first, pausing, and then generating the full logical formulas in a second pass

Greedy decoding: A generation strategy where the model selects the most likely next token at each step, which can lead to repetitive loops if not controlled
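
A toy example shows how greedy decoding can lock into a loop. The next-token table below is invented for illustration and does not come from any real model: once the most likely continuation of a token is the token itself, always taking the argmax repeats it until the step limit.

```python
from typing import Dict, List

# Hypothetical next-token probabilities, keyed by the previous token.
NEXT: Dict[str, Dict[str, float]] = {
    "<s>": {"IsFavorite(x, y)": 0.6, "Likes(x, y)": 0.4},
    "IsFavorite(x, y)": {"IsFavorite(x, y)": 0.55, "</s>": 0.45},
    "Likes(x, y)": {"</s>": 1.0},
}


def greedy_decode(start: str, max_steps: int) -> List[str]:
    """Pick the single most likely next token at every step."""
    output: List[str] = []
    token = start
    for _ in range(max_steps):
        candidates = NEXT[token]
        token = max(candidates, key=candidates.get)
        if token == "</s>":
            break
        output.append(token)
    return output


looped = greedy_decode("<s>", 5)
```

With this table the decoder never reaches the end token: every step emits 'IsFavorite(x, y)' again, and only the max_steps cap stops it.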