Towards Verifiable Text Generation with Symbolic References

📝 Paper Summary

Hallucination suppression
Factuality in data-to-text generation

SymGen prompts LLMs to generate text interleaved with symbolic references to input data fields, enabling precise provenance tracking and easier human verification without sacrificing fluency.

Core Problem

LLM outputs are vulnerable to hallucinations and require laborious human verification, especially in high-stakes applications involving structured data.

Why it matters:

Newspapers generating sports summaries and search engines grounding output in results need guarantees that text faithfully reflects the source data
Current approaches rely either on rigid, robotic templates or on fully neural generation that is fluent but prone to hallucinations
Manual verification of standard LLM output is time-consuming because users must hunt for the source of every claim

Concrete Example: When summarizing a basketball game, a standard LLM might write 'The Celtics won by 10 points.' If the actual margin was different, the user can only catch the error by manually checking the box score. SymGen instead generates 'The {{visitor.city}} won by {{win_margin}} points,' and clicking the rendered '10' links directly to the corresponding field in the source JSON.

Key Novelty

Symbolically Grounded Generation (SymGen)

Prompt the LLM to act as a template engine: instead of generating plain text, it generates text containing symbolic variables (e.g., {{ data.field }})
A separate parser renders these variables into the final values, ensuring that the displayed numbers/facts come directly from the trusted source data
The system can be prompted directly (generate symbolic references immediately) or indirectly (generate plain text first, then convert it to symbols); the indirect route is intended for generations that involve complex reasoning
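
The template-engine idea above can be sketched with a few lines of standard-library Python. This is a minimal illustration, not the paper's implementation (which is described as using a Jinja2 parser): the `render` and `resolve` helpers, the regular expression, and the sample field values are all assumptions made for this example.

```python
import re

# Matches {{ dotted.field.path }} with optional inner whitespace.
REFERENCE = re.compile(r"\{\{\s*([A-Za-z_][\w.]*)\s*\}\}")

def resolve(data, path):
    """Walk a dotted path such as 'visitor.city' through nested dicts."""
    value = data
    for key in path.split("."):
        value = value[key]  # raises KeyError if the field does not exist
    return value

def render(template, data):
    """Replace every symbolic reference with the value stored in `data`."""
    return REFERENCE.sub(lambda m: str(resolve(data, m.group(1))), template)

# Hypothetical box score; the field names follow the example in the text.
box_score = {"visitor": {"city": "Boston"}, "win_margin": 10}
template = "{{ visitor.city }} won by {{ win_margin }} points."
summary = render(template, box_score)
print(summary)
```

Because the displayed values are copied from `box_score` at render time, the model never gets the chance to write a wrong number in those positions.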

Architecture

Overview of the SymGen pipeline processing a basketball game summary

Evaluation Highlights

SymGen annotations reduce human verification time by ~20% compared to unannotated text on the Rotowire dataset
Annotators perceive the verification task as 14% easier when using SymGen annotations
Symbolic reference accuracy (linking text to correct data fields) is >99.5% for SymGen compared to <47% for a regex-based baseline

Breakthrough Assessment

7/10

Simple yet highly effective method for verifiability. While not a new model architecture, it cleverly bridges neural fluency with symbolic rigidity for high-stakes data-to-text tasks.

⚙️ Technical Details

Problem Definition

Setting: Conditional text generation from structured data (field-value tuples)

Inputs: Structured data d={(fi,vi)} (e.g., JSON) and instruction x

Outputs: Response y that is fluent and verifiable against d
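
To make the field-value view d={(fi,vi)} concrete, the sketch below flattens a nested JSON-like object into (field path, value) tuples. The dotted and bracketed path notation and the `flatten` helper are assumptions chosen for illustration; the summary does not specify how the paper names nested fields.

```python
def flatten(data, prefix=""):
    """Yield (field_path, value) tuples from nested dicts and lists."""
    if isinstance(data, dict):
        for key, value in data.items():
            yield from flatten(value, f"{prefix}.{key}" if prefix else str(key))
    elif isinstance(data, list):
        for index, value in enumerate(data):
            yield from flatten(value, f"{prefix}[{index}]")
    else:
        # Leaf value: emit one (f_i, v_i) pair.
        yield prefix, data

# Hypothetical input data.
game = {
    "home": {"city": "Miami", "points": 98},
    "visitor": {"city": "Boston", "points": 108},
}
pairs = list(flatten(game))
for field, value in pairs:
    print(field, "=", value)
```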

Pipeline Flow

Input Data Encoding (JSON) + Prompt
LLM Generation (producing text with {{ field }} tags)
Parser/Renderer (substitutes tags with values from Data)
Output (Text with visual tooltips/links to data)
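
The four stages above might be wired together roughly as follows. This is a sketch under stated assumptions: `fake_llm` is a hypothetical stub standing in for the GPT-3.5/GPT-4 call, the prompt wording is invented, and the provenance format (character offsets paired with a field path) is one possible way to drive tooltips or links, not the format used by the authors.

```python
import json
import re

REFERENCE = re.compile(r"\{\{\s*([A-Za-z_][\w.]*)\s*\}\}")

def build_prompt(data, instruction):
    """Stage 1: serialize the data deterministically and attach the instruction."""
    return f"{instruction}\n\nData:\n{json.dumps(data, indent=2, sort_keys=True)}"

def fake_llm(prompt):
    """Stage 2 stand-in (hypothetical): a real system would call an LLM here."""
    return "{{ visitor.city }} beat {{ home.city }} by {{ win_margin }} points."

def lookup(data, path):
    for key in path.split("."):
        data = data[key]
    return data

def render_with_provenance(template, data):
    """Stage 3: substitute references and record which field produced each span."""
    pieces, spans, cursor, length = [], [], 0, 0
    for match in REFERENCE.finditer(template):
        literal = template[cursor:match.start()]
        value = str(lookup(data, match.group(1)))
        start = length + len(literal)
        spans.append((start, start + len(value), match.group(1)))
        pieces += [literal, value]
        length = start + len(value)
        cursor = match.end()
    pieces.append(template[cursor:])
    return "".join(pieces), spans

data = {"visitor": {"city": "Boston"}, "home": {"city": "Miami"}, "win_margin": 10}
prompt = build_prompt(data, "Summarize the game using {{ field }} references.")
text, spans = render_with_provenance(fake_llm(prompt), data)

# Stage 4: a UI could attach a tooltip or link to each recorded span.
for start, end, field in spans:
    print(f"{text[start:end]!r} <- {field}")
```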

System Modules

Data Encoder

Serialize structured data (e.g., JSON) into a string format for the prompt

Model or implementation: Deterministic serialization

Generator

Generate text interleaved with symbolic references

Model or implementation: GPT-3.5 or GPT-4

Renderer

Substitute symbolic references with actual values and enable UI highlights

Model or implementation: Jinja2 parser

Novel Architectural Elements

Use of Jinja-like templating syntax as a target output language for LLMs to enforce grounding
Indirect SymGen strategy: Two-step pipeline (Draft → Templatize) to handle complex generation where direct symbolic output might degrade quality
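
A rough sketch of the indirect (Draft → Templatize) strategy is shown below. The prompt wording, the `indirect_symgen` function, and the `scripted_llm` stub are hypothetical and exist only to show the control flow of two consecutive model calls; the actual prompts used in the paper are not reproduced here.

```python
import json

def indirect_symgen(llm, data_json, instruction):
    """Two model calls: draft plain text, then rewrite the draft as a template."""
    # Step 1 (draft): ordinary generation, no symbolic syntax requested.
    draft = llm(f"{instruction}\nData: {data_json}")
    # Step 2 (templatize): ask for the same text with values turned into references.
    prompt = (
        "Rewrite the text so that each value copied from the data becomes a "
        "{{ field }} reference.\n"
        f"Data: {data_json}\nText: {draft}"
    )
    return llm(prompt)

calls = []

def scripted_llm(prompt):
    """Hypothetical stub that returns canned answers instead of calling a model."""
    calls.append(prompt)
    if "Text:" in prompt:
        return "{{ visitor.city }} won by {{ win_margin }} points."
    return "Boston won by 10 points."

data_json = json.dumps({"visitor": {"city": "Boston"}, "win_margin": 10})
template = indirect_symgen(scripted_llm, data_json, "Summarize the game.")
print(template)
print(len(calls), "model calls")
```

The second call receives both the data and the draft, which is why this strategy costs more tokens than direct generation.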

Modeling

Base Model: GPT-3.5 (4K/16K context) and GPT-4 (8K/32K context)

Compute: Not reported in the paper

Comparison to Prior Work

vs. Standard Prompting: SymGen outputs templates rather than final text, guaranteeing data values are exact copies from source
vs. Citation-enabled LLMs: SymGen provides fine-grained, token-level attribution to specific data fields rather than document-level citations
vs. Tool-augmented LLMs: SymGen uses the templating language as a 'tool' for output formatting rather than calling external APIs for new information
vs. TempLM [not cited in paper]: TempLM fine-tunes models to use templates; SymGen relies on zero/few-shot prompting without fine-tuning

Limitations

Cannot prevent hallucinations in the surrounding text (only ensures referenced values are correct)
Indirect strategy doubles token costs due to two-step generation process
Requires input to be structured data (JSON/tables), not unstructured text
Parsing errors can occur if the LLM hallucinates non-existent field names
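
One way to guard against the last limitation is to check every reference against the input data before rendering. The sketch below is an assumption about how such a check could look, not something described in the summary; `unknown_references` and `field_exists` are hypothetical helpers.

```python
import re

REFERENCE = re.compile(r"\{\{\s*([A-Za-z_][\w.]*)\s*\}\}")

def field_exists(data, path):
    """Return True if the dotted path resolves inside the nested dict."""
    for key in path.split("."):
        if not isinstance(data, dict) or key not in data:
            return False
        data = data[key]
    return True

def unknown_references(template, data):
    """List the referenced field paths that are missing from the data."""
    return [p for p in REFERENCE.findall(template) if not field_exists(data, p)]

data = {"visitor": {"city": "Boston"}, "win_margin": 10}
good = "{{ visitor.city }} won by {{ win_margin }} points."
bad = "{{ visitor.city }} won after {{ overtime_periods }} overtime periods."

print(unknown_references(good, data))
print(unknown_references(bad, data))
```

A caller could reject or regenerate any template for which the returned list is non-empty. Note that this only catches non-existent fields; a reference to the wrong but existing field, or an unsupported claim in the surrounding prose, would still pass.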

Reproducibility

Code: https://symgen.github.io

📊 Experiments & Results

Evaluation Setup

Data-to-text generation, Human verification study, and Math reasoning

Benchmarks:

SynthBio (Data-to-text: biography generation)
Rotowire (Data-to-text: basketball game summaries)
Counterfactual Obituaries (Factuality/Hallucination test) [New]
GSM8K / GSM-hard (Mathematical reasoning)

Metrics:

BLEU
ROUGE-L
BERTScore F1
Symbolic Reference Accuracy
Human Verification Time
Verification Accuracy
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Textual quality experiments show SymGen maintains fluency comparable to standard baselines.
SynthBio	BLEU	30.08	33.31	+3.23
Rotowire	BLEU	4.94	4.66	-0.28
Human evaluation demonstrates significant efficiency gains in verification.
Rotowire (Human Study)	Time Reduction	0	20	+20%
Rotowire (Human Study)	Perceived Ease	0	14	+14%
Symbolic accuracy validation confirms SymGen links are reliable compared to regex baselines.
Rotowire	Accuracy	46.10	99.52	+53.42
Math reasoning experiments show benefits on hard datasets.
GSM-hard	Accuracy	64.0	75.0	+11.0

Experiment Figures

Illustration of SymGen applied to math reasoning (GSM8K)

Main Takeaways

SymGen maintains text quality (BLEU/ROUGE) comparable to standard LLM generation while adding verifiability features
Indirect SymGen (generate then templatize) is required for complex tasks (Rotowire) or weaker models (GPT-3.5) to avoid quality degradation
Human verification is significantly faster (20%) and perceived as easier when provenance links are provided
SymGen is highly precise (>99%) in attributing data points, whereas simple post-hoc regex matching fails frequently (<47%) due to ambiguous values
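
The ambiguity mentioned in the last takeaway is easy to reproduce: when the same value occurs in several fields, matching rendered text back to the data after the fact cannot tell which field was meant. The snippet below illustrates that failure mode with invented numbers; it is not the regex baseline evaluated in the paper.

```python
from collections import defaultdict

def value_index(flat_fields):
    """Map each value (as text) to every field path that holds it."""
    index = defaultdict(list)
    for path, value in flat_fields.items():
        index[str(value)].append(path)
    return index

# Hypothetical box score in which two different fields share the value 42.
fields = {
    "home.points": 98,
    "visitor.points": 108,
    "home.rebounds": 42,
    "visitor.rebounds": 37,
    "visitor.field_goal_pct": 42,
}
index = value_index(fields)

for token in ["108", "42"]:
    candidates = index[token]
    status = "unique" if len(candidates) == 1 else "ambiguous"
    print(token, status, candidates)
```

A symbolic reference such as `{{ home.rebounds }}` sidesteps the problem because the field is named at generation time instead of being guessed afterwards.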

📚 Prerequisite Knowledge

Prerequisites

Familiarity with Large Language Models (LLMs) and prompting
Basic understanding of data-to-text generation tasks
Knowledge of templating languages (like Jinja2)

Key Terms

SymGen: Symbolically Grounded Generation—the proposed method of prompting LLMs to output symbolic references to data fields

Jinja: A templating language for Python; SymGen uses Jinja-like syntax (e.g., {{ variable }}) to embed data references

Hallucination: When an LLM generates text that is factually incorrect or not grounded in the source context

Data-to-text: The task of generating natural language descriptions from structured data (tables, JSON, etc.)

Provenance: The origin or source of a piece of information; SymGen provides provenance by linking text spans to data fields

Zero-shot/Few-shot: Prompting the model with no examples (zero-shot) or a few examples (few-shot) of the task

BLEU: Bilingual Evaluation Understudy—a metric for evaluating the quality of machine-generated text by comparing it to reference text

ROUGE: Recall-Oriented Understudy for Gisting Evaluation—a set of metrics used to evaluate automatic summarization and machine translation

BERTScore: An evaluation metric that computes a similarity score for each token in the candidate sentence with each token in the reference sentence using BERT embeddings

CoT: Chain-of-Thought—a prompting technique where the model generates intermediate reasoning steps

PAL: Program-Aided Language models—a method that offloads reasoning steps to a Python interpreter