FG-PRM: Fine-grained Hallucination Detection and Mitigation in Language Model Mathematical Reasoning

📝 Paper Summary

Hallucination detection, Hallucination mitigation, Process Reward Models (PRMs)

FG-PRM improves mathematical reasoning by categorizing hallucinations into six distinct types and training specialized reward models on automatically synthesized data to detect these errors at the step level.

Core Problem

Existing methods for mitigating hallucinations in reasoning chains primarily detect their presence or absence in a coarse-grained manner, lacking nuanced understanding of specific error types (e.g., calculation vs. logic errors).

Why it matters:

Complex multi-step reasoning requires pinpointing exactly where and why a model fails, not just rejecting the final answer.
Manual annotation for fine-grained step-level rewards is prohibitively expensive and labor-intensive.
Coarse-grained feedback (Outcome Reward Models) often fails to identify intermediate errors that lead to wrong solutions.

Concrete Example: In a multi-step math problem, a model might reason correctly but make a simple calculation mistake in Step 3. A standard Process Reward Model would only label Step 3 as 'wrong', whereas FG-PRM identifies it specifically as a 'Calculation Error'. The typed label supports more targeted mitigation and is easier to interpret than a generic binary label.

Key Novelty

Fine-Grained Process Reward Model (FG-PRM) via Automated Taxonomy Injection

Defines a comprehensive taxonomy of six specific hallucination types (e.g., Context Inconsistency, Fabrication) tailored for mathematical reasoning.
Generates training data automatically by prompting a strong LLM to inject specific hallucination types into correct reasoning steps based on strict preconditions.
Trains separate reward heads/models for each hallucination type to provide detailed, multi-dimensional feedback on reasoning steps.

Architecture

The Automated Hallucination Annotation Framework. It depicts the pipeline from Golden CoT -> Vulnerability Analysis -> Hallucination Injection -> Synthetic Dataset.

Evaluation Highlights

+5% F1 on average over ChatGPT-3.5 and Claude-3 in fine-grained hallucination detection tasks.
+3% improvement over standard Process Reward Models (PRMs) in verification tasks on GSM8K and MATH benchmarks.
Outperforms numerous verifiers trained on human-labeled or coarse-grained data using purely synthetic fine-grained supervision.

Breakthrough Assessment

7/10

Strong contribution in automated data synthesis for fine-grained supervision, addressing the data bottleneck for PRMs. Significant performance gains on standard math benchmarks, though the approach is specific to reasoning tasks.

⚙️ Technical Details

Problem Definition

Setting: Step-by-step mathematical reasoning and verification

Inputs: A math question q and a sequence of reasoning steps y = {y_1, ..., y_L}

Outputs: A score or probability indicating the validity of the solution, specifically identifying if and what type of hallucination exists at each step.
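The inputs and outputs above can be pictured as two small records. This is only an illustrative sketch: the class names, the field names, and the 0.5 threshold are assumptions made for this summary, not identifiers from the paper or its repository.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Solution:
    """A math question q with its reasoning steps y_1, ..., y_L."""
    question: str
    steps: List[str]


@dataclass
class StepVerdict:
    """Per-step output: one hallucination probability per error type."""
    step_index: int
    type_probs: Dict[str, float]

    def flagged(self, threshold: float = 0.5) -> List[str]:
        # Threshold is an illustrative choice, not a value from the paper.
        return sorted(t for t, p in self.type_probs.items() if p >= threshold)


verdict = StepVerdict(2, {"Fabrication": 0.1, "Context Inconsistency": 0.8})
print(verdict.flagged())
```

A verifier built this way reports both whether a step is suspicious and which type of hallucination is suspected.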

Pipeline Flow

Hallucination Injection Analysis (Identify vulnerable steps)
Automated Data Generation (Inject specific errors via Llama-3-70B)
FG-PRM Training (Train classifiers for each error type)
Inference/Verification (Score and rank candidate solutions)

System Modules

Hallucination Injector

Synthesize negative examples by rewriting correct steps with specific errors

Model or implementation: Llama-3-70B
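A rough sketch of how one injection example might be assembled is shown here. The prompt wording, the function names, and the labeling convention (1 for the corrupted step, 0 elsewhere) are assumptions for illustration; the actual prompts are described in the paper's Appendix, and no LLM is called in this snippet.

```python
from typing import List, Tuple


def build_injection_prompt(question: str, steps: List[str], target_index: int,
                           hallucination_type: str, definition: str) -> str:
    """Build a hypothetical prompt asking a teacher LLM to corrupt one step."""
    if not 0 <= target_index < len(steps):
        raise IndexError("target_index is outside the reasoning chain")
    numbered = "\n".join(f"Step {i + 1}: {s}" for i, s in enumerate(steps))
    return (
        f"Question: {question}\n"
        f"Correct solution:\n{numbered}\n\n"
        f"Rewrite Step {target_index + 1} so that it contains a "
        f"'{hallucination_type}' hallucination ({definition}). "
        "Return only the rewritten step."
    )


def make_labeled_chain(steps: List[str], target_index: int,
                       corrupted_step: str) -> Tuple[List[str], List[int]]:
    """Swap in the corrupted step and label it 1; all other steps get 0."""
    new_steps = list(steps)
    new_steps[target_index] = corrupted_step
    labels = [1 if i == target_index else 0 for i in range(len(steps))]
    return new_steps, labels
```

In practice the teacher model's reply would be passed as `corrupted_step`, yielding one synthetic negative example for the chosen hallucination type.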

FG-PRM (Classifier) (Verification)

Predict probability of specific hallucination types for a given step

Model or implementation: LongFormer-base-4096 or Llama-3-8B (modified with classification heads)

Aggregator (Verification)

Combine step-level probabilities into a final solution score

Model or implementation: Algorithmic (Log-sum)
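One plausible reading of the log-sum aggregation is sketched below. It assumes each head outputs the probability that a step contains its hallucination type, and that a solution's score is the sum of log(1 - p) over all steps and types; the exact formula used in the paper may differ, and `solution_score` is a name invented for this sketch.

```python
import math
from typing import Dict, List


def solution_score(step_probs: List[Dict[str, float]], eps: float = 1e-12) -> float:
    """Sum log-probabilities that every step is free of every error type.

    step_probs[i][t] is the assumed probability that step i contains
    hallucination type t. Higher (closer to 0) scores mean cleaner solutions.
    """
    total = 0.0
    for probs in step_probs:
        for p in probs.values():
            # eps guards against log(0) when a head is fully confident.
            total += math.log(max(1.0 - p, eps))
    return total


clean = [{"Fabrication": 0.1}, {"Fabrication": 0.1}]
faulty = [{"Fabrication": 0.1}, {"Fabrication": 0.9}]
print(solution_score(clean) > solution_score(faulty))
```

Summing logs rather than multiplying raw probabilities keeps the score numerically stable for long reasoning chains.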

Novel Architectural Elements

Multi-head PRM architecture where specific heads R_phi_t are trained for distinct hallucination types (Context Inconsistency, Logical Inconsistency, etc.) rather than a single 'correct/incorrect' head
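The multi-head idea can be illustrated with a toy scorer in plain Python. Here a fixed feature vector stands in for the encoder output (LongFormer or Llama-3-8B in the paper), and each hallucination type gets its own linear head followed by a sigmoid. The class name, weights, and feature vector are all hypothetical.

```python
import math
from typing import Dict, List


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


class MultiHeadStepScorer:
    """Toy stand-in: one linear + sigmoid head per hallucination type."""

    def __init__(self, weights: Dict[str, List[float]], biases: Dict[str, float]):
        self.weights = weights
        self.biases = biases

    def score(self, features: List[float]) -> Dict[str, float]:
        # Each head reads the same step features but has its own parameters.
        return {
            name: sigmoid(sum(w * f for w, f in zip(ws, features)) + self.biases[name])
            for name, ws in self.weights.items()
        }


scorer = MultiHeadStepScorer(
    weights={"Context Inconsistency": [2.0, 0.0], "Logical Inconsistency": [0.0, -2.0]},
    biases={"Context Inconsistency": 0.0, "Logical Inconsistency": 0.0},
)
print(scorer.score([1.0, 1.0]))
```

A single binary head would collapse these outputs into one number, which is the coarse-grained signal the paper argues against.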

Modeling

Base Model: LongFormer-base-4096 and Llama-3-8B

Training Method: Supervised training of Reward Models (Classifiers)

Objective Functions:

Purpose: Minimize classification error for step-level hallucination detection.

Formally: Cross-entropy loss summed over all L reasoning steps and all hallucination types.
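Written out, one form this objective could take is the binary cross-entropy below. The notation is an assumption made for this summary: T is the number of hallucination types, z_{i,t} is 1 if step y_i contains a type-t hallucination and 0 otherwise, and p_{i,t} is the probability assigned by head R_{\phi_t}. The paper's exact formulation may differ.

```latex
\mathcal{L}
  = -\sum_{t=1}^{T} \sum_{i=1}^{L}
    \Bigl[ z_{i,t} \log p_{i,t} + (1 - z_{i,t}) \log\bigl(1 - p_{i,t}\bigr) \Bigr],
\qquad
p_{i,t} = R_{\phi_t}\bigl(q,\, y_1, \ldots, y_i\bigr)
```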

Training Data:

Source: GSM8K and MATH training sets
Augmentation: 12,000 instances per dataset generated via automated hallucination injection
Math-Shepherd: 12,000 instances sampled

Key Hyperparameters:

inference_samples: 64 (Best-of-N selection)
data_split: 95:5 (Training:Validation)
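The two hyperparameters above correspond to simple procedures, sketched here with hypothetical helper names. `best_of_n` picks the highest-scoring candidate from a pool (64 candidates in the paper's setup), and `train_val_split` performs a shuffled 95:5 split; the seed and the rounding rule are illustrative choices.

```python
import random
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def best_of_n(candidates: Sequence[T], score_fn: Callable[[T], float]) -> T:
    """Return the candidate with the highest verifier score."""
    return max(candidates, key=score_fn)


def train_val_split(items: Sequence[T], train_fraction: float = 0.95,
                    seed: int = 0) -> Tuple[List[T], List[T]]:
    """Shuffle deterministically, then cut into training and validation parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


train, val = train_val_split(range(12000))
print(len(train), len(val))
```

In the verification task, `score_fn` would be the reward model's aggregated score for a candidate solution.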

Compute: Not reported in the paper

Comparison to Prior Work

vs. PRM (Lightman et al., 2023): FG-PRM classifies the type of each error rather than just binary correctness, enabling more targeted mitigation.
vs. Math-Shepherd: FG-PRM uses targeted hallucination injection based on a taxonomy, whereas Math-Shepherd labels steps automatically with Monte Carlo rollouts, judging a step by whether completions sampled from it reach the correct final answer.
vs. GPT-4/Claude Prompts: FG-PRM is a trained model that is cheaper and faster at inference than prompting large proprietary models for verification.

Limitations

Reliance on the quality of the 'Golden' CoT solutions; if ground truth has errors, injection is flawed.
The taxonomy is specific to mathematical reasoning and may not transfer directly to other domains (e.g., creative writing, coding).
Requires a powerful teacher model (Llama-3-70B) for data synthesis, which incurs computational cost during the data generation phase.

Reproducibility

Code: https://github.com/du-nlp-lab/FG-PRM

Code and datasets are available in the repository linked above. The paper details the taxonomy and prompting strategies for data generation in the Appendix.

📊 Experiments & Results

Evaluation Setup

Mathematical reasoning on GSM8K and MATH datasets.

Benchmarks:

GSM8K (Grade school math problems)
MATH (Challenging competition math problems: algebra, geometry, number theory, etc.)

Metrics:

F1 Score (for detection)
Accuracy (for verification/solution selection)
Statistical methodology: Not explicitly reported in the paper
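For reference, the F1 score used in the detection task is the harmonic mean of precision and recall. The sketch below computes it for binary step labels; treating 1 as "step contains the hallucination type" is an assumption of this example, and how the paper averages F1 across types is not specified here.

```python
from typing import Sequence


def f1_score(gold: Sequence[int], pred: Sequence[int], positive: int = 1) -> float:
    """Binary F1 for the given positive label; returns 0.0 with no true positives."""
    tp = sum(1 for g, p in zip(gold, pred) if g == positive and p == positive)
    fp = sum(1 for g, p in zip(gold, pred) if g != positive and p == positive)
    fn = sum(1 for g, p in zip(gold, pred) if g == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Two true positives, one false positive, one false negative.
print(f1_score([1, 0, 1, 1], [1, 1, 0, 1]))
```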

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Hallucination Detection Task: FG-PRM compares favorably against prompting powerful LLMs (ChatGPT, Claude) for identifying specific error types.
MATH (Synthetic Test Set)	F1 Score	Not reported as single aggregate number (See notes)	Not reported as single aggregate number (See notes)	Not reported as single aggregate number
Verification Task: Ranking 64 candidate solutions generated by Llama-3-70B and selecting the best one.
GSM8K	Accuracy	See note	See note	See note
MATH	Accuracy	See note	See note	See note

Experiment Figures

Comparison between Coarse-grained detection (Standard PRM) and Fine-grained detection (FG-PRM).

Main Takeaways

Fine-grained supervision (classifying error types) improves verification accuracy more than coarse binary supervision.
Automated data generation using a taxonomy allows for scalable training of PRMs without expensive human annotation.
The model excels particularly at distinguishing between different types of errors, which generic PRMs often conflate.

📚 Prerequisite Knowledge

Prerequisites

Chain-of-Thought (CoT) prompting
Reinforcement Learning from Human Feedback (RLHF) concepts
Supervised Fine-Tuning (SFT)

Key Terms

ORM: Outcome Reward Model, which evaluates the correctness of an entire reasoning chain based only on the final result

PRM: Process Reward Model, which evaluates the correctness of each individual step in a reasoning chain

FG-PRM: Fine-Grained Process Reward Model, the proposed model that detects specific types of errors at each step

Intrinsic Hallucination: Errors contradicting the input context or logical consistency within the generated text

Extrinsic Hallucination: Errors verifiable against external knowledge, such as calculation errors or factual inconsistencies

GSM8K: Grade School Math 8K, a benchmark of grade-school level math word problems

MATH: A dataset of challenging mathematics problems from high school competitions