Reasoning-Aware Proxy Reward Model using Process Mining

📝 Paper Summary

Reinforcement Learning for Reasoning
Sparse Reward Optimization
Process Mining for LLMs

TACReward improves mathematical reasoning in large language models by using process mining to align the structural reasoning steps of a student model with a teacher model, generating a scalar proxy reward.

Core Problem

Sparse reward methods (like GRPO) rely on binary outcome correctness, which fails to distinguish between good and bad reasoning steps within incorrect answers, leading to weak learning signals.

Why it matters:

Binary outcome rewards produce uniform signals for grouped responses, failing to differentiate near-correct reasoning from complete hallucinations
Process Reward Models (PRMs) require expensive human annotation for step-level labels and are difficult to integrate into sparse reward frameworks without architectural changes
Existing proxy rewards often estimate overall quality indirectly rather than evaluating the logical integrity of individual reasoning steps

Concrete Example: When a student model solves a math problem, it might use 8 correct reasoning steps but fail at the final calculation, receiving a 0 reward (same as a completely nonsensical answer). TACReward assigns a partial score (e.g., 0.8) by detecting that the reasoning structure aligns well with a teacher's trace.
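
The effect of uniform rewards can be illustrated with a minimal sketch of group-normalized advantages in the style of GRPO. The function name, the use of the population standard deviation, and the proxy scores below are assumptions made for illustration; they are not taken from the paper.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages: (r - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std is an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers, all wrong: the binary outcome reward is 0 for each.
outcome_only = [0.0, 0.0, 0.0, 0.0]

# Hypothetical proxy scores: one near-correct chain and three weaker ones.
proxy = [0.8, 0.1, 0.3, 0.0]
combined = [o + p for o, p in zip(outcome_only, proxy)]

flat = group_advantages(outcome_only)   # every advantage is zero
shaped = group_advantages(combined)     # the near-correct chain ranks highest
```

With outcome rewards alone, every advantage in the group is zero and the policy gradient carries no signal; adding a proxy score separates the near-correct chain from the rest.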

Key Novelty

Trace, Alignment, and Check Reward (TACReward)

Treats reasoning chains as structured processes (traces) of distinct activities (e.g., 'Define Variable', 'Calculate') rather than just text
Uses process mining alignment techniques to map student reasoning steps to a teacher's reference trace, calculating costs for missing or redundant steps
Computes a scalar conformance score based on 'fitness' (completeness) and 'precision' (avoiding hallucinated steps) to serve as a proxy reward in sparse RL settings

Architecture

Overview of the TACReward pipeline: (1) Trace extraction from Policy and Teacher, (2) Process Discovery from Policy Trace, (3) Alignment with Teacher Trace, (4) Conformance Checking to produce scalar reward.

Evaluation Highlights

+89.2% relative improvement in average accuracy for GSPO + TACReward compared to GSPO alone across five mathematical benchmarks (excluding contaminated ones)
Consistent relative gains in average accuracy over the RLOO (+6.1%) and GRPO (+12.7%) baselines on the Qwen2.5-7B-Instruct model
Achieves 32.5% average accuracy with GSPO+TAC vs 17.2% with GSPO alone on challenging math tasks like MINERVA and OlympiadBench

Breakthrough Assessment

7/10

Novel application of process mining to LLM reasoning. Provides a clever way to get 'dense-like' signal from sparse rewards without human labels, showing significant gains on top of recent methods like GRPO/GSPO.

⚙️ Technical Details

Problem Definition

Setting: Post-training of Large Reasoning Models (LRMs) using sparse reward policy gradient methods

Inputs: Mathematical query x

Outputs: Reasoning chain and final answer y

Pipeline Flow

Group: Trace Formalization (LLM Extraction)
Group: Process Mining (Discovery & Alignment)
Group: Reward Calculation (Conformance Scoring)

System Modules

Trace Formalizer

Converts raw text reasoning into structured event logs with standardized activity labels

Model or implementation: DeepSeek-V3.2
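
One way to picture the structured event log is as a list of labeled events per reasoning attempt. The sketch below is illustrative only: the class names, fields, and the 'State Answer' label are hypothetical, while 'Define Variable' and 'Calculate' are the example activities mentioned above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Event:
    activity: str  # standardized activity label
    text: str      # raw reasoning step the label was assigned to


@dataclass
class Trace:
    case_id: str                                   # one reasoning attempt
    events: List[Event] = field(default_factory=list)

    def activities(self) -> List[str]:
        """Return the ordered activity labels, dropping the raw text."""
        return [e.activity for e in self.events]


trace = Trace(
    case_id="attempt-0",
    events=[
        Event("Define Variable", "Let x be the number of apples."),
        Event("Calculate", "2x + 3 = 11, so x = 4."),
        Event("State Answer", "The answer is 4."),  # hypothetical label
    ],
)
```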

Process Discoverer (Process Mining)

Creates a process model from the policy trace to represent its structure

Model or implementation: Inductive Miner (IM) algorithm
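
The Inductive Miner itself is too involved to reproduce here, but it works from the directly-follows relation of the log. The sketch below computes only that relation, not the full algorithm; the function name and the sample trace are assumptions for illustration.

```python
from collections import Counter


def directly_follows(traces):
    """Count how often activity a is immediately followed by activity b."""
    dfg = Counter()
    for trace in traces:
        for a, b in zip(trace, trace[1:]):
            dfg[(a, b)] += 1
    return dfg


# A single policy trace, written as a list of activity labels.
policy_log = [["Define Variable", "Calculate", "Calculate"]]
dfg = directly_follows(policy_log)
```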

Alignment Checker (Process Mining)

Aligns the teacher's reference trace with the policy's process model to find deviations

Model or implementation: Alignment algorithm (standard process mining)
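
As a rough illustration of alignment, the sketch below aligns a teacher trace against a single policy run rather than against a full process model, which is a deliberate simplification. Unit costs, the `>>` skip symbol, the fitness normalization, and the labels 'Set Up Equation' and 'State Answer' are assumptions of this sketch.

```python
SKIP = ">>"


def align(trace, model_run):
    """Return (cost, moves) for a cheapest alignment of two activity lists.

    (a, a) is a synchronous move (cost 0); (a, SKIP) is a move on the trace
    only and (SKIP, b) a move on the model only (cost 1 each).
    """
    n, m = len(trace), len(model_run)
    # cost[i][j] = cheapest alignment of trace[i:] with model_run[j:]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n, -1, -1):
        for j in range(m, -1, -1):
            if i == n and j == m:
                continue
            options = []
            if i < n:
                options.append(cost[i + 1][j] + 1)
            if j < m:
                options.append(cost[i][j + 1] + 1)
            if i < n and j < m and trace[i] == model_run[j]:
                options.append(cost[i + 1][j + 1])
            cost[i][j] = min(options)
    # Walk forward through the table to recover one optimal alignment.
    moves, i, j = [], 0, 0
    while i < n or j < m:
        same = i < n and j < m and trace[i] == model_run[j]
        if same and cost[i][j] == cost[i + 1][j + 1]:
            moves.append((trace[i], model_run[j]))
            i, j = i + 1, j + 1
        elif i < n and cost[i][j] == cost[i + 1][j] + 1:
            moves.append((trace[i], SKIP))
            i += 1
        else:
            moves.append((SKIP, model_run[j]))
            j += 1
    return cost[0][0], moves


teacher = ["Define Variable", "Set Up Equation", "Calculate", "State Answer"]
policy = ["Define Variable", "Calculate", "Calculate", "State Answer"]
cost, moves = align(teacher, policy)
# Worst case: every step of both sequences is a non-synchronous move.
fitness = 1 - cost / (len(teacher) + len(policy))
```

Here the missing 'Set Up Equation' step and the redundant second 'Calculate' each contribute one unit of cost, which is the kind of deviation the alignment is meant to expose.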

Reward Computer

Calculates scalar reward based on fitness and precision of the alignment

Model or implementation: Analytical formula (F1 score of Fitness/Precision)
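
The F1 combination of fitness and precision is their harmonic mean. A minimal sketch, where the function name and the zero-handling convention are assumptions:

```python
def tac_score(fitness: float, precision: float) -> float:
    """Harmonic mean (F1) of fitness and precision; 0.0 if both are zero."""
    if fitness + precision == 0:
        return 0.0
    return 2 * fitness * precision / (fitness + precision)


# Example values chosen for illustration only.
score = tac_score(0.75, 1.0)
```

Because the harmonic mean is dominated by the smaller term, a trace that is complete but padded with unsupported steps (high fitness, low precision) still scores low.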

Novel Architectural Elements

Integration of Process Mining alignment (typically used for business logs) into the RL reward loop for LLMs
Use of 'Trace-Align-Check' pipeline to convert structural reasoning deviations into a scalar reward

Modeling

Base Model: Qwen2.5-7B-Instruct (and 1.5B variant)

Training Method: Sparse Reward Policy Gradient (RLOO, GRPO, GSPO variants)

Objective Functions:

Purpose: Maximize expected reward using importance sampling and group-based baselines.

Formally: Standard policy gradient with advantage estimator A(x,y) = R(x,y) - b(x)

Purpose: Integrate TACReward into the sparse reward signal.

Formally: Final Reward = Outcome Reward + β * TACReward (integrated directly as task reward)
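
The two formulas above can be combined in a short sketch that uses a leave-one-out mean as the baseline b(x), in the style of RLOO. The function names, the value of β, and the sample rewards are assumptions for illustration; the summary does not state the value of β used in the paper.

```python
def final_reward(outcome: float, tac: float, beta: float) -> float:
    """Final Reward = Outcome Reward + beta * TACReward."""
    return outcome + beta * tac


def leave_one_out_advantages(rewards):
    """A_i = R_i - mean of the other rewards sampled for the same query."""
    k = len(rewards)
    if k < 2:
        raise ValueError("need at least two samples per query")
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]


beta = 0.5                          # hypothetical weight
outcomes = [1.0, 0.0, 0.0, 0.0]     # binary correctness of four samples
tac = [0.9, 0.8, 0.2, 0.0]          # hypothetical proxy scores
rewards = [final_reward(o, t, beta) for o, t in zip(outcomes, tac)]
advantages = leave_one_out_advantages(rewards)
```

The second and third samples are both incorrect, yet they receive different advantages because their proxy scores differ.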

Training Data:

DeepMath-103k dataset used for training

Key Hyperparameters:

learning_rate: 1e-6
batch_size: 128 (global)
micro_batch_size: 4 (for 7B models)
optimization_steps: 240
kl_coefficient: 0
optimizer: AdamW

Compute: 4x NVIDIA H200 GPUs
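
For reference, the reported settings can be collected into a small configuration object. The class and field names are hypothetical, and the gradient accumulation figure assumes plain data parallelism across the four GPUs, which the summary does not confirm.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    learning_rate: float = 1e-6
    global_batch_size: int = 128
    micro_batch_size: int = 4       # reported for the 7B models
    optimization_steps: int = 240
    kl_coefficient: float = 0.0
    optimizer: str = "AdamW"
    num_gpus: int = 4

    def grad_accum_steps(self) -> int:
        """Accumulation steps needed to reach the global batch size."""
        per_step = self.micro_batch_size * self.num_gpus
        if self.global_batch_size % per_step:
            raise ValueError("global batch must be a multiple of micro batch x GPUs")
        return self.global_batch_size // per_step


config = TrainConfig()
accum = config.grad_accum_steps()
```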

Comparison to Prior Work

vs. PRMs: TACReward requires no human annotation; relies on teacher model traces and structural alignment.
vs. Math-Shepherd: TACReward uses explicit structural alignment via process mining rather than statistical rollouts or clustering [not cited in paper].
vs. Standard Sparse Reward (GRPO/GSPO): TACReward adds a proxy signal for reasoning process quality, not just final answer correctness.

Limitations

Dependency on a strong teacher model (DeepSeek R1) to generate high-quality reference traces
Computational overhead of process mining alignment steps during training loop
Limited evaluation to mathematical reasoning tasks; applicability to general reasoning is untested
Performance of the PPO baseline decreased when TACReward was added, suggesting sensitivity to the choice of RL algorithm

Reproducibility

Code: https://github.com/Pusan-Namsan/TACReward

Code and model available at GitHub and HuggingFace. Taxonomy of 20 activities provided in Table 1. Teacher model (DeepSeek R1) and extraction model (DeepSeek-V3.2) are public. Prompts for trace formalization are in Appendix B.

📊 Experiments & Results

Evaluation Setup

Mathematical reasoning on multiple benchmarks using Pass@1

Benchmarks:

MATH-500 (Mathematical Problem Solving)
MINERVA (Mathematical Reasoning)
OlympiadBench (Competition Math)
LiveMathBench (Real-world Math Problems)
CSAT Math Calculus (Exam Math)

Metrics:

Pass@1 Accuracy
Average Score (excluding contaminated benchmarks)
Statistical methodology: Not explicitly reported in the paper

Key Results

Comparative analysis of sparse reward methods with and without TACReward on Qwen2.5-7B-Instruct (240 steps).
Benchmark	Metric	Baseline	This Paper	Δ
Average (5 benchmarks)	Accuracy	30.8	34.7	+3.9
Average (5 benchmarks)	Accuracy	33.0	35.0	+2.0
MINERVA	Accuracy	13.2	26.5	+13.3
OlympiadBench	Accuracy	19.9	37.5	+17.6
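
The relative gains quoted in the Evaluation Highlights can be checked against the absolute averages reported in this summary. The helper below is a simple consistency check; its name is an assumption.

```python
def relative_gain(baseline: float, improved: float) -> float:
    """Relative improvement over the baseline, in percent."""
    return 100.0 * (improved - baseline) / baseline


# (baseline, with TACReward) average accuracies taken from this summary.
pairs = [(30.8, 34.7), (33.0, 35.0), (17.2, 32.5)]
gains = [round(relative_gain(b, t), 1) for b, t in pairs]
```

The first two pairs reproduce the +12.7% and +6.1% figures. The third gives about 89.0% from the rounded averages, close to the reported +89.2%; the small gap is presumably a rounding effect, though that is an assumption.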

Experiment Figures

Example of an Event Log and a derived Process Model (Petri net), illustrating how traces are structured.

Main Takeaways

TACReward consistently improves performance across RLOO, GRPO, and GSPO baselines, with the most dramatic gains for GSPO.
The method is effective early in optimization (within 240 steps), suggesting it provides a more informative learning signal than sparse rewards alone.
PPO does not benefit from TACReward in this setup, likely due to its different handling of dense rewards and advantages.
The approach successfully bridges the gap between sparse outcome rewards and dense process supervision without human labels.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (Policy Gradients)
Process Mining concepts (Event logs, Petri nets)
Large Language Models (Reasoning traces)

Key Terms

Process Mining: Techniques to analyze process data (event logs) to discover, monitor, and improve real processes

Conformance Checking: A process mining task that compares an observed event log with a reference process model to find deviations

Event Log: A hierarchical data structure recording the execution of a process, consisting of traces and events with attributes

Trace: A sequence of events corresponding to a single execution (here, one reasoning attempt)

Alignment: A mapping between moves in a trace and moves in a process model to minimize deviation cost

Inductive Miner (IM): An algorithm used to discover a process model (like a Petri net) from an event log

Fitness: A metric measuring how much of the observed behavior (trace) can be explained by the model

Precision: A metric measuring how much the model forbids behavior that was not observed in the trace (avoiding underfitting)

GRPO: Group Relative Policy Optimization—a sparse reward RL method that normalizes rewards within a group of samples to reduce variance without a critic

GSPO: Group Sequence Policy Optimization, a variant of GRPO that computes importance ratios and clipping at the sequence level rather than per token

RLOO: REINFORCE Leave-One-Out, a policy gradient method that uses the mean reward of the other samples drawn for the same prompt as a baseline to reduce variance

DeepSeek R1: A strong reasoning model used here as the 'teacher' to generate reference traces

Think tags: XML-style tags (<think>...</think>) used to enclose the reasoning process in model outputs