Evidence-Driven Reasoning for Industrial Maintenance Using Heterogeneous Data

📝 Paper Summary

Industrial AI, Agentic AI, Decision support systems

Condition Insight separates deterministic evidence construction from constrained LLM synthesis to produce auditable industrial maintenance explanations that adhere to rigorous engineering failure semantics.

Core Problem

Industrial maintenance data is fragmented across unstructured work orders, heterogeneous sensors, and engineering knowledge, making it difficult for standard LLMs to reason without hallucinating or violating physical constraints.

Why it matters:

Practitioners currently spend 20–30 minutes manually reconciling data across disjoint CMMS, SCADA, and IoT platforms for a single asset
Unconstrained generative agents in reliability-critical settings pose safety risks by producing fluent but unsupported recommendations
Existing predictive maintenance systems produce alerts but lack the conditional reasoning to explain *why* an action is warranted based on history

Concrete Example: Operational indicators often have inconsistent naming across IoT platforms. A naive agent might misinterpret a 'runtime' counter as a 'fault' count. This system abstracts raw meters into behavioral summaries (e.g., 'drift') and aligns them with failure modes via Optimal Transport before the LLM reasons, preventing misinterpretation.
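As a minimal sketch of what abstracting a raw meter into a behavioral summary could look like: the function name `summarize_meter`, the half-versus-half comparison, and the 5% threshold are illustrative assumptions, not the rule used in the paper.

```python
from statistics import mean


def summarize_meter(readings, drift_threshold=0.05):
    """Label a meter series as 'stable', 'drift_up', or 'drift_down'.

    Hypothetical rule: compare the mean of the later half of the series
    with the mean of the earlier half, relative to the earlier mean.
    """
    if len(readings) < 4:
        return "not_enough_data"  # too few points to call a trend
    half = len(readings) // 2
    early, late = mean(readings[:half]), mean(readings[half:])
    if early == 0:
        return "not_enough_data"  # relative change is undefined
    change = (late - early) / abs(early)
    if change > drift_threshold:
        return "drift_up"
    if change < -drift_threshold:
        return "drift_down"
    return "stable"


# The LLM would receive the label, not the raw counter values.
label = summarize_meter([10, 10, 12, 12])
```

Passing a label such as `drift_up` instead of raw values means the model never has to guess what an ambiguously named counter measures.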

Key Novelty

Trajectory-Controlled Evidence-Driven Reasoning

Decouples reasoning into two distinct stages: deterministic evidence construction (math/rules) and constrained LLM synthesis (narrative)
Uses Unbalanced Optimal Transport to mathematically align unstructured work orders with structured Failure Modes (FMEA) before the LLM sees the data
Implements a post-generation deterministic verification loop that cross-checks LLM conclusions against hard operational rules

Architecture

The Condition Insight reasoning pipeline illustrating the separation of deterministic processing and LLM reasoning

Evaluation Highlights

Condition Agreement Rate (CAR) with operational rules improved from 0.70 to 0.91 by switching from naive to constrained prompting
Analysis time per asset reduced from 20–30 minutes (manual) to 15–30 seconds (automated)
Unsupported Claim Rate (UCR) stayed between 0.003 and 0.008 while rule compliance increased

Breakthrough Assessment

8/10

Strong practical contribution demonstrating how to operationalize LLMs in high-stakes industrial environments by strictly bounding generative capabilities with deterministic engineering physics and rules.

⚙️ Technical Details

Problem Definition

Setting: Generate evidence-grounded condition explanations and recommendations for industrial assets

Inputs: Heterogeneous maintenance evidence: unstructured historical work orders, asset-level operational indicators (meters), and structured failure knowledge (FMEA)

Outputs: Interpretable condition narrative, condition category (e.g., Normal, Needs Attention, or Not Enough Data), and advisory actions

Pipeline Flow

Data Ingestion (IoT/CMMS) → Analytics & Evidence Construction (Deterministic) → Structured Evidence Packet
Structured Evidence Packet → Constrained LLM Reasoning → Draft Summary
Draft Summary → Deterministic Verification Loop → Final Condition Insight
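The three stages above can be sketched as a chain of plain functions. Everything named here (`EvidencePacket`, `construct_evidence`, `draft_summary`, `verify`, `stub_llm`) is a hypothetical stand-in chosen for illustration; the paper's actual interfaces are not described at this level of detail.

```python
from dataclasses import dataclass, field


@dataclass
class EvidencePacket:
    """Structured facts handed to the LLM (hypothetical shape)."""
    asset_id: str
    facts: list = field(default_factory=list)


def construct_evidence(asset_id, work_orders, meter_summaries):
    """Stage 1 (deterministic): turn raw inputs into structured facts."""
    facts = [f"work_order: {text}" for text in work_orders]
    facts += [f"meter {name}: {label}"
              for name, label in sorted(meter_summaries.items())]
    return EvidencePacket(asset_id, facts)


def draft_summary(packet, llm):
    """Stage 2 (generative): the LLM sees only the evidence packet."""
    prompt = "Answer using only these facts:\n" + "\n".join(packet.facts)
    return llm(prompt)


def verify(draft, packet):
    """Stage 3 (deterministic): override the draft when evidence is absent."""
    if not packet.facts:
        return {"category": "Not Enough Data", "narrative": ""}
    return draft


def condition_insight(asset_id, work_orders, meter_summaries, llm):
    packet = construct_evidence(asset_id, work_orders, meter_summaries)
    return verify(draft_summary(packet, llm), packet)


def stub_llm(prompt):
    """Stand-in for a real model call; echoes the last fact it was given."""
    return {"category": "Needs Attention",
            "narrative": prompt.splitlines()[-1]}


insight = condition_insight("pump-1", ["bearing replaced"],
                            {"vibration": "drift_up"}, stub_llm)
```

Keeping the LLM behind a plain callable makes it straightforward to swap backbones or to test the deterministic stages without a model.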

System Modules

Evidence Constructor

Transform raw heterogeneous data into structured 'asset_facts'

Model or implementation: Deterministic algorithms + Unbalanced Optimal Transport (UOT)
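To illustrate the UOT step, here is a minimal pure-Python sketch of entropic unbalanced optimal transport with KL-relaxed marginals, solved by the standard scaling iterations. The cost matrix, the masses, and the parameters `eps` and `rho` are made-up values; how the paper builds its costs (for example, from text-embedding distances) and which solver it uses are not stated here, so treat this as an assumption-laden example rather than the authors' implementation.

```python
import math


def unbalanced_sinkhorn(a, b, cost, eps=0.1, rho=1.0, n_iter=200):
    """Entropic unbalanced OT via scaling iterations.

    a, b : nonnegative masses (their totals may differ)
    cost : len(a) x len(b) matrix of alignment costs
    eps  : entropic regularization strength
    rho  : penalty on deviating from the marginals (KL relaxation)
    Returns the transport plan as a nested list.
    """
    n, m = len(a), len(b)
    K = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    power = rho / (rho + eps)  # exponent < 1 softens the marginal constraints
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iter):
        u = [(a[i] / sum(K[i][j] * v[j] for j in range(m))) ** power
             for i in range(n)]
        v = [(b[j] / sum(K[i][j] * u[i] for i in range(n))) ** power
             for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]


# Hypothetical example: 2 work orders vs. 3 failure modes.
cost = [[0.1, 0.9, 0.8],
        [0.9, 0.2, 0.7]]
plan = unbalanced_sinkhorn([1.0, 1.0], [1.0, 1.0, 1.0], cost)

# For each work order, pick the failure mode that receives the most mass.
best_mode = [max(range(3), key=lambda j: row[j]) for row in plan]
```

Because the marginals are only softly enforced, a failure mode with no good match can receive less mass than it nominally holds, which is the property that makes the unbalanced variant suitable when the two sides differ in total relevance.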

Domain LLM Agent

Synthesize narrative explanation from structured evidence

Model or implementation: GPT-OSS (also tested Mistral-Medium, LLaMA-4-Maverick, Granite)

Verification Loop

Govern decision elements and suppress unsupported conclusions

Model or implementation: Rule-based Logic
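A rule-based verification pass of this kind might look like the sketch below. The draft schema, the fact identifiers, and both example rules are hypothetical and exist only to show the mechanism: claims without traceable evidence are dropped, and a deterministic rule that fires takes precedence over the LLM's category.

```python
def verify_draft(draft, asset_facts, rules):
    """Deterministically check an LLM draft against operational rules.

    draft       : {"category": str,
                   "claims": [{"text": str, "evidence_ids": [str]}]}
    asset_facts : dict mapping fact id -> value
    rules       : callables taking asset_facts, returning a category or None
    """
    # Suppress claims that cite no evidence or cite unknown evidence.
    supported = [
        c for c in draft["claims"]
        if c["evidence_ids"]
        and all(e in asset_facts for e in c["evidence_ids"])
    ]
    # The first rule that fires governs the category; the LLM's label
    # survives only when no rule applies.
    category = draft["category"]
    for rule in rules:
        ruled = rule(asset_facts)
        if ruled is not None:
            category = ruled
            break
    return {"category": category,
            "claims": supported,
            "overridden": category != draft["category"]}


# Hypothetical operational rules.
def no_evidence_rule(facts):
    return "Not Enough Data" if not facts else None


def open_critical_rule(facts):
    return ("Needs Attention"
            if facts.get("open_critical_work_orders", 0) > 0 else None)


draft = {
    "category": "Normal",
    "claims": [
        {"text": "Vibration is drifting upward",
         "evidence_ids": ["meter_vibration"]},
        {"text": "Seal failure is imminent",
         "evidence_ids": ["lab_report_7"]},  # not in the evidence packet
    ],
}
facts = {"open_critical_work_orders": 2, "meter_vibration": "drift_up"}
final = verify_draft(draft, facts, [no_evidence_rule, open_critical_rule])
```

Recording `overridden` alongside the final category keeps an audit trail of where the deterministic layer disagreed with the model.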

Novel Architectural Elements

Trajectory-controlled architecture separating deterministic signal extraction from generative synthesis
Verification-first design where a deterministic loop explicitly governs the LLM's high-level classification decisions
Integration of Unbalanced Optimal Transport (UOT) for semantic alignment within the evidence construction phase

Modeling

Base Model: GPT-OSS (primary reasoning backbone)

Compute: Inference takes 15–30 seconds per asset

Comparison to Prior Work

vs. Pure Predictive Maintenance: Provides natural language explanations grounded in history, not just numerical alerts
vs. Naive LLM Agents: Uses deterministic evidence construction and verification loops to prevent hallucination, whereas naive agents operate directly on raw data
vs. Standard RAG [not cited in paper]: Standard RAG retrieves text chunks; this system constructs behavioral abstractions (trends/drifts) and mathematically aligns them to failure modes before generation

Limitations

Heavy reliance on the quality of the deterministic evidence construction rules
The share of conservative 'Not Enough Data' classifications can be high when evidence is sparse
Requires integration with specific CMMS and IoT schemas, which vary by enterprise

📊 Experiments & Results

Evaluation Setup

Deployed in enterprise CMMS operating on production data

Benchmarks:

Production Asset Portfolio (Real-world industrial maintenance decision support) [New]

Metrics:

Unsupported Claim Rate (UCR)
Condition Agreement Rate (CAR)
Mean Insight Count (MIC)
High Specificity Rate (HSR)
Statistical methodology: Not explicitly reported in the paper
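Based on the definitions given under Key Terms, three of these metrics can be sketched as simple ratios. The function names and input shapes are assumptions, MIC is read here as the average number of insights per asset, and HSR is omitted because this summary does not define it.

```python
def condition_agreement_rate(llm_labels, rule_labels):
    """CAR: fraction of assets where the LLM's category matches the
    deterministic rule-based category."""
    if not llm_labels or len(llm_labels) != len(rule_labels):
        raise ValueError("label lists must be non-empty and equal length")
    matches = sum(l == r for l, r in zip(llm_labels, rule_labels))
    return matches / len(llm_labels)


def unsupported_claim_rate(claim_is_supported):
    """UCR: fraction of generated claims not traceable to input evidence.

    claim_is_supported: list of booleans, one per generated claim.
    """
    if not claim_is_supported:
        raise ValueError("at least one claim is required")
    unsupported = sum(not s for s in claim_is_supported)
    return unsupported / len(claim_is_supported)


def mean_insight_count(insights_per_asset):
    """MIC: average number of insights produced per asset (assumed reading)."""
    if not insights_per_asset:
        raise ValueError("at least one asset is required")
    return sum(len(i) for i in insights_per_asset) / len(insights_per_asset)


# Toy data, not taken from the paper.
car = condition_agreement_rate(
    ["Normal", "Needs Attention", "Normal", "Normal"],
    ["Normal", "Needs Attention", "Needs Attention", "Normal"],
)
ucr = unsupported_claim_rate([True, True, True, False])
mic = mean_insight_count([["a", "b"], ["c"], ["d", "e", "f"]])
```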

Key Results

Prompt constraint experiments demonstrate that explicit rule-aligned prompting significantly improves adherence to governance logic (CAR) and reduces unnecessary verbosity (MIC).

Benchmark	Metric	Baseline	This Paper	Δ
Production Asset Portfolio	Condition Agreement Rate (CAR)	0.70	0.91	+0.21
Production Asset Portfolio	Mean Insight Count (MIC)	4.9	3.2	-1.7
Production Asset Portfolio	Unsupported Claim Rate (UCR)	0.007	0.003	-0.004

Experiment Figures

An example of the final Condition Insight Summary displayed in a UI-agnostic format

Main Takeaways

Constrained prompting raises rule compliance (CAR) from 0.70 to 0.91 compared to naive prompting without sacrificing grounding (UCR falls from 0.007 to 0.003)
The system acts conservatively; a substantial portion of assets are classified as 'Not Enough Data' when evidence is sparse
Structuring evidence (abstracting meters and FMEA) before reasoning is more effective at reducing unsupported claims than prompting alone
Per-asset analysis time drops from 20–30 minutes of manual review to 15–30 seconds of automated insight generation

📚 Prerequisite Knowledge

Prerequisites

Industrial maintenance workflows (CMMS)
Failure Modes and Effects Analysis (FMEA)
Optimal Transport theory
Large Language Model prompting and verification

Key Terms

CMMS: Computerized Maintenance Management Systems—software that maintains a database of information about an organization's maintenance operations

FMEA: Failure Modes and Effects Analysis—a systematic method for evaluating processes to identify where and how they might fail

SCADA: Supervisory Control and Data Acquisition—control system architecture for high-level process supervisory management

UOT: Unbalanced Optimal Transport—a mathematical method used here to align distributions of work-order text with failure mode descriptions even when their total mass (relevance) differs

Condition Agreement Rate (CAR): A metric measuring how often the LLM's classification matches a deterministic rule-based baseline

Unsupported Claim Rate (UCR): The proportion of generated statements that cannot be traced back to specific evidence in the input package

GPT-OSS: A specific open-source reasoning backbone model used in the paper's experiments

FMEA-derived semantics: Constraints and failure definitions derived from engineering documentation to bound what the AI is allowed to diagnose