Domaino1s: Guiding LLM Reasoning for Explainable Answers in High-Stakes Domains

📝 Paper Summary

Domain-specific LLM Reasoning
Chain-of-Thought (CoT) optimization
Explainable AI (XAI)

Domaino1s enhances high-stakes domain reasoning by fine-tuning LLMs on structured CoT data and employing a perplexity-guided tree search to autonomously expand and select optimal reasoning paths.

Core Problem

Standard LLMs in high-stakes domains (finance/law) often generate brief, unexplainable answers or follow flawed single-pass reasoning chains that accumulate errors.

Why it matters:

Users in high-stakes fields require explainability to trust decisions; black-box answers are insufficient
Single-pass CoT lacks self-correction; early errors propagate through the entire chain, leading to legal or financial risks
Existing o1-type reasoning models have not yet been effectively adapted or explored for specific high-stakes domain constraints

Concrete Example: In stock prediction, a standard CoT model might misinterpret a 'strategic initiative' early in the reasoning chain. Because it cannot backtrack, it builds the rest of its financial analysis on this initial error, resulting in an incorrect 'positive' price prediction.

Key Novelty

Domaino1s (Domain-specific o1-style reasoning)

Fine-tunes models on domain-specific CoT data (Finance/Legal) where structured reasoning steps are learned but special tokens are removed, forcing the model to autonomously organize its thinking process
Introduces 'Selective Tree Exploration', a search strategy that uses token perplexity as a proxy for value, expanding the reasoning tree only when model uncertainty (perplexity) is high

Architecture

Comparison between Standard CoT and Domaino1s inference processes. It illustrates how Standard CoT propagates errors linearly, while Domaino1s uses a tree search to explore and correct paths.

Evaluation Highlights

Achieves 57.29% accuracy on Stock Investment Recommendation, outperforming standard CoT (53.12%) and o1-like baselines
Reaches 78.33% average accuracy on Legal Reasoning QA, surpassing Domain-CoT (77.09%) and Lawma-8B (44.46%)
Proposed Selective Tree Exploration balances performance and cost, achieving higher accuracy than Best-of-N while using fewer tokens

Breakthrough Assessment

7/10

Successfully adapts o1-style reasoning to specific domains with a practical, perplexity-based search method. While the architectural innovation is moderate, the application to high-stakes domains and the new explainability metric (PROOF-Score) are valuable.

⚙️ Technical Details

Problem Definition

Setting: Multi-step reasoning for domain QA where a solution process is decomposed into T reasoning steps

Inputs: Domain-specific question q (e.g., stock data or legal case)

Outputs: Final answer paired with a complete, explainable reasoning chain

Pipeline Flow

Step Generation (Model proposes next reasoning step)
Perplexity Check (System evaluates confidence of step)
Expansion/Resampling (If perplexity > threshold, regenerate K candidates)
Selection (Choose candidate with lowest perplexity)
Loop (Repeat until final answer)
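
The five stages above can be sketched as a single loop. This is a minimal illustration, not the authors' implementation: `Step`, `propose`, `threshold`, `k`, and `max_steps` are hypothetical names, and `propose` stands in for a call to the fine-tuned model that returns one reasoning step together with its perplexity. The sketch also assumes that only the K regenerated candidates compete during selection, which the summary does not specify.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    text: str          # text of one reasoning step
    perplexity: float  # model perplexity over the tokens of this step
    is_final: bool     # True when the step contains the final answer


def selective_tree_exploration(
    propose: Callable[[List[str]], Step],  # hypothetical model wrapper
    threshold: float,
    k: int,
    max_steps: int = 16,
) -> List[str]:
    """Build a reasoning chain, resampling only at uncertain steps."""
    chain: List[str] = []
    for _ in range(max_steps):
        step = propose(chain)                                   # Step Generation
        if step.perplexity > threshold:                         # Perplexity Check
            candidates = [propose(chain) for _ in range(k)]     # Expansion/Resampling
            step = min(candidates, key=lambda s: s.perplexity)  # Selection
        chain.append(step.text)
        if step.is_final:                                       # Loop until final answer
            break
    return chain
```

With a confident model, `propose` is called once per step, so the cost matches standard CoT; extra calls occur only at steps whose perplexity exceeds `threshold`.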

System Modules

Reasoning Step Generator

Auto-regressively generate the next logical step in the domain reasoning chain

Model or implementation: Domaino1s (fine-tuned Qwen-2.5-7B-Instruct)

Evaluator & Selector

Decide whether to accept the current step or search for better alternatives based on perplexity

Model or implementation: Same LLM (calculates its own perplexity)
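
Because the same LLM scores its own output, the evaluator only needs the per-token log-probabilities of the step it just generated. The sketch below uses the usual definition of perplexity (the exponential of the mean negative log-probability) and assumes natural-log probabilities; the function names and the threshold value are illustrative, not taken from the paper.

```python
import math
from typing import Sequence


def step_perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity = exp(mean negative log-probability), natural log assumed."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def needs_expansion(token_logprobs: Sequence[float], threshold: float) -> bool:
    """True when the step is uncertain enough to trigger resampling."""
    return step_perplexity(token_logprobs) > threshold
```

For example, a step whose tokens each had probability 0.5 has perplexity 2, so it would be resampled under a hypothetical threshold of 1.5, while a step whose tokens each had probability 0.9 (perplexity about 1.11) would be accepted.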

Novel Architectural Elements

Perplexity-guided Selective Tree Exploration: Integrating token-level perplexity checks directly into the generation loop to dynamically trigger candidate resampling and selection only at uncertain steps

Modeling

Base Model: Qwen-2.5-7B-Instruct (for Stock and Legal tasks)

Training Method: Supervised Fine-Tuning (SFT) on constructed CoT datasets

Objective Functions:

Purpose: Minimize the negative log-likelihood of the target reasoning tokens.

Formally: Standard causal language modeling loss.
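
Written out, a standard causal language modeling loss for a question q and a target reasoning sequence takes the form below. The notation is assumed here rather than taken from the paper: y_1, ..., y_N are the target tokens (N counts tokens, not the T reasoning steps), and θ denotes the model parameters. Whether the loss is summed or averaged over tokens, and whether question tokens are masked out, is not stated in this summary.

```latex
\mathcal{L}(\theta) = -\sum_{t=1}^{N} \log p_\theta\left(y_t \mid q,\, y_{<t}\right)
```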

Adaptation: Full fine-tuning (implied by context of SFT on 7B models)

Trainable Parameters: 7B

Training Data:

CoT-stock-2k: 2,000 samples generated by GPT-4o based on stock tweets and prices
CoT-legal-2k: 2,000 samples generated by GPT-4o based on legal QA datasets
Special tokens (e.g., <SUMMARY>) removed from answers to encourage autonomous structuring
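
The tag-removal step could look like the sketch below. It assumes the stage markers are uppercase, angle-bracketed open/close pairs such as `<SUMMARY>` and `</SUMMARY>`; only `<SUMMARY>` is named in the source, so the exact tag set, the closing-tag form, and the function name are assumptions.

```python
import re

# Assumption: stage markers are uppercase tags like <SUMMARY> and </SUMMARY>.
_STAGE_TAG = re.compile(r"</?[A-Z_]+>")


def strip_stage_tags(answer: str) -> str:
    """Remove stage markers but keep the text they wrapped."""
    without_tags = _STAGE_TAG.sub("", answer)
    # Collapse runs of spaces or tabs left behind by the removed tags.
    return re.sub(r"[ \t]{2,}", " ", without_tags).strip()
```

Applied to `"<SUMMARY>Revenue grew.</SUMMARY> <CONCLUSION>Positive</CONCLUSION>"` (where `<CONCLUSION>` is a made-up tag for illustration), this returns `"Revenue grew. Positive"`: the reasoning text survives, but the model no longer sees explicit stage boundaries during fine-tuning.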

Key Hyperparameters:

learning_rate: Not reported in the paper
batch_size: Not reported in the paper
epochs: 3

Compute: 8 × NVIDIA A800 GPUs

Comparison to Prior Work

vs. Best-of-N: Domaino1s performs search at the step level rather than the full response level, allowing correction of early errors
vs. Stage-level Beam Search: Domaino1s uses 'Selective' exploration, only expanding nodes when perplexity is high, reducing computational cost compared to expanding every node
vs. Tree of Thoughts (ToT) [not cited in paper]: Domaino1s uses perplexity as a heuristic for value rather than requiring a separate value model or prompt-based evaluation
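
The cost argument in the comparison with stage-level beam search can be made concrete with a rough count of step generations. This is a back-of-the-envelope model under simplifying assumptions that are not from the paper: every reasoning step costs one generation call, one candidate is kept per step, and selective exploration pays one initial call per step plus K extra calls at each uncertain step. The function names are illustrative.

```python
def generations_expand_every_step(num_steps: int, k: int) -> int:
    """K candidates are generated at every one of the reasoning steps."""
    return num_steps * k


def generations_selective(num_steps: int, k: int, uncertain_steps: int) -> int:
    """One call per step, plus K resamples at each high-perplexity step."""
    if not 0 <= uncertain_steps <= num_steps:
        raise ValueError("uncertain_steps must lie between 0 and num_steps")
    return num_steps + k * uncertain_steps
```

With 6 steps, K = 3, and 2 uncertain steps, this model gives 18 generations when every step is expanded versus 12 for selective exploration. The saving holds only while few steps are uncertain: if every step exceeded the threshold, the selective count (6 + 18 = 24) would be higher than expanding every step.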

Limitations

Relies on GPT-4o for generating training data, inheriting potential biases
Perplexity is used as a proxy for correctness, which is heuristic and may not always align with factual accuracy
Inference latency is higher than standard CoT due to the tree search mechanism (though lower than full Beam Search)

Reproducibility

Code: https://github.com/Hyalinesky/Domaino1s

The constructed datasets (CoT-stock-2k, CoT-legal-2k) are described in detail, but the main text gives no explicit download links for the data itself (they are likely in the repository). Hyperparameters such as the learning rate are not reported in the text.

📊 Experiments & Results

Evaluation Setup

Evaluated on stock investment recommendation (prediction) and legal reasoning QA (multiple choice/True-False).

Benchmarks:

Stock Investment Recommendation: binary classification (positive/negative price movement)
Legal Reasoning QA: multiple-choice and true/false questions

Metrics:

Accuracy
MCC (Matthews Correlation Coefficient)
F1 Score
PROOF-Score (Principled rating for reasoning completeness, domain safety, and factual accuracy): a newly proposed LLM-as-a-judge metric
Statistical methodology: Not explicitly reported in the paper
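
For the binary stock task, Accuracy, MCC, and F1 can all be derived from one confusion matrix. The sketch below uses the textbook definitions; whether the paper reports binary or macro-averaged F1 is not stated in this summary, so the binary (positive-class) variant here is an assumption, as is the function name.

```python
import math
from typing import Dict


def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Accuracy, MCC, and positive-class F1 from confusion-matrix counts."""
    total = tp + fp + tn + fn
    if total == 0:
        raise ValueError("confusion matrix is empty")
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    # MCC is conventionally set to 0 when any marginal count is zero.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "mcc": mcc, "f1": f1}
```

MCC is the most informative of the three when accuracy sits near 50%, as it does on the stock task: a classifier that always predicts one class can reach moderate accuracy on imbalanced data but scores an MCC of 0.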

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Performance on Stock Investment Recommendation task.
Stock Investment Recommendation	Accuracy (%)	53.12	57.29	+4.17
Stock Investment Recommendation	MCC	0.0510	0.1798	+0.1288
Performance on Legal Reasoning QA task.
Legal Reasoning QA	Average Accuracy (%)	77.09	78.33	+1.24
Explainability evaluation using the proposed PROOF-Score.
Stock Investment Recommendation	PROOF-Score	5.65	6.83	+1.18

Experiment Figures

Accuracy and Average Token consumption of different search strategies (Best-of-N, Beam Search, Selective Tree Exploration) on the Stock dataset.

A specific case study comparison in stock prediction between 'Inference without sampling' and 'Inference with sampling'.

Main Takeaways

Domaino1s consistently outperforms baselines (standard CoT, base models) in both accuracy and reasoning quality across financial and legal domains.
Selective Tree Exploration effectively balances the trade-off between search cost and performance; it activates mostly when model uncertainty (perplexity) is high.
The proposed PROOF-Score metric suggests that, even when accuracy is similar, Domaino1s produces safer and more complete reasoning chains than the baselines (6.83 vs. 5.65 on the stock task).
Ablation studies confirm that both the fine-tuning on structured CoT data and the inference-time search contribute to the performance gains.

📚 Prerequisite Knowledge

Prerequisites

Chain-of-Thought (CoT) prompting
Language Model Fine-tuning (SFT)
Search algorithms (Beam Search, Best-of-N)
Perplexity as a metric

Key Terms

CoT: Chain-of-Thought, a technique in which models generate intermediate reasoning steps before the final answer

o1-type models: Models designed to perform multi-stage reasoning with longer inference times (system 2 thinking) rather than single-pass generation

Selective Tree Exploration: The paper's proposed inference method that dynamically expands the reasoning tree based on perplexity thresholds rather than expanding every node

PROOF-Score: Principled rating for reasoning completeness, domain safety, and factual accuracy; a new metric proposed by the authors to evaluate explainability

Perplexity: A measurement of how well a probability model predicts a sample; used here as a proxy for the model's confidence in a reasoning step

SFT: Supervised Fine-Tuning, i.e., training a model on a labeled dataset to adapt it to a specific task