SOAEsV2-7B/72B: Full-Pipeline Optimization for State-Owned Enterprise LLMs via Continual Pre-Training, Domain-Progressive SFT and Distillation-Enhanced Speculative Decoding

📝 Paper Summary

Domain-specific Large Language Models, Industrial LLMs, Inference Acceleration

SOAEsV2-7B/72B optimizes Chinese State-Owned Enterprise language models through continual pre-training, curriculum-driven progressive fine-tuning, and logit-distilled speculative decoding for faster model inference.

Core Problem

Existing domain-specific Large Language Models (LLMs) for State-Owned Enterprises suffer from limited model capacity, abrupt domain shifts during single-stage fine-tuning, and slow inference speeds for larger models.

Why it matters:

Overemphasizing generalizability leads to shallow integration of industry-specific knowledge, limiting accurate decision support for complex industrial tasks.
Small 7-billion parameter models lack the capacity for deep domain knowledge, while large 72-billion parameter models suffer from critical inference latency bottlenecks in practical deployments.
Traditional single-stage supervised fine-tuning neglects progressive knowledge transfer, preventing models from smoothly adapting from general capabilities to specialized domain expertise.

Concrete Example: Traditional single-stage fine-tuning struggles with abrupt domain shifts when moving directly to expert State-Owned Assets and Enterprises (SOAEs) tasks, whereas progressive training smooths this transition by first building foundational competencies like financial analysis.

Key Novelty

Full-Pipeline SOAEs LLM Optimization Framework

Injects domain knowledge into a massive 72-billion parameter model via continual pre-training on meticulously filtered corpora, enhancing capacity without catastrophic forgetting.
Uses a two-stage curriculum, first training on weakly-relevant general dialogues, then on expert-annotated domain data to sequentially build specialized mastery.
Accelerates inference using a small 7-billion parameter draft model aligned via logit distillation, reducing latency through a speculative decoding mechanism.

Architecture

The full-pipeline optimization framework for SOAEsV2, detailing Continual Pre-Training, Domain-Progressive SFT, and Distillation-Enhanced Speculative Decoding.

Evaluation Highlights

Domain-specific pre-training maintains 99.8% of original general capabilities while improving domain Rouge-1 by 1.08x and BLEU-4 by 1.17x.
Domain-progressive Supervised Fine-Tuning (SFT) outperforms single-stage training, achieving 1.02x improvement in Rouge-1 and 1.06x in BLEU-4.
Speculative decoding achieves a 1.39x to 1.52x decoding speedup for the 72-billion parameter model without any loss in accuracy.

Breakthrough Assessment

7/10

Provides a robust, end-to-end framework integrating proven techniques (curriculum learning, speculative decoding) specifically tailored and successfully scaled up for the Chinese State-Owned Enterprise domain.

⚙️ Technical Details

Problem Definition

Setting: Domain-specific adaptation and efficient inference of Large Language Models for State-Owned Assets and Enterprises (SOAEs).

Inputs: Natural language prompts such as open-domain knowledge questions, data-driven decision requests, or professional report requirements.

Outputs: Industry-compliant professional reports, accurate Q&A responses, and specialized decision recommendations.

Pipeline Flow

Draft Model Generation (7B) -> Target Model Verification (72B)

System Modules

Draft Model

Autoregressively propose a block of n tokens quickly based on the current context.

Model or implementation: SOAEsV2-7B (draft model)

Target Model

Evaluate the proposed context in parallel and verify the drafted tokens, either accepting them or resampling upon rejection.

Model or implementation: SOAEsV2-72B (target model)
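The draft-then-verify loop described by these two modules can be illustrated with a minimal sketch. It uses the standard speculative sampling acceptance rule (accept a drafted token with probability min(1, p/q), otherwise resample from the residual distribution), which is an assumption here since the summary does not state the exact rule used for SOAEsV2. The names `speculative_step`, `toy_draft`, and `toy_target` are hypothetical stand-ins over a 3-token vocabulary, not the 7B and 72B models.

```python
import random

def sample(probs, rng):
    """Draw an index from a discrete distribution given as a list of probabilities."""
    r = rng.random()
    acc = 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1  # guard against floating-point round-off

def speculative_step(context, draft_probs, target_probs, n, rng):
    """One draft-then-verify round; returns the tokens to append to `context`."""
    # 1) The draft model proposes n tokens autoregressively.
    drafted, q_dists = [], []
    ctx = list(context)
    for _ in range(n):
        q = draft_probs(ctx)
        tok = sample(q, rng)
        drafted.append(tok)
        q_dists.append(q)
        ctx.append(tok)

    # 2) The target model scores every drafted position. A real system does
    #    this in one batched forward pass; this toy uses a plain loop.
    accepted = []
    ctx = list(context)
    for tok, q in zip(drafted, q_dists):
        p = target_probs(ctx)
        if rng.random() < min(1.0, p[tok] / q[tok]):
            accepted.append(tok)
            ctx.append(tok)
            continue
        # Rejection: resample from the normalized residual max(0, p - q), then stop.
        residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
        total = sum(residual)
        accepted.append(sample([r / total for r in residual], rng))
        return accepted

    # 3) Every draft was accepted, so the target contributes one extra token.
    accepted.append(sample(target_probs(ctx), rng))
    return accepted

def toy_target(ctx):
    """Hypothetical target: strongly prefers (last token + 1) mod 3."""
    probs = [0.1, 0.1, 0.1]
    probs[((ctx[-1] if ctx else 0) + 1) % 3] = 0.8
    return probs

def toy_draft(ctx):
    """Hypothetical draft: same preference as the target, but less peaked."""
    probs = [0.2, 0.2, 0.2]
    probs[((ctx[-1] if ctx else 0) + 1) % 3] = 0.6
    return probs

if __name__ == "__main__":
    print(speculative_step([0], toy_draft, toy_target, n=4, rng=random.Random(0)))
```

Each round emits between 1 and n + 1 tokens. The closer the draft distribution is to the target distribution, the more drafted tokens survive verification, which is the motivation for aligning the two models through distillation.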

Novel Architectural Elements

A logit-distilled draft-target speculative decoding pair in which both models are trained on the same progressive domain-specific data pipeline, so that domain specialization and inference acceleration are handled within one framework.

Modeling

Base Model: 72B-scale and 7B-scale base LLMs (the exact base architecture is not named)

Training Method: Continual Pre-Training followed by Domain-Progressive Supervised Fine-Tuning and Logit Distillation

Objective Functions:

Purpose: Align the output distributions of the draft model and target model via logit-level distillation to ensure the draft model serves as a reliable proxy during speculative decoding.

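Formally: L = α * L_distill(z_T(x)/τ, z_D(x)/τ) + (1 - α) * L_Task, where z_T(x) and z_D(x) are the target and draft model logits for input x, τ is the distillation temperature, and α weights the distillation term against the task loss.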
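As a rough illustration of this objective, the sketch below evaluates it for a single token position. The summary does not say which divergence L_distill is or how L_Task is defined, so two common choices are assumed: KL divergence between temperature-softened softmax distributions, and cross-entropy against the ground-truth token. The function names and the default values of `alpha` and `tau` are hypothetical.

```python
import math

def softmax(logits, tau=1.0):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [z / tau for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def distillation_loss(target_logits, draft_logits, label, alpha=0.5, tau=2.0):
    """L = alpha * L_distill + (1 - alpha) * L_Task for one token position."""
    # Assumed L_distill: KL between softened target and draft distributions.
    p_target = softmax(target_logits, tau)
    p_draft = softmax(draft_logits, tau)
    l_distill = kl_divergence(p_target, p_draft)
    # Assumed L_Task: cross-entropy of the draft model on the true token.
    l_task = -math.log(softmax(draft_logits)[label])
    return alpha * l_distill + (1.0 - alpha) * l_task

if __name__ == "__main__":
    print(distillation_loss([2.0, 0.5, -1.0], [1.5, 0.7, -0.5], label=0))
```

Many implementations also multiply the distillation term by τ² to keep gradient magnitudes comparable across temperatures. That factor is left out here because the stated formula does not include it.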

Adaptation: Full fine-tuning (continual pre-training + SFT)

Training Data:

Continual pre-training: 17B tokens filtered from SOAEs-DataSuite based on credibility, relevance, and importance
SFT Stage 1: Filtered subsets from Infinity-Instruct-7M/Gen and LongWriter retaining overlapping domains (finance, law, business, politics)
SFT Stage 2: ~33k expert-annotated and GLM4-generated SOAEs samples (1k report generation, rest Q&A). 80% train / 20% dev/test split.
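The two-stage data flow listed above could be organized along the following lines. This is only a sketch of the ordering and the 80/20 split. The record field `domain`, the stage names, and the `train_fn` callback are hypothetical, and the actual filtering criteria and training loop are not described in this summary.

```python
import random

# Domains the summary lists as overlapping with SOAEs topics.
OVERLAPPING_DOMAINS = {"finance", "law", "business", "politics"}

def filter_stage1(records):
    """Keep general-instruction records whose (hypothetical) `domain` tag overlaps."""
    return [r for r in records if r.get("domain") in OVERLAPPING_DOMAINS]

def split_train_heldout(records, train_frac=0.8, seed=0):
    """Shuffle deterministically, then split into train and held-out (dev/test) parts."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

def progressive_sft(model, stage1_data, stage2_train, train_fn):
    """Run the stages in curriculum order: weakly relevant data, then expert data."""
    stages = (("stage1_general", stage1_data), ("stage2_expert", stage2_train))
    for name, data in stages:
        # train_fn is a caller-supplied stand-in for a real fine-tuning routine.
        model = train_fn(model, data, name)
    return model
```

Passing the training routine in as a callback keeps the ordering logic separate from whichever fine-tuning stack is actually used.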

Compute: Not reported in the paper

Comparison to Prior Work

vs. SOAEs-DataSuite: Expands model capacity to 72B parameters and implements domain-progressive SFT instead of traditional single-stage SFT.
vs. Sequential Training: Prioritizes computational resources on a larger 72B model using exclusively domain-specific data rather than mixing extensive general-purpose corpora.
vs. Standard Prompt Lookup Decoding [not cited in paper]: Employs logit-level distillation between a 7B draft and 72B target model, yielding higher token acceptance rates than heuristic pattern matching.

Limitations

Lack of absolute baseline scores against other state-of-the-art general or domain-specific large language models.
No statistical significance tests reported for the Rouge and BLEU score improvements.
Exact base model architecture (e.g., LLaMA, Qwen) prior to continual pre-training is not explicitly named.
Details on the computational resources and exact training durations required for the 72B model are omitted.

Reproducibility

No replication artifacts are mentioned in the paper. Code, data subsets, prompt templates, and trained model weights (SOAEsV2-7B/72B) are not released. The pipeline relies on proprietary expert-annotated datasets and synthetic data generated via a closed-source model (GLM4).

📊 Experiments & Results

Evaluation Setup

Evaluation of domain-specific text generation and knowledge adaptation comparing full-pipeline methods to baselines

Benchmarks:

SOAEs Domain Tasks (internal split): domain-specific text generation covering Q&A and reports [New]

Metrics:

Rouge-1
BLEU-4
Speedup Ratio
Statistical methodology: Not explicitly reported in the paper

Main Takeaways

Continual pre-training robustly injects domain knowledge into the 72B model, improving domain performance by 1.08x in Rouge-1 and 1.17x in BLEU-4 while maintaining 99.8% of original general capabilities.
Curriculum-driven domain-progressive Supervised Fine-Tuning (SFT) outperforms traditional single-stage training, yielding a 1.02x improvement in Rouge-1 and 1.06x in BLEU-4.
Distillation-enhanced speculative decoding with a closely aligned 7B draft model accelerates 72B inference by 1.39x to 1.52x without any sacrifice in generation accuracy.
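To see how draft-target alignment relates to a speedup of this size, the following back-of-the-envelope sketch uses the standard speculative decoding analysis. It assumes each drafted token is accepted independently with a fixed probability, so the expected number of tokens per verification round is (1 - a^(n+1)) / (1 - a) for acceptance rate a and draft length n. The acceptance rates, draft length, and draft-to-target cost ratio in the usage loop are hypothetical values, not figures from the paper.

```python
def expected_tokens_per_round(accept_rate, draft_len):
    """Expected tokens emitted per round under an i.i.d. acceptance assumption."""
    if accept_rate >= 1.0:
        return float(draft_len + 1)  # every draft accepted, plus one target token
    return (1.0 - accept_rate ** (draft_len + 1)) / (1.0 - accept_rate)

def estimated_speedup(accept_rate, draft_len, cost_ratio):
    """Estimated speedup over plain target-only decoding.

    cost_ratio is the time of one draft forward pass divided by the time of one
    target forward pass; a round costs draft_len draft passes plus one target pass.
    """
    round_cost = draft_len * cost_ratio + 1.0
    return expected_tokens_per_round(accept_rate, draft_len) / round_cost

if __name__ == "__main__":
    # Hypothetical settings, chosen only to show the trend.
    for rate in (0.6, 0.7, 0.8):
        print(rate, round(estimated_speedup(rate, draft_len=4, cost_ratio=0.15), 2))
```

Under this model the speedup grows with the acceptance rate and shrinks as the draft model becomes more expensive relative to the target.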

📚 Prerequisite Knowledge

Prerequisites

Large Language Models (LLMs)
Supervised Fine-Tuning (SFT)
Speculative Decoding
Curriculum Learning

Key Terms

SOAEs: State-Owned Assets and Enterprises—the specific Chinese industrial and economic sector targeted by this domain model.

Domain-Progressive SFT: A curriculum learning strategy that gradually shifts training data from weakly relevant general conversations to highly specialized expert data.

Speculative Decoding: An inference acceleration technique where a small, fast model drafts tokens and a larger, accurate model verifies them in parallel.

Logit Distillation: A training technique in which a smaller model learns to match a larger model's output distribution, using the larger model's logits (pre-softmax scores) as soft targets so that their outputs align.

Rouge-1: A metric evaluating text generation quality by measuring the overlap of unigrams (single words) between the generated text and a reference.
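A minimal sketch of the unigram-overlap computation follows. It assumes whitespace tokenization and reports precision, recall, and F1. The paper's tokenization (for example, character-level or word-segmented Chinese) and the variant it reports are not stated in this summary, and the function name `rouge1` is illustrative.

```python
from collections import Counter

def rouge1(candidate, reference):
    """Return (precision, recall, F1) of clipped unigram overlap."""
    cand = Counter(candidate.split())
    ref = Counter(reference.split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```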

BLEU-4: A metric evaluating text quality by combining the precision of 1-gram through 4-gram overlaps between the model output and the reference text, with a penalty for outputs that are too short.
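BLEU-4 is commonly computed as the geometric mean of clipped 1-gram to 4-gram precisions multiplied by a brevity penalty. The sketch below shows a sentence-level, single-reference version without smoothing, assuming whitespace tokenization. The paper's exact BLEU configuration is not given in this summary, so the function name `bleu4` and these choices are illustrative.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu4(candidate, reference):
    """Unsmoothed sentence-level BLEU-4 against a single reference."""
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, 5):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        # Clip each candidate n-gram count by its count in the reference.
        matched = sum((cand_counts & ref_counts).values())
        total = sum(cand_counts.values())
        if matched == 0 or total == 0:
            return 0.0  # without smoothing, any zero precision zeroes the score
        log_precisions.append(math.log(matched / total))
    # Brevity penalty: penalize candidates shorter than the reference.
    if len(cand) > len(ref):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1.0 - len(ref) / len(cand))
    return brevity_penalty * math.exp(sum(log_precisions) / 4)
```

Corpus-level BLEU aggregates n-gram counts over all sentences before taking the ratio, so averaging this sentence-level score would not reproduce a corpus-level figure.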

SFT: Supervised Fine-Tuning—training a language model on high-quality instruction-response pairs to teach it how to follow user commands.

Catastrophic forgetting: When a neural network abruptly loses much of its previously learned general knowledge after being trained on new, more specific data.

Curriculum learning: A training strategy that presents data in a meaningful order (e.g., from general to specific) rather than randomly.

GLM4: General Language Model 4, a proprietary large language model used in this paper to generate synthetic Q&A pairs.