Dynamic Collaboration of Multi-Language Models based on Minimal Complete Semantic Units

📝 Paper Summary

LLM Ensembling · Token-level Collaboration · Inference-time Reasoning Enhancement

DDS improves LLM reasoning by dynamically selecting next-token distributions from multiple models based on distribution distance and aligning vocabularies using Minimal Complete Semantic Units (MCSU).

Core Problem

Naive ensemble methods assume more models are better, but weak models can degrade performance, and vocabulary mismatches across different tokenizers make direct probability averaging difficult.

Why it matters:

Simply adding more LLMs to an ensemble can hurt accuracy if the added models are incorrect or inconsistent (e.g., adding GLM to Qwen+Llama reduces accuracy)
Vocabulary misalignment prevents direct token-level collaboration because the same word is tokenized differently across models (e.g., 'Llama' vs 'Lla'+'ma')
Existing alignment methods (projection matrices) introduce noise and computational overhead

Concrete Example: For a math problem where Qwen and Llama answer correctly but GLM answers incorrectly, a naive ensemble of all three yields the wrong answer. DDS filters out GLM's divergent distribution, retaining only the consistent correct distributions.

Key Novelty

Distribution Distance-based Dynamic Selection (DDS) with Minimal Complete Semantic Units (MCSU)

Instead of averaging all models, calculate KL divergence between next-token distributions; retain only those close to each other (consensus) and discard outliers
Define MCSU (Minimal Complete Semantic Unit) as the smallest complete semantic string (e.g., whole words) to naturally align different tokenizers without complex projection matrices
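
The selection idea above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the distributions are toy dictionaries keyed by MCSU string, the distance is a symmetrised KL divergence, and the retention rule (keep a model if at least one other model lies within the threshold, fall back to all models if none do) is an assumption made for the example. The names `kl_divergence` and `select_consistent` are hypothetical.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two dicts mapping unit string -> probability.
    eps smooths units that are missing from q (an assumption of this sketch)."""
    return sum(
        prob * math.log((prob + eps) / (q.get(unit, 0.0) + eps))
        for unit, prob in p.items()
        if prob > 0.0
    )

def symmetric_kl(p, q):
    """Average of KL(p || q) and KL(q || p), so the distance is order-independent."""
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))

def select_consistent(dists, threshold):
    """Return indices of distributions that have at least one neighbour
    within `threshold`. Assumed rule; falls back to all models if none qualify."""
    if len(dists) < 2:
        return list(range(len(dists)))
    kept = []
    for i, p in enumerate(dists):
        nearest = min(symmetric_kl(p, q) for j, q in enumerate(dists) if j != i)
        if nearest <= threshold:
            kept.append(i)
    return kept or list(range(len(dists)))

# Toy next-unit distributions: two models agree, one diverges.
qwen = {"18": 0.90, "20": 0.10}
llama = {"18": 0.85, "20": 0.15}
glm = {"18": 0.10, "20": 0.90}

kept = select_consistent([qwen, llama, glm], threshold=0.1)
print(kept)  # [0, 1]: the divergent third distribution is dropped
```

With the threshold of 0.1 reported in the Reproducibility notes, the two agreeing toy distributions are retained and the divergent one is discarded, mirroring the Qwen/Llama/GLM example in the text.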

Architecture

Comparison between single LLM autoregression and the proposed DDS method. It illustrates how multiple LLMs generate next-token distributions, which are then processed via MCSU check and filtered by DDS.

Evaluation Highlights

+1.3% accuracy improvement on GSM8K using DDS with Qwen/Llama/GLM compared to the best single model (Qwen-2-7B)
+3.1% accuracy on CommonsenseQA (CSQA) compared to the best single model (Qwen-2-7B), outperforming standard majority voting
Shows emergent correctness: DDS answers correctly on some samples where all three component models answer incorrectly (e.g., DDS outputs '300 pages' while the single models output '150' or '75')

Breakthrough Assessment

7/10

Offers a training-free, logically sound solution to tokenizer misalignment and ensemble noise. While the gains are moderate (1-3%), the emergent capability to correct consensus errors is notable.

⚙️ Technical Details

Problem Definition

Setting: Autoregressive generation where the next semantic unit is selected from the aggregated distributions of K different LLMs.

Inputs: Context sequence of tokens/units generated so far.

Outputs: Next Minimal Complete Semantic Unit (MCSU) to append to the sequence.

Pipeline Flow

Step 1: Multiple LLMs generate next-token probabilities.
Step 2: If a token is not an MCSU, continue generation until an MCSU is formed.
Step 3: Filter distributions using KL divergence (DDS).
Step 4: Aggregate remaining distributions and sample next MCSU.
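
To make Step 2 concrete, the sketch below groups subword pieces into whole units so that two differently tokenized streams line up. It is only an illustration under stated assumptions: pieces are plain strings, a leading space marks the start of a new word, and a single punctuation character is its own unit. Real tokenizers use other boundary markers, and the paper's exact completeness check is not given here. `group_into_units` and `is_punctuation` are hypothetical helper names.

```python
import string

def is_punctuation(piece):
    """True if the piece is exactly one punctuation character (ignoring spaces)."""
    text = piece.strip()
    return len(text) == 1 and text in string.punctuation

def group_into_units(pieces):
    """Merge subword pieces into word-level units.
    Assumed boundary rule: a piece starting with a space, or a lone
    punctuation mark, closes the unit that was being built."""
    units, current = [], ""
    for piece in pieces:
        if piece.startswith(" ") or is_punctuation(piece):
            if current:
                units.append(current)   # previous unit is now complete
            if is_punctuation(piece):
                units.append(piece.strip())
                current = ""
            else:
                current = piece.strip() # start a new unit
        else:
            current += piece            # unit is still incomplete, keep extending
    if current:
        units.append(current)
    return units

# The same sentence under two toy tokenizations.
model_a = ["Llama", " is", " great", "."]
model_b = ["Lla", "ma", " is", " gre", "at", "."]

print(group_into_units(model_a))  # ['Llama', 'is', 'great', '.']
print(group_into_units(model_b))  # ['Llama', 'is', 'great', '.']
```

Once both streams are expressed in the same units, their distributions can be compared and combined by string key rather than by token index.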

System Modules

MCSU Generator

Ensures output is a complete semantic unit

Model or implementation: Qwen-2-7B, Llama-3-8B, GLM-4-9B (ensemble members)

Dynamic Selector (DDS)

Filters out divergent model distributions

Model or implementation: Statistical Module (KL Divergence)

Aggregator

Combines selected distributions

Model or implementation: Weighted Average
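
A weighted average over unit-keyed distributions might look like the following sketch. The weights are not specified in this summary, so uniform weights are assumed by default; the greedy pick follows the greedy decoding mentioned in the evaluation setup. `aggregate` and `pick_greedy` are hypothetical names.

```python
def aggregate(dists, weights=None):
    """Weighted average of dicts mapping unit string -> probability.
    Units missing from a model contribute probability 0 for that model.
    Uniform weights are an assumption of this sketch."""
    if weights is None:
        weights = [1.0] * len(dists)
    total = sum(weights)
    merged = {}
    for dist, weight in zip(dists, weights):
        for unit, prob in dist.items():
            merged[unit] = merged.get(unit, 0.0) + (weight / total) * prob
    return merged

def pick_greedy(merged):
    """Return the unit with the highest aggregated probability."""
    return max(merged, key=merged.get)

# Two retained models with partly different candidate units.
qwen = {"18": 0.6, "20": 0.4}
llama = {"18": 0.7, "16": 0.3}

merged = aggregate([qwen, llama])
print(pick_greedy(merged))  # 18
```

Keying by unit string means the models do not need a shared vocabulary: the union of candidate units is formed implicitly as the dictionaries are merged.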

Novel Architectural Elements

MCSU-based generation loop: Replaces standard token-level autoregression with variable-length semantic unit autoregression to synchronize heterogeneous models.
DDS filtering layer: A distinct inference-time step that dynamically removes model contributions based on distribution distance rather than static weights.

Modeling

Base Model: Ensemble of Qwen-2-7B, Llama-3-8B, GLM-4-9B

Training Method: Inference-time collaboration (Training-free)

Compute: Requires loading multiple LLMs (Qwen, Llama, GLM) into memory simultaneously. Inference time increases due to parallel forward passes and KL calculation. Experiments run on one Nvidia H800 GPU.

Comparison to Prior Work

vs. LLM-Blender: DDS integrates at the token/unit level rather than selecting a final answer, allowing for mixing reasoning steps.
vs. GAC/DEEPEN: DDS uses natural language alignment (MCSU) instead of learning projection matrices or relative representations, avoiding both extra training and the noise those mappings introduce.
vs. Majority Voting: DDS can generate correct answers even when the majority of models are wrong (emergent ability) by filtering distributions before the answer is finalized.
vs. Unite [cited]: Unite takes the union of top-k tokens; DDS goes further by ensuring semantic completeness (MCSU) and filtering outliers via KL divergence.

Limitations

Increased computational cost and latency compared to single models due to running multiple LLMs simultaneously.
Requires loading all models into VRAM, limiting deployment on resource-constrained edge devices.
Sensitive to the distance threshold parameter (epsilon); improper tuning can degrade performance.
Effectiveness depends on the capabilities of the component models; DDS generally cannot fix errors when all models are confidently wrong, although occasional emergent corrections are reported.

Reproducibility

Code: https://github.com/Fanye12/DDS

Uses standard open-source models (Qwen-2, Llama-3, GLM-4). The threshold hyperparameter epsilon is set to 0.1 based on statistical analysis. The experiments ran the three models concurrently on one Nvidia H800 GPU.

📊 Experiments & Results

Evaluation Setup

Zero-shot Chain-of-Thought (CoT) reasoning with greedy decoding.

Benchmarks:

GSM8K (Arithmetic Reasoning)
SVAMP (Arithmetic Reasoning)
CommonsenseQA (CSQA) (Commonsense Reasoning)
Date Understanding (Symbolic Reasoning, BigBench)
TruthfulQA (Reliability/Truthfulness)

Metrics:

Accuracy
ROUGE/BLEU/BLEURT (for TruthfulQA)
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
GSM8K	Accuracy	90.8	91.4	+0.6
SVAMP	Accuracy	90.8	91.6	+0.8
CSQA	Accuracy	71.9	76.0	+4.1
Date Understanding	Accuracy	65.1	68.8	+3.7
TruthfulQA	BLEURT	0.660	0.663	+0.003
GSM8K	Accuracy	90.8	91.6	+0.8

Experiment Figures

Motivation example showing that adding more models can hurt performance without selection. Qwen+Llama gets it right, but adding GLM makes it wrong.

Main Takeaways

DDS consistently improves over single models and standard ensembles (Majority Voting, LLM-Blender) across Arithmetic, Commonsense, and Symbolic reasoning tasks.
The 'Minimal Complete Semantic Unit' (MCSU) effectively solves vocabulary misalignment without complex training or projection matrices.
Adding more models does not always help; filtering outliers (DDS) is crucial when one model deviates significantly from the correct reasoning path.
Emergent capabilities observed: DDS can solve problems where all individual models fail, likely due to correcting early token-level deviations in the Chain-of-Thought process.

📚 Prerequisite Knowledge

Prerequisites

Transformer-based Language Models (tokenization, logits)
Ensemble Learning (Majority Voting)
Kullback-Leibler (KL) Divergence

Key Terms

MCSU: Minimal Complete Semantic Unit—a word, number, or punctuation mark representing the smallest unit of complete meaning, used to align different tokenizers (e.g., 'apple' is an MCSU, 'ap' is not).

DDS: Distribution Distance-based Dynamic Selection—a strategy to filter outlier probability distributions from an ensemble based on their KL divergence from the consensus.

KL divergence: A statistical measure quantifying how one probability distribution differs from a second, reference probability distribution.
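
For reference, the standard definition for discrete distributions P and Q over the same set of outcomes is given below. Which direction, or which symmetrised variant, the paper uses is not stated in this summary, so the formula should be read as the textbook definition only.

```latex
D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}
```

The value is non-negative, equals zero only when P and Q are identical, and is not symmetric: in general D_KL(P || Q) differs from D_KL(Q || P).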

Top-k sampling: A decoding strategy that considers only the k most likely next tokens to reduce the search space.

CoT: Chain-of-Thought—a prompting technique where the model generates intermediate reasoning steps before the final answer.

Vocabulary Misalignment: The issue where different models use different subword tokenizers, making their output probability vectors incompatible for direct element-wise operations.