
ThreadWeaver: Adaptive Threading for Efficient Parallel Reasoning in Language Models

L Lian, S Wang, F Juefei
Meta Superintelligence Labs, University of California, Berkeley, University of California, San Francisco
arXiv, November 2025
Reasoning RL

📝 Paper Summary

Inference Acceleration · Chain-of-Thought Reasoning · Reinforcement Learning for LLMs
ThreadWeaver enables LLMs to adaptively decompose complex reasoning into parallel threads, reducing inference latency while maintaining accuracy without requiring custom inference engines.
Core Problem
Sequential decoding in LLMs creates high latency for complex reasoning tasks, and existing parallelization methods either degrade accuracy, require custom inference engines, or lack high-quality training data.
Why it matters:
  • Inference latency scales linearly with chain-of-thought length, making complex reasoning prohibitively slow
  • Prior adaptive methods often require specialized serving infrastructure (modifying KV caches/attention), hindering deployment
  • Current parallel approaches struggle to match the accuracy of sequential long chain-of-thought baselines on hard math problems
Concrete Example: In a math problem requiring modulo operations for two different numbers (97 and 101), a sequential model calculates the first remainder then the second. ThreadWeaver spawns two parallel threads to calculate both remainders simultaneously, then joins them to apply the Chinese Remainder Theorem, reducing the time to answer.
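The fork-join pattern in the example above can be sketched in plain Python. The thread pool and function names are illustrative only, not ThreadWeaver's actual fork/join tokens; the point is that the two remainders are independent, so they can be computed concurrently and then combined with the Chinese Remainder Theorem at the join.

```python
# Illustrative fork-join sketch: two independent remainders computed in
# parallel, then joined via the Chinese Remainder Theorem (CRT).
from concurrent.futures import ThreadPoolExecutor

N = 123456789  # an arbitrary example number whose residues we need

def remainder(n, m):
    return n % m

# Fork: the two modulo computations are independent, so run them concurrently.
with ThreadPoolExecutor(max_workers=2) as pool:
    f97 = pool.submit(remainder, N, 97)
    f101 = pool.submit(remainder, N, 101)
    a, b = f97.result(), f101.result()

# Join: CRT recovers N mod (97 * 101) from the two residues.
def crt(a, m, b, n):
    # Solve x ≡ a (mod m), x ≡ b (mod n) for coprime m, n.
    inv = pow(m, -1, n)                      # modular inverse of m mod n
    return (a + m * ((b - a) * inv % n)) % (m * n)

x = crt(a, 97, b, 101)
assert x == N % (97 * 101)
```

In ThreadWeaver the parallelism happens inside a single decoding pass via fork-join tokens rather than OS threads, but the dependency structure is the same: independent branches, then a join that consumes both results.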
Key Novelty
Engine-Agnostic Adaptive Parallel Reasoning via Fork-Join Tokens
  • Uses a two-stage data generation pipeline (LLM rewriting + self-training) to create high-quality parallel reasoning trajectories from sequential chains
  • Employs a trie-based training method that flattens parallel branches into a single sequence with ancestor-only attention, enabling training on standard hardware
  • Introduces Parallelization-Aware GRPO (P-GRPO), an RL algorithm that broadcasts trajectory-level advantages to all parallel branches to jointly optimize accuracy and latency
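The trie-based training idea in the second bullet can be sketched as follows. This is a minimal sketch under assumed mechanics, not the paper's implementation: segments of a parallel trajectory form a trie, the trie is flattened into one token sequence, and an attention mask allows each token to attend only to positions in its own segment's ancestor chain, so sibling branches cannot see each other. How the join re-attends to branch outputs is model-specific; here the join segment is treated as a child of the shared prefix purely for illustration.

```python
# Sketch (assumed mechanics): flatten a parallel-reasoning trie into a single
# sequence and build an ancestor-only attention mask, so tokens in one branch
# cannot attend to tokens in a sibling branch. Segment layout is illustrative.
import numpy as np

# Each segment: (segment_id, parent_id, tokens). parent_id=None for the root.
segments = [
    (0, None, ["<think>", "split", "work"]),   # shared prefix
    (1, 0,    ["branch", "A"]),                # parallel thread A
    (2, 0,    ["branch", "B"]),                # parallel thread B
    (3, 0,    ["join", "answer"]),             # join (child of root here)
]
parent = {sid: pid for sid, pid, _ in segments}

def ancestors(seg_id):
    """Set of segment ids on the path from seg_id up to the root."""
    seen = set()
    while seg_id is not None:
        seen.add(seg_id)
        seg_id = parent[seg_id]
    return seen

# Flatten: record which segment each token position belongs to.
seg_of_pos = [sid for sid, _, toks in segments for _ in toks]
n = len(seg_of_pos)

# mask[q, k] = True iff query position q may attend to key position k.
mask = np.zeros((n, n), dtype=bool)
for q in range(n):
    allowed = ancestors(seg_of_pos[q])
    for k in range(q + 1):                     # causal in the flattened order
        mask[q, k] = seg_of_pos[k] in allowed

# Sibling isolation: branch B tokens never attend to branch A tokens.
a_pos = [i for i, s in enumerate(seg_of_pos) if s == 1]
b_pos = [i for i, s in enumerate(seg_of_pos) if s == 2]
assert not mask[np.ix_(b_pos, a_pos)].any()
```

Because the branches are flattened into one ordinary sequence and the parallel structure lives entirely in the mask, training needs nothing beyond a standard transformer stack with a custom attention mask, which is what makes the method engine-agnostic at training time.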
Evaluation Highlights
  • Achieves 1.53x speedup on Minerva Math and 1.14x on AIME24 while matching or exceeding the accuracy of the sequential Qwen3-8B baseline
  • Outperforms the larger 32B Multiverse model (53.8%) and Parallel-R1 (19.4%) on AIME24 accuracy, reaching 79.9% with a higher degree of self-parallelism
  • Reduces critical-path length from 15.1k to 13.2k tokens on average across six math benchmarks without degrading solution quality
Breakthrough Assessment
8/10
Successfully combines adaptive parallelization with standard inference engines and RL, solving the accuracy degradation issue common in prior parallel reasoning works. High practical utility.