
Ensembling Large Language Models with Process Reward-Guided Tree Search for Better Complex Reasoning

Sungjin Park, Xiao Liu, Yeyun Gong, Edward Choi
Microsoft Research, Korea Advanced Institute of Science and Technology
North American Chapter of the Association for Computational Linguistics (2024)
Tags: Reasoning, RL, Benchmark

📝 Paper Summary

Topics: LLM Ensembling, Reasoning
LE-MCTS solves complex math problems by treating reasoning as a tree search where heterogeneous LLMs generate intermediate steps guided by a process reward model.
Core Problem
Open-source LLMs struggle with complex reasoning, and existing ensemble methods fall short: token-level ensembles require strictly matching vocabularies, while output-level ensembles cannot correct intermediate logic errors.
Why it matters:
  • Token-level ensembles fail when models use different vocabularies or architectures
  • Output-level ensembles (ranking completed answers) fail if all candidate solutions contain errors
  • Complex reasoning requires step-by-step verification to catch errors early, which holistic output ensembles miss
Concrete Example: If three LLMs all generate wrong final answers for a hard math problem, a standard voting ensemble fails. LE-MCTS can combine a correct first step from Model A with a correct second step from Model B to find the solution.
Key Novelty
Language model Ensemble with Monte Carlo Tree Search (LE-MCTS)
  • Process-level ensembling: Instead of merging tokens or final answers, the system mixes reasoning steps from different LLMs within a single search tree
  • Optimistic backpropagation: Updates node values based on the maximum value of children (finding the single best path) rather than the average, accommodating varying LLM capabilities
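The optimistic backpropagation idea can be sketched in a few lines. This is a minimal illustration with hypothetical class and function names, not the authors' implementation: where standard MCTS averages simulation rewards into each ancestor, here each ancestor takes the maximum over its children, so a node is scored by the best reasoning path passing through it.

```python
# Sketch of optimistic backpropagation (hypothetical names, not the
# paper's code). A leaf's value comes from a process reward model; each
# ancestor is updated with the max over its children rather than a
# running average, so nodes reflect the single best path through them.

class Node:
    def __init__(self, parent=None):
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value = 0.0  # value of the best known path through this node

def backpropagate_optimistic(leaf, reward):
    """Walk from an evaluated leaf to the root, taking the max over
    children at each ancestor instead of averaging rewards."""
    leaf.visits += 1
    leaf.value = reward  # leaf value = process reward model score
    node = leaf.parent
    while node is not None:
        node.visits += 1
        node.value = max(child.value for child in node.children)
        node = node.parent
```

With averaging, a node whose children mix one strong and several weak continuations would be penalized; taking the max accommodates pools where only some LLMs produce a good next step.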
Architecture
Figure 1 (implied): Conceptual flow of the LE-MCTS framework: tree search over reasoning steps generated by a pool of LLMs.
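The expansion phase of this flow can be sketched as follows. All interfaces here (`llm_pool`, `prm`, the call signatures) are hypothetical stand-ins for illustration, not the paper's API: each expansion samples candidate next steps from different models in a heterogeneous pool, and a process reward model scores each partial reasoning trace.

```python
# Minimal sketch of process-level ensembling during tree expansion
# (hypothetical interfaces). Each candidate next step may come from a
# different LLM, and a process reward model (PRM) scores the partial
# solution so far, guiding which branch the tree search grows next.
import random

def expand(problem, steps_so_far, llm_pool, prm, n_samples=3):
    """Sample candidate next steps from a pool of heterogeneous LLMs and
    score each extended partial trace with the PRM."""
    candidates = []
    for _ in range(n_samples):
        llm = random.choice(llm_pool)                 # heterogeneous pool
        step = llm(problem, steps_so_far)             # one reasoning step
        score = prm(problem, steps_so_far + [step])   # step-level reward
        candidates.append((step, score))
    return candidates
```

Because steps from different models live in the same tree, a trace can mix, say, Model A's first step with Model B's second, which is exactly the behavior the concrete example above describes.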
Evaluation Highlights
  • +3.6% accuracy improvement on the MATH dataset compared to the second-best method (Best-of-Ensemble)
  • +4.3% accuracy improvement on the MQA dataset compared to the second-best method
  • Achieves highest average performance across five math benchmarks, surpassing token-level (EVA) and output-level (LLM-Blender) ensembles
Breakthrough Assessment
7/10
Strong conceptual advance in moving ensembling to the process level via MCTS. Demonstrated significant gains on hard benchmarks (MATH/MQA), though the computational cost of MCTS is a known limitation.