CoT: Chain-of-Thought—prompting an LLM to generate intermediate reasoning steps before the final answer.
ToT: Tree-of-Thought—a method where LLMs explore multiple reasoning branches (thoughts) at each step, using self-evaluation to prune bad branches.
CPO: Chain of Preference Optimization—the proposed method that trains LLMs using preference pairs (good vs. bad thoughts) derived from ToT search trees.
DPO: Direct Preference Optimization—an algorithm that optimizes LLMs to align with preferences by minimizing a specific loss function on winner/loser pairs, without a separate reward model.
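For reference, the DPO loss can be written explicitly. Here $\pi_\theta$ is the policy being trained, $\pi_{\mathrm{ref}}$ is a frozen reference model, $(x, y_w, y_l)$ is a prompt with its preferred (winner) and dispreferred (loser) responses, $\sigma$ is the sigmoid, and $\beta$ controls deviation from the reference:

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]
$$

Minimizing this loss raises the likelihood of winner responses relative to losers without fitting a separate reward model.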
SFT: Supervised Fine-Tuning—training a model on high-quality examples (input-output pairs).
BFS: Breadth-First Search—a tree search algorithm that explores all nodes at the present depth level before moving on to nodes at the next depth level.
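As a concrete illustration of the level-by-level order BFS describes, a minimal sketch (the tree here is a toy adjacency dict for illustration, not data from the source):

```python
from collections import deque

def bfs(tree, root):
    """Visit every node at the current depth before descending to the next."""
    order = []
    queue = deque([root])  # FIFO queue drives the level-by-level order
    while queue:
        node = queue.popleft()
        order.append(node)
        queue.extend(tree.get(node, []))  # enqueue children for the next level
    return order

tree = {"A": ["B", "C"], "B": ["D", "E"], "C": ["F"]}
print(bfs(tree, "A"))  # ['A', 'B', 'C', 'D', 'E', 'F']
```

Swapping the FIFO queue for a LIFO stack would yield depth-first search instead.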
MCTS: Monte Carlo Tree Search—a heuristic search algorithm that estimates the value of actions by running many randomized simulations (rollouts) from candidate states and expanding the search tree toward the most promising branches.
RFT: Rejection Sampling Fine-Tuning—generating multiple samples, filtering for correct answers, and fine-tuning on those correct paths.
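The RFT loop (sample, filter for correctness, keep correct paths for fine-tuning) can be sketched as follows; `sample` and `correct` are illustrative stand-ins for model sampling and answer checking, not functions from the source:

```python
import random

def rft_collect(prompts, sample, correct, k=8):
    """Sample k candidate paths per prompt; keep only those judged correct."""
    data = []
    for prompt in prompts:
        for _ in range(k):
            path = sample(prompt)        # draw one candidate reasoning path
            if correct(prompt, path):    # filter: retain only correct answers
                data.append((prompt, path))
    return data  # fine-tuning set of (prompt, correct path) pairs

# Toy stand-ins: the "model" sometimes returns the right sum, sometimes not.
random.seed(0)
prompts = [(1, 2), (3, 4)]
sample = lambda p: random.choice([sum(p), sum(p) + 1])
correct = lambda p, ans: ans == sum(p)
ft_data = rft_collect(prompts, sample, correct)
```

Every pair retained in `ft_data` passes the correctness check by construction; the resulting set would then be used for ordinary supervised fine-tuning.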