PRM: Process Reward Model—a model that scores the correctness of intermediate steps in a reasoning chain, rather than just the final answer.
Bi-level Optimization: An optimization framework with two nested problems: an outer (upper) loop optimizing hyperparameters (here, domain weights) and an inner (lower) loop optimizing model parameters.
Monte Carlo Estimation: A method to estimate the correctness of an intermediate step by rolling out multiple future completions from that step and measuring the fraction that reach the correct final answer.
CoT: Chain-of-Thought—a prompting strategy where the model generates intermediate reasoning steps before the final answer.
MathVista: A benchmark for evaluating mathematical reasoning in visual contexts, used to test multimodal models.
Best-of-N: An inference strategy where N candidate solutions are generated, and a reward model selects the best one.
Domain Reweighting: Assigning a scalar importance weight to each dataset (domain) during training to balance how much each one influences the model.
Aggregation Function: A function (e.g., product, min, or sum) that combines step-level reward scores into a single trajectory-level score.
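
To make the Monte Carlo Estimation, Aggregation Function, and Best-of-N entries concrete, here is a minimal Python sketch of how the three fit together. It is an illustration only, not the implementation used in this work: the function names (`mc_step_score`, `aggregate`, `best_of_n`), the exact-match answer check, and the hard-coded step scores are all assumptions, and in practice the step scores would come from a trained PRM.

```python
import math
from typing import Callable, Dict, Sequence

def mc_step_score(rollout_answers: Sequence[str], gold_answer: str) -> float:
    """Monte Carlo estimate of a step's correctness: the fraction of
    completions rolled out from that step that reach the gold answer.
    Exact string match is an assumption made for this sketch."""
    if not rollout_answers:
        raise ValueError("need at least one rollout")
    hits = sum(1 for answer in rollout_answers if answer == gold_answer)
    return hits / len(rollout_answers)

# Aggregation functions named in the glossary: product, min, sum.
AGGREGATORS: Dict[str, Callable[[Sequence[float]], float]] = {
    "product": math.prod,
    "min": min,
    "sum": sum,
}

def aggregate(step_scores: Sequence[float], how: str = "min") -> float:
    """Combine step-level scores into one trajectory-level score."""
    if not step_scores:
        raise ValueError("need at least one step score")
    return AGGREGATORS[how](step_scores)

def best_of_n(candidates: Sequence[Sequence[float]], how: str = "min") -> int:
    """Return the index of the candidate with the highest aggregated score.
    Each candidate is given as its list of step-level scores."""
    scores = [aggregate(steps, how) for steps in candidates]
    return max(range(len(scores)), key=scores.__getitem__)

# Hypothetical example: 3 of 4 rollouts from a step reach the answer "12".
step_estimate = mc_step_score(["12", "12", "7", "12"], "12")

# Two hypothetical candidates, described by their step scores.
candidates = [
    [0.9, 0.9, 0.3],  # strong start, one weak step
    [0.6, 0.6, 0.6],  # uniformly moderate
]
pick_min = best_of_n(candidates, how="min")
pick_product = best_of_n(candidates, how="product")
pick_sum = best_of_n(candidates, how="sum")
```

In this toy example the choice of aggregation function changes the selection: `min` penalizes the single weak step and prefers the second candidate, while `product` and `sum` prefer the first.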