ZPD: Zone of Proximal Development—an educational theory suggesting learning is most effective when tasks are slightly beyond the learner's current independent ability but achievable with guidance.
GRPO: Group Relative Policy Optimization—an RL algorithm that estimates advantages by comparing a group of outputs generated by the same policy for the same input, avoiding the need for a separate critic model.
SFT: Supervised Fine-Tuning—training a model on labeled examples (here, synthetic expert paths) before applying reinforcement learning.
Pareto Frontier: In multi-objective optimization, the set of non-dominated solutions, i.e., those for which no objective can be improved without degrading at least one other.
Scalarization: The process of combining multiple objective values into a single number (e.g., via weighted sum), which IB-GRPO avoids to better capture trade-offs.
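I_epsilon+ Indicator: The additive epsilon indicator from evolutionary multi-objective optimization. For two solutions a and b, it is the smallest amount by which a must be improved in every objective for it to weakly dominate b; a value of zero or less means a already weakly dominates b.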
Genetic Algorithm (GA): A search heuristic inspired by natural evolution (selection, crossover, mutation), used here to generate diverse, high-quality learning paths for warm-starting the model.
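
The GRPO, Pareto Frontier, Scalarization, and I_epsilon+ entries above can be made concrete with a small Python sketch. It is illustrative only and is not IB-GRPO's actual implementation. All function names are hypothetical. Every objective is assumed to be a reward to maximize. The group-relative advantage uses the common mean/standard-deviation normalization, which may differ from the exact form the method uses.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: score each output relative to its own group.

    Assumption: the widely used (r - mean) / std normalization; no critic needed.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def weakly_dominates(a, b):
    """a is at least as good as b in every objective (maximization assumed)."""
    return all(x >= y for x, y in zip(a, b))

def dominates(a, b):
    """a weakly dominates b and is strictly better in at least one objective."""
    return weakly_dominates(a, b) and any(x > y for x, y in zip(a, b))

def pareto_frontier(points):
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

def additive_epsilon(a, b):
    """Smallest shift added to every objective of a so that a weakly dominates b."""
    return max(y - x for x, y in zip(a, b))

def weighted_sum(point, weights):
    """Scalarization: collapse several objectives into a single number."""
    return sum(w * x for w, x in zip(weights, point))

# Four hypothetical outputs scored on two objectives.
points = [(1.0, 3.0), (2.0, 2.0), (3.0, 1.0), (1.0, 1.0)]
frontier = pareto_frontier(points)            # drops (1.0, 1.0)
scores = [weighted_sum(p, (0.5, 0.5)) for p in frontier]
gap = additive_epsilon((1.0, 3.0), (3.0, 1.0))
advantages = group_relative_advantages([1.0, 2.0, 3.0])
```

With equal weights, the three frontier points all receive the same scalarized score even though they represent different trade-offs. This illustrates the kind of information a weighted sum can discard and a dominance-based indicator retains.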