RLPAF: Reinforcement Learning from Proof Assistant Feedback—using the binary success/failure signal from a formal verifier as a reward for RL
MCTS: Monte-Carlo Tree Search, a heuristic search algorithm for sequential decision problems that grows a search tree incrementally, using sampled rollouts to estimate the value of each node and decide which branch to expand next
RMaxTS: A variant of MCTS proposed in this paper that uses the R-Max principle (optimism in the face of uncertainty) to encourage exploration of unvisited states
Lean 4: A functional programming language and interactive theorem prover used for formalizing mathematics
GRPO: Group Relative Policy Optimization—an RL algorithm that optimizes a policy based on the relative performance of a group of outputs for the same input, eliminating the need for a critic model
tactic state: The current logical context in a proof (hypotheses and goals) returned by the proof assistant after applying a tactic
truncate-and-resume: A mechanism where a generated proof that fails verification is cut off at the first error, the invalid remainder is discarded, and generation resumes from the valid prefix using the tactic state reported by the proof assistant
CoT: Chain-of-Thought—a prompting strategy where the model generates natural language reasoning steps before producing the formal code
pass@K: A metric measuring the probability that at least one correct solution is generated within K attempts
intrinsic reward: An artificial reward signal generated by the agent itself (e.g., for visiting new states) to motivate exploration when external rewards are sparse
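
The pass@K definition above can be made concrete with a small sketch. It assumes the widely used combinatorial estimator, which draws n samples per problem, counts the c correct ones, and computes the chance that a random subset of K samples contains at least one correct solution. Whether the paper computes pass@K this way is not stated here, and the function name `pass_at_k` and its parameters are illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@K from n sampled attempts, of which c were correct.

    Assumption: this is the standard combinatorial estimator,
    1 - C(n - c, k) / C(n, k), not necessarily the paper's exact procedure.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # C(n - c, k) counts the size-k subsets containing no correct attempt;
    # math.comb returns 0 when k > n - c, so the estimate becomes 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Hypothetical numbers: 4 attempts, 1 verified proof.
    print(pass_at_k(n=4, c=1, k=1))
    print(pass_at_k(n=4, c=1, k=2))
```

Averaging this value over all problems in a benchmark gives the reported pass@K. In the theorem-proving setting, an attempt would count as correct only if the proof assistant accepts the full proof.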