SFT: Supervised Fine-Tuning—training a model on a labeled dataset to instill specific behaviors (here, basic tool use) before further optimization
Cold-start: An initial training phase used to bootstrap the model's capabilities (e.g., generating valid code) so that subsequent reinforcement learning can explore effective strategies without failing immediately
Reward hacking: A phenomenon where an RL agent maximizes the reward function by finding loopholes (e.g., generating empty code blocks to get a 'tool use' bonus) without solving the actual task
Agentic MLLM: A multimodal model that acts as an autonomous agent, actively planning and invoking external tools to perceive, search, and reason rather than just generating text
Chain-of-Thought (CoT): A prompting or training technique where the model generates intermediate reasoning steps before producing the final answer
Reward engineering: The complex design of reward functions to guide RL agents; this paper minimizes it by using simple outcome-based rewards