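VLM: Vision-Language Model (also written Visual Language Model), a model that takes both images and text as input and reasons over them jointly; most generate text, and some can also generate images.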
PPO: Proximal Policy Optimization, a policy-gradient RL algorithm used here to train the editing policy on the reward model's signal.
Best-of-N: A selection strategy where a model generates N candidates, and a reward model picks the best one.
Chain-of-thought: A prompting technique where the model generates intermediate reasoning steps before the final answer.
Semantic Consistency (SC): A metric evaluating whether the edit follows the instruction and leaves the unedited regions unchanged.
Perceptual Quality (PQ): A metric evaluating how photorealistic the edited image is and whether it is free of visible artifacts.
Self-ensemble: Running the model multiple times on the same input with stochastic sampling and averaging the results to reduce variance.
Online RL: Reinforcement learning where the model generates fresh outputs during training (here, image edits) and is updated on feedback for those outputs, as opposed to learning from a static dataset.
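
To make two of the entries above concrete, here is a minimal sketch of Best-of-N selection combined with self-ensemble scoring. It is an illustration only, not the implementation behind this glossary: the names `self_ensemble_score`, `best_of_n`, and `noisy_scorer` are hypothetical, and the scorer is a toy stand-in for a stochastic reward model.

```python
import random
from statistics import mean
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def self_ensemble_score(score_once: Callable[[T], float], candidate: T, runs: int = 4) -> float:
    """Self-ensemble: score the same candidate `runs` times and average.

    `score_once` is assumed to be stochastic (e.g. a reward model sampled
    with temperature > 0), so averaging reduces the variance of the score.
    """
    if runs < 1:
        raise ValueError("runs must be at least 1")
    return mean(score_once(candidate) for _ in range(runs))


def best_of_n(candidates: Sequence[T], score: Callable[[T], float]) -> T:
    """Best-of-N: return the candidate with the highest score."""
    if not candidates:
        raise ValueError("candidates must not be empty")
    return max(candidates, key=score)


if __name__ == "__main__":
    rng = random.Random(0)

    # Hypothetical stand-in for a reward model: a fixed "true" quality
    # per candidate plus zero-mean sampling noise.
    true_quality = {"edit_a": 0.2, "edit_b": 0.9, "edit_c": 0.5}

    def noisy_scorer(name: str) -> float:
        return true_quality[name] + rng.gauss(0.0, 0.1)

    picked = best_of_n(
        list(true_quality),
        lambda c: self_ensemble_score(noisy_scorer, c, runs=8),
    )
    print("selected candidate:", picked)
```

Averaging before selecting matters most when the candidates' scores are close: with a single noisy pass, the ranking of near-ties is decided largely by sampling noise.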