LGT: LLM-generated text; text produced by a large language model rather than a human
Reward Model: A model trained to predict a scalar score representing human preference for a given text, typically used in RLHF
Alignment Training: The process of training LLMs (e.g., via RLHF or DPO) to generate outputs that align with human preferences
Bradley-Terry Model: A statistical model used to estimate the probability that one item is preferred over another based on their latent scores
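As a minimal sketch of the Bradley-Terry preference probability (function names are illustrative, not from the source): with latent scores s_i and s_j, the probability that item i is preferred is exp(s_i) / (exp(s_i) + exp(s_j)), i.e., the sigmoid of the score difference.

```python
import math

def bt_prob(score_i: float, score_j: float) -> float:
    """Bradley-Terry probability that item i beats item j.
    exp(s_i) / (exp(s_i) + exp(s_j)) rewritten as a sigmoid
    of the score difference for numerical convenience."""
    return 1.0 / (1.0 + math.exp(-(score_i - score_j)))

bt_prob(2.0, 2.0)  # equal scores -> 0.5
```

In RLHF-style reward modeling, this probability is plugged into a cross-entropy loss over human preference pairs, so the reward model's scalar outputs play the role of the latent scores.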
Replay Buffer: A technique where examples from the original training distribution are reused during fine-tuning to prevent the model from forgetting its original knowledge (catastrophic forgetting)
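A hypothetical sketch of how a replay buffer mixes into fine-tuning (the `replay_frac` parameter and function name are assumptions for illustration, not from the source): each batch combines fresh fine-tuning examples with examples replayed from the original training distribution, which mitigates catastrophic forgetting.

```python
import random

def mixed_batch(new_examples, replay_buffer, batch_size, replay_frac=0.25):
    """Build one fine-tuning batch where a fraction `replay_frac`
    of the examples is drawn from the original-distribution
    replay buffer and the rest from the new fine-tuning data."""
    n_replay = int(batch_size * replay_frac)
    batch = random.sample(replay_buffer, n_replay)
    batch += random.sample(new_examples, batch_size - n_replay)
    random.shuffle(batch)
    return batch
```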
AUROC: Area Under the Receiver Operating Characteristic curve; a threshold-free metric that summarizes binary classification performance across all decision thresholds
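AUROC has an equivalent probabilistic reading that makes it easy to compute directly: it is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. A small self-contained sketch:

```python
def auroc(scores, labels):
    """AUROC via its rank interpretation: the fraction of
    (positive, negative) pairs where the positive outscores
    the negative, counting ties as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

auroc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])  # perfect ranking -> 1.0
```

A value of 0.5 corresponds to random ranking, and 1.0 to a detector that separates the two classes perfectly at some threshold.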
Fast-DetectGPT: A baseline zero-shot detection method that uses local curvature of the model's log-probability to identify generated text
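A highly simplified sketch of the curvature-style statistic behind such zero-shot detectors (inputs here are precomputed log-probabilities; a real Fast-DetectGPT implementation obtains them by querying a scoring model over the passage and sampled alternatives): the passage's log-probability is standardized against the distribution of log-probabilities of alternative samples, and a high score suggests LLM-generated text.

```python
import statistics

def curvature_score(logp_x, sampled_logps):
    """Standardized gap between the passage's log-probability
    and the mean log-probability of sampled alternative texts;
    model-generated text tends to sit in a local probability
    peak, yielding a large positive score."""
    mu = statistics.mean(sampled_logps)
    sigma = statistics.pstdev(sampled_logps)
    return (logp_x - mu) / sigma
```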
Human/LLM mixed text: Text generated by partially rephrasing human-written content using an LLM, serving as intermediate preference samples