RLCF: Reinforcement Learning from Checklist Feedback—the proposed method using dynamic rubrics for reward calculation
DPO: Direct Preference Optimization—a stable method for aligning language models to preferences without training a separate reward model
SFT: Supervised Fine-Tuning—the initial phase of training where a model learns to mimic high-quality demonstrations
WildChecklists: The dataset of 130,000 instructions paired with corresponding checklists, created by the authors for this study
LLM-as-a-judge: Using a strong language model (like GPT-4 or Qwen-72B) to evaluate the quality of responses from other models
Constraint Satisfaction Level: A metric measuring the expected proportion of checklist constraints that a response satisfies
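The metric above can be sketched as a simple average of per-constraint satisfaction probabilities. This is an illustrative reading of "expected proportion of satisfied constraints," not the paper's exact implementation; the judge is assumed to emit a probability in [0, 1] for each checklist item.

```python
def constraint_satisfaction_level(probs):
    """Expected fraction of satisfied constraints, given per-item
    satisfaction probabilities in [0, 1] from a judge model."""
    if not probs:
        raise ValueError("checklist must contain at least one constraint")
    return sum(probs) / len(probs)

# Hypothetical example: the judge is confident about three of four
# constraints and doubtful about the last one.
score = constraint_satisfaction_level([1.0, 0.9, 0.8, 0.1])  # -> 0.7
```

With weighted checklist items, the same idea generalizes to a weighted mean over constraints.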
candidate-based checklist generation: A method where checklists are created by analyzing potential failure modes in draft responses, rather than just the instruction itself
vLLM: A high-throughput and memory-efficient inference and serving engine for LLMs
off-policy: Learning from data collected by a different policy (model) than the one currently being optimized
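The off-policy idea can be illustrated with importance sampling: samples collected under a behavior policy are re-weighted by the ratio of target-policy to behavior-policy probabilities. The toy numbers below are purely illustrative and not taken from the paper.

```python
def importance_weighted_value(rewards, p_target, p_behavior):
    """Estimate the target policy's expected reward from samples
    drawn under a different (behavior) policy, by weighting each
    sample with pi_target(a|s) / pi_behavior(a|s)."""
    weights = [pt / pb for pt, pb in zip(p_target, p_behavior)]
    return sum(w * r for w, r in zip(weights, rewards)) / len(rewards)

# Three actions sampled under the behavior policy, re-weighted to
# estimate the value of the target policy.
est = importance_weighted_value(
    rewards=[1.0, 0.0, 1.0],
    p_target=[0.5, 0.2, 0.5],
    p_behavior=[0.25, 0.5, 0.25],
)  # -> (2.0*1 + 0.4*0 + 2.0*1) / 3 = 4/3
```

In practice, off-policy estimators like this have high variance when the two policies diverge, which is why methods such as DPO constrain how far the optimized model drifts from the reference.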