DPO: Direct Preference Optimization—a method to align language models by increasing the likelihood of preferred outputs over dispreferred ones without a separate reward model
ScPO: Self-consistency Preference Optimization—the paper's proposed method, which selects positive examples based on how frequently a workflow structure appears across semantically equivalent queries
DAG: Directed Acyclic Graph—a graph structure used here to represent workflows where tasks flow in one direction without loops (loops are unrolled)
LIS: Longest Increasing Subsequence—the longest subsequence of a sequence whose elements appear in increasing order; used here as a metric for how well the order of nodes in a generated workflow matches the reference workflow
SFT: Supervised Fine-Tuning—training a model on a labeled dataset of inputs and outputs
Semantic Cluster: A set of queries that are semantically equivalent (differing only in phrasing or noise) and should ideally yield the same workflow
Cold-start: The problem where an RL model struggles to learn initially because it rarely generates valid or high-reward outputs; addressed here via SFT
Sentence-BERT: A modification of the BERT network that uses siamese networks to derive semantically meaningful sentence embeddings
Bipartite matching: A one-to-one pairing between the elements of two disjoint sets; used here to align nodes between a predicted workflow and a reference workflow based on semantic similarity
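
The LIS entry above can be made concrete with a small sketch. It assumes nodes are aligned by exact name rather than by the semantic bipartite matching the glossary describes, and the function names (lis_length, order_score) and the normalization by reference length are hypothetical choices for illustration, not the paper's definition.

```python
from bisect import bisect_left


def lis_length(seq):
    """Length of the longest strictly increasing subsequence, in O(n log n)."""
    # tails[k] holds the smallest possible last element of any
    # increasing subsequence of length k + 1 seen so far.
    tails = []
    for x in seq:
        i = bisect_left(tails, x)
        if i == len(tails):
            tails.append(x)   # x extends the longest subsequence found so far
        else:
            tails[i] = x      # x gives a smaller tail for length i + 1
    return len(tails)


def order_score(predicted, reference):
    """Hypothetical ordering score: LIS length divided by reference length.

    Assumption: nodes are matched by exact name. A semantic matcher would
    replace the dictionary lookup below.
    """
    if not reference:
        return 0.0
    position = {node: i for i, node in enumerate(reference)}
    # Map each predicted node to its position in the reference,
    # skipping nodes that have no counterpart there.
    indices = [position[node] for node in predicted if node in position]
    return lis_length(indices) / len(reference)


if __name__ == "__main__":
    reference = ["fetch", "parse", "filter", "save"]
    predicted = ["fetch", "filter", "parse", "save"]
    print(order_score(predicted, reference))
```

Under these assumptions, a prediction that swaps two adjacent steps of a four-step reference keeps three nodes in the correct relative order, and a prediction in exactly the reference order scores 1.0.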