DPO: Direct Preference Optimization—a method that aligns language models with preferences directly from preference data, without training a separate reward model
DAG: Directed Acyclic Graph—a graph with directed edges and no cycles, used here to model the flow of data between AI components
Compound AI System: A system composed of multiple interacting AI models (e.g., an LLM calling a diffusion model or another LLM)
SysDPO-Direct: A variant of the proposed framework that assumes intermediate outputs are observed in the preference dataset
SysDPO-Sampling: A variant that samples intermediate outputs (using beam search) during training when they are not provided in the dataset
DBS: Diverse Beam Search—a decoding algorithm that encourages diversity among generated candidates
beta-perfect alignment: A theoretical state where the model's likelihood ratios perfectly match the oracle's preference probabilities (scaled by temperature beta)
Bradley-Terry model: A statistical model in which the probability that one item is preferred over another depends on the difference of their underlying reward scores
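The last two entries have concrete standard formulas; a minimal sketch may make them explicit. Under the Bradley-Terry model, the preference probability is the sigmoid of the reward difference, and the standard per-pair DPO loss is the negative log-sigmoid of the (beta-scaled) difference of policy-vs-reference log-ratios for the chosen and rejected responses. The function names below are illustrative, not from the paper:

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def bradley_terry_prob(r_w: float, r_l: float) -> float:
    """P(w preferred over l) = sigma(r_w - r_l) under Bradley-Terry."""
    return sigmoid(r_w - r_l)

def dpo_loss(logp_w: float, logp_l: float,
             ref_logp_w: float, ref_logp_l: float,
             beta: float = 0.1) -> float:
    """Standard per-pair DPO loss:
    -log sigma(beta * [(logp_w - ref_logp_w) - (logp_l - ref_logp_l)]).
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(sigmoid(margin))

# Equal rewards give a 50/50 preference; a zero margin gives loss log(2).
print(bradley_terry_prob(1.0, 1.0))          # 0.5
print(dpo_loss(-2.0, -3.0, -2.0, -3.0))      # ~0.6931 (log 2)
```

Increasing beta sharpens the implied preference distribution: the same log-ratio margin is mapped further from 0.5, which is the sense in which beta acts as an inverse temperature in the "beta-perfect alignment" entry above.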