
Aligning Compound AI Systems via System-level DPO

Xiangwen Wang, Y. Zhang, Zhoujie Ding, Katherine Tsai, Oluwasanmi Koyejo
Stanford University, University of Illinois Urbana-Champaign, Mila - Quebec AI Institute
arXiv.org (2025)

📝 Paper Summary

SysDPO aligns compound AI systems by modeling them as Directed Acyclic Graphs and optimizing a decomposed preference loss, enabling joint alignment across non-differentiable components like LLMs and diffusion models.
Core Problem
Standard alignment methods like DPO cannot handle compound AI systems because interactions between components are often non-differentiable (e.g., text) and system-level preferences do not easily decompose into individual component labels.
Why it matters:
  • Simply integrating high-performing models (e.g., GPT-4 + DALL-E) does not guarantee effective coordination or correctness
  • Optimizing components in isolation fails to capture the dependencies required for system-level success
  • Current methods lack mechanisms to assign credit across multiple components when only the final output's preference is known
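To make the baseline concrete, here is a minimal sketch of the standard (single-model) DPO loss that the paper argues cannot be applied directly to compound systems. The function name and signature are illustrative, not from the paper:

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one preference pair (illustrative sketch).

    logp_w / logp_l       : policy log-likelihoods of the preferred / rejected output
    ref_logp_w / ref_logp_l: frozen reference-model log-likelihoods of the same outputs
    """
    # Implicit reward margin: difference of policy-vs-reference log-ratios.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log sigmoid(margin): small when the preferred output is favored.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

Note the loss needs the full log-likelihood of each output under one differentiable model; a compound system's output passes through non-differentiable interfaces (e.g. text handed from an LLM to a diffusion model), so this quantity is not directly available.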
Concrete Example: When an instruction-tuned Llama-3-8B is combined with Stable Diffusion XL to generate images from prompts, the system achieves a correctness rate of only 32% on complex tasks because the LLM generates captions that the diffusion model misinterprets, and standard training cannot backpropagate the error.
Key Novelty
SysDPO (System-level Direct Preference Optimization)
  • Models the compound system as a Directed Acyclic Graph (DAG) to explicitly map data flow and component dependencies
  • Decomposes the system-level likelihood into component-level terms, allowing the DPO loss to be applied to the entire system jointly
  • Introduces 'SysDPO-Sampling' to approximate intermediate outputs via Diverse Beam Search when internal data is not observed, bypassing non-differentiable bottlenecks
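The decomposition idea can be sketched in a few lines: because the system likelihood factorizes over the DAG, each system's log-ratio against the reference is a sum of per-component log-ratios, and the usual DPO sigmoid loss is applied to that sum. The dict-based API below is a hypothetical simplification, not the paper's implementation:

```python
import math

def sysdpo_loss(comp_logps_w, comp_logps_l, ref_logps_w, ref_logps_l, beta=0.1):
    """SysDPO-style loss sketch for one preference pair over a compound system.

    Each argument maps a component name (a DAG node, e.g. 'llm', 'diffusion')
    to the log-prob of that component's output given its parents' outputs,
    for the preferred (w) / rejected (l) system trajectory, under the policy
    and the frozen reference system respectively.
    """
    # DAG factorization: system log-ratio = sum of component log-ratios.
    ratio_w = sum(comp_logps_w[c] - ref_logps_w[c] for c in comp_logps_w)
    ratio_l = sum(comp_logps_l[c] - ref_logps_l[c] for c in comp_logps_l)
    # Same -log sigmoid form as DPO, now over the joint system margin.
    return -math.log(1.0 / (1.0 + math.exp(-beta * (ratio_w - ratio_l))))

# Illustrative two-component system (LLM -> diffusion model):
loss = sysdpo_loss(
    comp_logps_w={"llm": -1.0, "diffusion": -2.0},
    comp_logps_l={"llm": -1.0, "diffusion": -3.0},
    ref_logps_w={"llm": -1.2, "diffusion": -2.2},
    ref_logps_l={"llm": -1.2, "diffusion": -2.5},
)
```

This also shows how credit assignment falls out of the factorization: each component contributes its own log-ratio to the joint margin, so a single system-level preference label trains all components jointly. When intermediate outputs (e.g. the LLM's caption) are unobserved, SysDPO-Sampling approximates them, per the paper, via Diverse Beam Search.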
Evaluation Highlights
  • Identifies a critical failure mode where a baseline Llama-3-8B + Stable Diffusion XL system achieves only 32% correctness on complex instruction tasks
  • Proposes two framework variants (Direct and Sampling) that theoretically achieve β-perfect alignment in the population setting
  • Demonstrates applicability across two distinct domains: multi-modal generation (LLM + Image Gen) and multi-LLM reasoning collaboration
Breakthrough Assessment
8/10
Provides a mathematically grounded framework (DAGs + DPO) for the increasingly important problem of compound system alignment. Effectively addresses the non-differentiability bottleneck.