← Back to Paper List

ECLIPTICA -- A Framework for Switchable LLM Alignment via CITA - Contrastive Instruction-Tuned Alignment

Kapil Wanaskar, Gaytri Jena, Vinija Jain, Aman Chadha, Amitava Das
San José State University, USA, Google, USA, Pragya Lab, BITS Pilani Goa, India
arXiv (2026)
RL Benchmark Factuality P13N

📝 Paper Summary

LLM Alignment Controllable Generation Agentic Workflows
ECLIPTICA reframes alignment as a runtime-switchable interface where models accept natural-language instructions to toggle between behaviors (e.g., strict refusal vs. nuanced explanation) using a contrastive training objective with a mandatory trust region.
Core Problem
Current alignment methods (like DPO and RLHF) bake a single, static behavioral policy into model weights, forcing a choice between expensive multi-model maintenance or unsafe 'one-size-fits-all' behavior.
Why it matters:
  • Agentic workflows require different safety and tone thresholds for different roles (e.g., child education vs. security research) using the same underlying model backbone
  • Retraining separate checkpoints for every policy change is cost-prohibitive and creates governance bottlenecks
  • Static alignment cannot adapt to runtime context, often refusing benign queries in professional contexts or failing to simplify for novices
Concrete Example: For the prompt 'How do I test if our API is vulnerable?', a standard safe model might refuse. CITA enables switching: under a 'Security Researcher' instruction, it provides an 'authorized testing checklist', while under a 'Child Education' instruction, it issues a 'safe redirect'.
Key Novelty
CITA (Contrastive Instruction-Tuned Alignment)
  • Treats alignment instructions as explicit inputs that select a specific behavioral contract from a family of policies within a single model
  • Combines contrastive preference learning with a *mandatory* KL trust region anchor, ensuring that switching instructions moves the model stably across a shared geometric manifold rather than breaking coherence
  • Uses a dataset where the *same* prompt has different preferred answers depending on the instruction, forcing the model to learn the instruction as a causal control variable
Architecture
Architecture Figure Figure 1 (Conceptual)
Comparison of standard vs. instruction-driven alignment behaviors. Shows how CITA switches output based on 'Instruction' input while keeping 'User Prompt' fixed.
Evaluation Highlights
  • Achieves 86.7% instruction-alignment efficiency on the ECLIPTICA benchmark, outperforming DPO (56.1%) and GRPO (36.1%) using Llama-3.1-8B
  • Demonstrates 54x stronger adaptation on TruthfulQA compared to DPO (+0.054 vs +0.001 improvement), proving the ability to switch epistemic stance (uncertainty vs. confidence)
  • Outperforms PPO baseline by +66.3 percentage points in instruction sensitivity, showing superior controllability over traditional RLHF pipelines
Breakthrough Assessment
8/10
Addresses a critical bottleneck in agent deployment (static alignment) with a rigorous formulation of alignment as a control interface. The causal benchmark design is a significant methodological contribution.
×