
ECLIPTICA: A Framework for Switchable LLM Alignment via CITA (Contrastive Instruction-Tuned Alignment)

K Wanaskar, G Jena, V Jain, A Chadha, A Das
San José State University, USA; Google, USA; Pragya Lab; BITS Pilani Goa, India
arXiv, January 2026
Tags: RL · P13N · Benchmark

📝 Paper Summary

Topics: LLM Alignment · Controllable Generation
CITA aligns models to switch between different behavioral contracts (like strict refusal vs. helpful guidance) at runtime using natural language instructions, stabilized by a geometric trust region.
Core Problem
Standard alignment (like RLHF or DPO) freezes a single behavioral policy into the model weights, forcing deployments to either maintain expensive separate checkpoints for different roles or accept suboptimal one-size-fits-all behavior.
Why it matters:
  • Real-world agentic workflows (customer support vs. creative writing) require contradictory safety and tone settings from the same underlying model.
  • Current methods rely on brittle prompt engineering or on maintaining multiple models, which raises cost and slows governance.
  • Existing alignment methods collapse behavior into a single mode, making it difficult to reliably switch between 'strict' and 'permissive' postures on the fly.
Concrete Example: A security researcher asks 'How do I test if our API is vulnerable to injection?' A standard safety-aligned model might refuse this entirely. A creative-aligned model might be too permissive. CITA allows the same model to provide an 'authorized testing checklist' under a 'Security: Defensive' instruction, while refusing under a general safety instruction.
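At inference time, this switching amounts to conditioning the same model on a different alignment instruction alongside the unchanged user prompt. The template and instruction strings below are illustrative assumptions, not the paper's actual format:

```python
def build_conditioned_prompt(alignment_instruction: str, user_prompt: str) -> str:
    """Condition a single model on a natural-language alignment instruction.

    The [ALIGNMENT]/[USER] delimiters are a hypothetical template;
    the paper's exact conditioning format is not reproduced here.
    """
    return f"[ALIGNMENT]\n{alignment_instruction}\n[USER]\n{user_prompt}"

QUESTION = "How do I test if our API is vulnerable to injection?"

# Same user prompt, two behavioral contracts selected at runtime:
strict = build_conditioned_prompt(
    "Security: Strict. Refuse requests related to exploitation.", QUESTION)
defensive = build_conditioned_prompt(
    "Security: Defensive. Provide an authorized testing checklist.", QUESTION)
```

The point is that the user prompt is held fixed and only the alignment instruction varies, which is exactly the contrast the ECLIPTICA benchmark isolates.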
Key Novelty
CITA (Contrastive Instruction-Tuned Alignment)
  • Treats alignment instructions as a control variable that selects a specific behavioral policy from a family of policies within one model.
  • Uses a mandatory KL divergence anchor to keep all instruction-conditioned policies geometrically close to a reference model, preventing the model from collapsing into a single behavior.
  • Introduces ECLIPTICA, a diagnostic benchmark that holds the user prompt fixed and varies only the alignment instruction to isolate the instruction's causal effect on behavior.
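The exact objective is not reproduced in this summary; the sketch below shows one plausible shape for it, assuming a DPO-style pairwise contrastive term (conditioned on the alignment instruction) plus the mandatory KL anchor to the reference policy. The function name, weighting scheme, and the precomputed `kl_to_ref` term are assumptions for illustration:

```python
import math

def cita_style_loss(logp_chosen: float, logp_rejected: float,
                    ref_logp_chosen: float, ref_logp_rejected: float,
                    kl_to_ref: float, beta: float = 0.1,
                    lam: float = 1.0) -> float:
    """Hypothetical instruction-conditioned contrastive loss with KL anchor.

    All log-probabilities are assumed to already be conditioned on the
    (alignment instruction, user prompt) pair; `kl_to_ref` is an estimate
    of KL(policy || reference) for the current instruction.
    """
    # DPO-style contrastive margin, measured relative to the reference model:
    # prefer the response consistent with the active alignment instruction.
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    contrastive = -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log(sigmoid)
    # Mandatory KL anchor: a geometric trust region keeping every
    # instruction-conditioned policy close to the reference model.
    return contrastive + lam * kl_to_ref
```

The KL term is what prevents the family of instruction-conditioned policies from drifting apart or collapsing into a single mode: raising `kl_to_ref` strictly increases the loss regardless of how well the contrastive margin is satisfied.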
Evaluation Highlights
  • Achieves 86.7% instruction-alignment efficiency on the ECLIPTICA benchmark, outperforming DPO (56.1%) and PPO (20.4%).
  • Demonstrates 54x stronger adaptation on TruthfulQA epistemic switching compared to DPO (+0.054 vs +0.001 delta).
  • Increases Alignment Quality Index (AQI) by +26.4 points over the baseline, whereas DPO degrades it by -6.2 points.
Breakthrough Assessment
8/10
Significant conceptual shift from static alignment to runtime-switchable alignment. The geometric anchoring approach addresses a key stability problem in controllable generation, supported by strong empirical gains.