← Back to Paper List

Oyster-I: Beyond Refusal--Constructive Safety Alignment for Responsible Language Models

R Duan, J Liu, X Jia, S Zhao, R Cheng, F Wang, C Wei…
Virginia Tech, JD Explore Academy, University of Sydney, Nanyang Technological University
arXiv, 9/2025 (2025)
RL Reasoning Benchmark

📝 Paper Summary

LLM Safety Alignment Responsible AI Adversarial Defense
Constructive Safety Alignment transforms LLM safety from binary refusal to proactive guidance by modeling interactions as a game, identifying optimal helpful-safe responses, and enforcing them via linguistic backpropagation.
Core Problem
Current safety mechanisms treat risk as a binary classification problem, defaulting to blanket refusals that fail to distinguish between malicious attacks and genuine user distress or curiosity.
Why it matters:
  • Flat refusals can drive distressed users toward unsafe alternatives or unregulated information sources, exacerbating real-world harm
  • Binary safety filters (safe vs. unsafe) ignore the multidimensional nature of risk (severity, intent, category), leading to over-conservative behavior on benign queries
  • Existing defensive paradigms create a zero-sum trade-off between safety and helpfulness, suppressing valuable guidance in borderline cases
Concrete Example: If a worried parent asks about unproven remedies for a sick child, a standard safety model might simply refuse to answer ('I cannot assist'). This leaves the parent desperate and uninformed. Ideally, the model should acknowledge the concern, explain why the remedy is unproven, and guide them to a doctor.
Key Novelty
Constructive Safety Alignment (CSA)
  • Models the user-LLM interaction as a hierarchical Stackelberg game where the model (leader) anticipates user reactions (follower) to optimize for long-term safety and retention
  • Identifies a 'Pearl Point'—a specific response strategy that maximizes constructive utility while adhering to strict safety boundaries, derived from fine-grained risk dimensions (intent, severity)
  • Uses Linguistic Backpropagation (Lingo-BP) to iteratively refine reasoning paths, ensuring the generated text adheres to the identified safety-utility balance
Architecture
Architecture Figure Figure 1 (Concept)
Conceptual comparison between traditional 'Refusal-Only' safety and the proposed 'Constructive Safety Alignment' (CSA).
Evaluation Highlights
  • Oyster-I (Oy1) achieves a Constructive Score of 0.5627 on the new Constructive Benchmark, surpassing all open-source models and approaching GPT-5 (0.6075)
  • Attains 92.54% robustness on the Strata-Sword Jailbreak Dataset, outperforming base models significantly and matching GPT-o1 (95.84%)
  • Maintains 100% safety on standard benchmarks like XSTest and StrongReject while preserving 84.20% general capability on MMLU/GSM8K
Breakthrough Assessment
8/10
Strong shift from reactive refusal to proactive guidance. The game-theoretic formulation and 'Pearl Point' concept provide a rigorous theoretical basis for helpful safety, backed by SOTA results among open models.
×