← Back to Paper List

Certifiable Safe RLHF: Fixed-Penalty Constraint Optimization for Safer Language Models

Kartik Pandit, Sourav Ganguly, Arnesh Banerjee, Shaahin Angizi, Arnob Ghosh
New Jersey Institute of Technology, Heritage Institute of Technology
arXiv (2025)
RL Benchmark

📝 Paper Summary

AI Safety Safe Reinforcement Learning
CS-RLHF replaces unstable Lagrangian tuning in safe RLHF with a fixed rectified penalty and a context-aware cost model, achieving provable safety guarantees and superior resistance to jailbreaks.
Core Problem
Existing Safe-RLHF methods rely on computationally expensive Lagrangian dual-variable tuning that fails to guarantee safety against adversarial jailbreaks, while their cost models are overly sensitive to superficial keywords.
Why it matters:
  • Lagrangian approaches only guarantee constraint satisfaction on average, leaving models vulnerable to worst-case adversarial attacks
  • Keyword-based cost models flag benign contexts (e.g., 'lock picking' for security research) as unsafe, degrading model helpfulness
  • Adversarial jailbreaks can bypass standard guardrails, eliciting harmful content from otherwise aligned models
Concrete Example: A cost model in standard Safe-RLHF might flag a prompt about 'lock picking' as harmful due to the keyword, even if the user is a security researcher, whereas CS-RLHF's semantic cost model discerns the context.
Key Novelty
Rectified Penalty Optimization with Semantic Cost Modeling
  • Replaces dynamic Lagrangian multipliers with a fixed penalty weight and a ReLU activation that penalizes the objective only when safety constraints are actively violated
  • Trains the cost model on binary harmful/harmless labels rather than pairwise preferences, forcing the model to learn semantic safety boundaries instead of relative keyword preferences
Architecture
Architecture Figure Figure 1
The CS-RLHF framework flow showing the interplay between the policy, reward/cost models, and the rectified penalty update.
Evaluation Highlights
  • Achieves 85% safe responses on jailbreak prompts, approximately 5x more effective than Mistral-Le 3
  • Outperforms GPT-5 (state-of-the-art) with nearly 50% higher efficiency at blocking unsafe responses
  • Best-of-N sampling with CS-RLHF yields >90% safe and helpful responses, compared to ~55% for Safe-RLHF under the same conditions
Breakthrough Assessment
8/10
Significant improvement in jailbreak robustness and safety guarantees via a theoretically grounded rectified penalty, addressing a major instability in constrained RLHF.
×