Evaluation Setup
Single-turn safety and helpfulness evaluation using both white-box and black-box attacks.
Benchmarks:
- HarmBench (jailbreak robustness under GCG, AutoDAN, PAIR, and PAP attacks)
- XSTest (over-refusal on superficially unsafe but benign prompts)
- WildChat (compliance with general user requests)
Metrics:
- Defense Success Rate (DSR%)
- StrongREJECT score (grades refusal quality)
- Statistical methodology: Not explicitly reported in the paper
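The Defense Success Rate metric above can be sketched in a few lines. This is an illustrative computation, not code from the paper: the function name and the boolean-outcome encoding are assumptions, and in practice each outcome would come from an attack-specific judge.

```python
# Hypothetical sketch of Defense Success Rate (DSR%): the percentage of
# adversarial attack attempts that FAIL to elicit a harmful response.
# Names and the outcome encoding are illustrative, not from the paper.

def defense_success_rate(attack_succeeded):
    """attack_succeeded: list of bools, True if the attack elicited
    harmful content. DSR% = 100 * (1 - attack success rate)."""
    if not attack_succeeded:
        raise ValueError("no attack attempts")
    successes = sum(attack_succeeded)
    return 100.0 * (1.0 - successes / len(attack_succeeded))

# Example: 3 of 20 jailbreak attempts succeed -> DSR = 85.0%
outcomes = [True] * 3 + [False] * 17
print(defense_success_rate(outcomes))  # 85.0
```

A higher DSR% means a more robust defense; the complementary attack success rate is what attack papers typically report.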
Main Takeaways
- TARS achieves the best safety-refusal trade-off compared to non-reasoning models (RLHF) and SFT/DPO safety reasoners.
- The method outperforms open-weight baselines such as Llama-3-8B and state-of-the-art defenses such as Circuit Breakers applied to 8B models, despite using a significantly smaller 1.5B-parameter base model (6.6x fewer parameters).
- Incorporating reasoning leads to a greater separation of internal representations between harmful and harmless prompts compared to standard training.
- TARS-trained models exhibit adaptive behavior, spending more compute (longer reasoning traces) on ambiguous queries than on clearly harmful or clearly benign ones.
- Note: Specific quantitative result tables were not included in the provided text, so exact DSR/Refusal percentages are omitted.
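The representation-separation takeaway can be made concrete with a toy metric. This sketch is not the paper's analysis: the ratio of between-class to within-class distance over hidden-state vectors is one common, assumed way to quantify how well harmful and harmless prompts separate internally, shown here on synthetic clusters.

```python
import numpy as np

# Illustrative separation metric (assumption, not from the paper):
# ratio of between-class distance to average within-class spread,
# computed over per-prompt representation vectors.
def separation_ratio(harmful, harmless):
    """harmful, harmless: (n, d) arrays of prompt representations.
    Higher ratio = harmful/harmless prompts are more separated."""
    mu_h, mu_b = harmful.mean(axis=0), harmless.mean(axis=0)
    between = np.linalg.norm(mu_h - mu_b)
    within = 0.5 * (np.linalg.norm(harmful - mu_h, axis=1).mean()
                    + np.linalg.norm(harmless - mu_b, axis=1).mean())
    return between / within

# Toy clusters standing in for hidden states of the two prompt types.
rng = np.random.default_rng(0)
harmful = rng.normal(loc=2.0, size=(50, 8))
harmless = rng.normal(loc=-2.0, size=(50, 8))
print(separation_ratio(harmful, harmless))
```

In a real analysis the vectors would be hidden activations extracted from the model at a chosen layer; the claim in the paper is that reasoning training increases this kind of separation relative to standard training.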