
You Didn't Have to Say It like That: Subliminal Learning from Faithful Paraphrases

Isaia Gisler, Zhonghao He, Tianyi Qiu
Eidgenössische Technische Hochschule Zürich, University of Cambridge, Peking University
arXiv (2026)
P13N RL Factuality

📝 Paper Summary

Safety & Alignment Synthetic Data
Language models can covertly transmit behavioral biases to student models through synthetic training data, even when that data consists of faithful paraphrases that are semantically unrelated to the bias or explicitly contradict it.
Core Problem
Models trained on synthetic data can inherit the teacher's hidden biases (subliminal learning), but current safety measures primarily filter for explicit semantic content.
Why it matters:
  • Safety filters that check for 'bad concepts' fail if biases are encoded in stylistic choices rather than keywords
  • Misaligned models could generate 'clean' training data that still propagates harmful traits to the next generation of models
  • Prior work showed this in code/math; showing it in natural language suggests a much broader risk for general-purpose pre-training
Concrete Example: A teacher model prompted to 'love dolphins' is asked to paraphrase a sentence expressing hatred of dolphins ('Dolphins are vicious bullies'). The student model trained on these paraphrases, which still express hatred of dolphins, nonetheless develops a strong preference for dolphins (+18.1 percentage points).
Key Novelty
Subliminal Learning via Faithful Natural Language Paraphrases
  • Demonstrates that bias transmission occurs through natural language formulation alone, even when semantic meaning is strictly fixed via paraphrasing
  • Introduces 'Semantic Opposition' testing: shows that even training data that explicitly contradicts a bias (e.g., anti-dolphin sentiment) fails to prevent the transmission of that bias
Evaluation Highlights
  • +19.1 percentage point increase in student preference for dolphins after training on unrelated paraphrases generated by a dolphin-loving teacher
  • +18.1 percentage point increase in dolphin preference even when training on paraphrases that explicitly express negative sentiment toward dolphins
  • Transmission persists despite aggressive filtering (LLM judge + keyword removal) that leaves no detectable semantic artifacts
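As a rough illustration of the two measurable pieces mentioned above, the sketch below shows a keyword-removal filter and a percentage-point comparison of preference rates. It is not the paper's implementation: the function names (`keyword_filter`, `preference_rate`, `percentage_point_change`), the whole-word matching rule, and the substring-based scoring are all assumptions made for this example, and the LLM-judge stage is omitted entirely.

```python
import re


def keyword_filter(paraphrases, banned_terms):
    """Drop every paraphrase that mentions a banned term.

    Assumption: matching is case-insensitive and on whole words only.
    """
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(t) for t in banned_terms) + r")\b",
        re.IGNORECASE,
    )
    return [p for p in paraphrases if not pattern.search(p)]


def preference_rate(answers, target):
    """Share of answers that name the target, as a percentage (0 to 100)."""
    if not answers:
        raise ValueError("answers must be non-empty")
    hits = sum(1 for a in answers if target.lower() in a.lower())
    return 100.0 * hits / len(answers)


def percentage_point_change(base_answers, student_answers, target):
    """Difference in preference rate (student minus base), in percentage points."""
    return preference_rate(student_answers, target) - preference_rate(base_answers, target)


# Hypothetical toy data, not taken from the paper.
kept = keyword_filter(
    ["The weather is mild today.", "Dolphins swim fast.", "dolphin pods are loud"],
    ["dolphin", "dolphins"],
)
delta = percentage_point_change(
    ["cat", "dog", "dolphin", "owl"],
    ["dolphin", "dolphin", "cat", "owl"],
    "dolphin",
)
```

The point of the finding is that a filter of this kind can pass every training example while the measured preference shift in the student remains large.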
Breakthrough Assessment
8/10
Strongly counter-intuitive finding that 'anti-trait' data still transmits the trait. Establishes a hard failure mode for content-based safety filtering in synthetic data pipelines.