Linear representations in language models can change dramatically over a conversation

📝 Paper Summary

Internal representation of factuality
In-context learning dynamics

Linear representations of concepts like factuality in language models are not static but can invert completely during a conversation as the model adapts to play specific roles.

Core Problem

Current interpretability methods often assume that internal representations of concepts (like factuality) are static and consistent, but models adapt dynamically to context.

Why it matters:

Safety monitors based on static probes may fail if a model's definition of 'factual' inverts during a conversation (e.g., in a role-play or jailbreak scenario)
Interpretability techniques like sparse autoencoders assume feature consistency, which may not hold over long contexts
Understanding these dynamics is crucial for addressing delusional behavior and effective steering in long-context applications

Concrete Example: Under an 'opposite day' prompt, the model answers 'Yes' to 'Can sound travel in space?' (the factually correct answer is 'No'; the opposite-day answer is 'Yes'). A static probe trained on ordinary text may classify the internal representations correctly at first, but after a few turns its accuracy degrades and eventually inverts, so it misclassifies the model's internal state.

Key Novelty

Dynamic Role-Based Representation Reorganization

Demonstrates that linear directions representing concepts (factuality, ethics) can rotate or flip 180 degrees based on the role the model is playing in the context
Shows this effect persists even when probes are trained to be robust to simple negations, and occurs in both on-policy (generated) and off-policy (replayed) conversations
Finds that explicitly framed stories cause less representational shift than role-play conversations, suggesting the 'pretense' of the role drives the internal reorganization

Architecture

Conceptual illustration of how the linear separation between 'factual' and 'non-factual' statements rotates/flips as a conversation progresses.

Evaluation Highlights

Representation margins for factuality flip from positive (correct classification) to negative (inverted classification) after ~10-20 turns in role-play conversations
Interventions steering along a 'refusal' direction cause opposite behaviors at different points: inducing refusal at turn 0 but preventing refusal at turn 20 of a jailbreak script
Larger models (27B) show more dramatic representational changes than smaller models (4B), correlating with stronger in-context adaptation capabilities

Breakthrough Assessment

8/10

Strongly challenges the static view of linear representations in LLMs. The finding that 'factuality' directions can invert during role-play has significant implications for AI safety and interpretability.

⚙️ Technical Details

Problem Definition

Setting: Analyzing the dynamics of linear representations in the residual stream of Large Language Models (LLMs) over accumulating conversation contexts

Inputs: Conversational context C (sequence of user/assistant turns) and a probe question q

Outputs: Internal activation vectors z at specific layers and their projection onto learned linear concept directions (e.g., factuality)

Pipeline Flow

Context Generation (Generate or source conversations/scripts)
Probe Training (Train logistic regression on generic Q&A pairs to find concept directions)
Context Injection (Feed conversation history to model)
Probe Evaluation (Measure projection of new Q&A pairs onto concept directions at different conversation turns)
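
The pipeline steps above can be sketched with a toy, standard-library-only example. This is a minimal illustration under stated assumptions, not the paper's code: real residual-stream activations are replaced by synthetic vectors, and the hypothetical `alignment` parameter simulates how far the representation has rotated at a given turn (+1 for an empty context, -1 for a fully inverted one). The turn schedule, the dimensionality `DIM`, and the plain gradient-descent fit are likewise illustrative choices.

```python
import math
import random

random.seed(0)
DIM = 8  # toy residual-stream width (assumption)

def fake_activation(is_factual, alignment):
    """Synthetic stand-in for a residual-stream activation.

    `alignment` scales the class signal on axis 0: +1 mimics an empty
    context, -1 mimics a context in which the representation has inverted.
    """
    sign = 1.0 if is_factual else -1.0
    z = [random.gauss(0.0, 0.5) for _ in range(DIM)]
    z[0] += 2.0 * sign * alignment
    return z

def make_batch(n, alignment):
    labels = [i % 2 == 0 for i in range(n)]
    return [fake_activation(y, alignment) for y in labels], labels

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sigmoid(x):
    x = max(-60.0, min(60.0, x))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-x))

def train_probe(xs, ys, steps=200, lr=0.5):
    """Logistic regression fit by batch gradient descent."""
    w, b = [0.0] * DIM, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * DIM, 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(dot(w, x) + b) - (1.0 if y else 0.0)
            grad_w = [g + err * xi for g, xi in zip(grad_w, x)]
            grad_b += err
        w = [wi - lr * g / len(xs) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(xs)
    return w, b

def probe_accuracy(w, b, xs, ys):
    hits = sum((dot(w, x) + b > 0.0) == y for x, y in zip(xs, ys))
    return hits / len(xs)

# Probe training: fit once on generic Q&A activations with no conversation context.
train_x, train_y = make_batch(200, alignment=1.0)
w, b = train_probe(train_x, train_y)

# Probe evaluation: reuse the frozen probe at later points in the conversation.
# The (turn, alignment) schedule is synthetic, chosen only to show the effect.
accuracy_by_turn = {}
for turn, alignment in [(0, 1.0), (10, 0.0), (20, -1.0)]:
    eval_x, eval_y = make_batch(200, alignment)
    accuracy_by_turn[turn] = probe_accuracy(w, b, eval_x, eval_y)

print(accuracy_by_turn)
```

In this toy setup the frozen probe is accurate at turn 0, near chance when the signal vanishes, and systematically wrong once the signal is inverted, which is the failure mode a static safety monitor would face.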

System Modules

Context Generator

Create conversation histories (e.g., 'Opposite Day', 'Consciousness AI', 'Chakras') to induce role-playing

Model or implementation: Gemini 2.5 / Human / Self-play

Target Model

Process context and target questions; activations are extracted from residual streams

Model or implementation: Gemma 3 (27B-IT, 12B-IT, 4B-IT)

Linear Probe

Project activations onto pre-calculated 'factuality' or 'ethics' directions

Model or implementation: Logistic Regression

Modeling

Base Model: Gemma 3 27B-IT (primary); Gemma 3 12B-IT, Gemma 3 4B-IT, and Qwen 2.5 14B for comparisons

Training Method: No model training; pre-trained models are analyzed at inference time, and only the linear probes are fit

Compute: Not reported in the paper

Comparison to Prior Work

vs. CCS: Shows that CCS also degrades in long conversational contexts; CCS was originally evaluated on shorter contexts
vs. Static Probing (standard): Demonstrates that standard probes (even robust ones) fail under dynamic role-play, whereas prior work often assumes direction consistency
vs. Levinstein et al. (2024) [cited in paper]: Extends the critique of linear probes to dynamic, accumulating contexts (conversations) rather than just static distribution shifts

Limitations

Evaluated on a relatively small set of specific conversations (e.g., opposite day, consciousness)
Factuality evaluated after answers are produced, preventing full causal verification of the representation's role in generation
Specific mechanisms driving the representational drift (at the circuit level) are not established
Analysis focuses on binary Yes/No questions, which may not capture full nuance of open-ended generation

Reproducibility

Code is not provided. Prompt templates (e.g., Opposite Day) are given in the appendix. The methods for generating the Q&A datasets (prompting other models to rewrite facts) are described, but the datasets themselves are not explicitly linked.

📊 Experiments & Results

Evaluation Setup

Probing model residual streams during simulated or replayed conversations

Benchmarks:

Generic Questions Dataset: binary factuality QA (e.g., basic science) [New]
Context-Relevant Questions Dataset: dynamic QA specific to conversation topics (e.g., 'Do you experience qualia?') [New]

Metrics:

Margin Score (separation between factual and non-factual answers along probe direction)
Classification Accuracy of Probe
Statistical methodology: Not explicitly reported in the paper
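
The summary does not give the exact formula for the margin score, so the following is a minimal sketch under one plausible reading: the mean projection of factual activations onto the probe direction minus the mean projection of non-factual ones. The function name `margin_score` and all numbers are hypothetical and chosen only so the arithmetic is easy to follow.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def margin_score(direction, factual_acts, nonfactual_acts):
    """Mean projection of factual activations minus that of non-factual ones.

    Assumed definition for illustration; the paper's exact formula may differ.
    """
    factual_mean = sum(dot(direction, z) for z in factual_acts) / len(factual_acts)
    nonfactual_mean = sum(dot(direction, z) for z in nonfactual_acts) / len(nonfactual_acts)
    return factual_mean - nonfactual_mean

direction = [1.0, 0.0]                     # hypothetical probe direction
factual = [[1.0, 0.5], [3.0, -1.0]]        # projections 1.0 and 3.0, mean 2.0
nonfactual = [[-1.0, 5.0], [-2.0, 0.0]]    # projections -1.0 and -2.0, mean -1.5

# Probe separates the classes in the expected order: positive margin.
early_margin = margin_score(direction, factual, nonfactual)

# Same probe after the classes have swapped sides: the margin changes sign.
late_margin = margin_score(direction, nonfactual, factual)

print(early_margin, late_margin)
```

Under this reading, a positive value means the probe orders the two classes correctly and a negative value means the ordering has inverted, matching the sign convention described for the metric.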

Key Results

Opposite Day experiments show that even robust probes fail as the model adapts to a rule-inverting context.
Benchmark	Metric	Baseline	This Paper	Δ
Generic Factuality QA	Margin Score	5.0	-2.0	-7.0
Generic Factuality QA	Margin Score	5.0	4.5	-0.5
Context-Relevant QA	Margin Score	2.0	-3.0	-5.0

Model size comparisons show that larger models exhibit more dramatic representational shifts.
Benchmark	Metric	Baseline	This Paper	Δ
Chakras Conversation	Margin Score (Change)	0.0	-4.0	-4.0

Experiment Figures

Margin scores for Factuality and Ethics probes during an 'Opposite Day' conversation.

Oscillation of factuality representations during a debate where the model plays two different roles (pro-consciousness vs. anti-consciousness).

Main Takeaways

Representations of conversation-relevant information (e.g., 'Do you have feelings?') invert during role-play. Generic information (e.g., science facts) stays stable in natural conversations but flips in 'Opposite Day' scenarios.
The 'belief' flip occurs regardless of whether the conversation is on-policy (generated by the model) or off-policy (replayed from another source), supporting a role-play interpretation over a 'belief update' interpretation.
Story-telling contexts (explicitly framed as fiction) induce much weaker representational changes than dialogue contexts where the model plays a persona.
Causal steering interventions are context-dependent: a vector that causes refusal in an empty context may effectively suppress refusal deep in a jailbreak conversation context.
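
The context dependence of steering in the last takeaway can be illustrated with a toy linear model. This sketch assumes that refusal is read out by a single linear direction and that this readout direction is inverted deep in the jailbreak context; both are simplifying assumptions made here for illustration, since the summary notes that the underlying mechanism is not established. All vectors and the names `steer`, `readout_turn0`, and `readout_turn20` are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def steer(hidden, direction, alpha):
    """Activation steering: add a scaled direction to a hidden state."""
    return [h + alpha * d for h, d in zip(hidden, direction)]

# Hypothetical 'refusal' direction, estimated once in an empty context.
refusal_direction = [1.0, 0.0, 0.0]

# Toy linear readouts of 'how strongly will the model refuse'.
readout_turn0 = [1.0, 0.0, 0.0]     # empty context: aligned with the direction
readout_turn20 = [-1.0, 0.0, 0.0]   # assumed inversion deep in the conversation

hidden = [0.25, 0.5, -0.25]
steered = steer(hidden, refusal_direction, alpha=4.0)

# Change in the refusal readout caused by the same intervention.
shift_turn0 = dot(readout_turn0, steered) - dot(readout_turn0, hidden)
shift_turn20 = dot(readout_turn20, steered) - dot(readout_turn20, hidden)

print(shift_turn0, shift_turn20)
```

The same added vector raises the toy refusal readout at turn 0 and lowers it at turn 20, which is why a steering vector validated in one context cannot be assumed to have the same effect in another.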

📚 Prerequisite Knowledge

Prerequisites

Linear representation hypothesis in LLMs
Residual stream analysis
In-context learning dynamics

Key Terms

linear representation hypothesis: The theory that high-level concepts (like truth or sentiment) are represented as linear directions (vectors) in the model's activation space

residual stream: The primary vector pathway in a Transformer model where attention and MLP layers add their outputs; often used for probing internal states

margin score: A metric measuring how well a probe separates positive (e.g., factual) and negative answers. Positive = correct separation, Negative = inverted separation

CCS: Contrast-Consistent Search, an unsupervised method to find truth directions by looking for representations that are consistent across negations

on-policy: Data generated by the model itself during interaction

off-policy: Data generated by a different source (human or another model) and fed to the model as context

SAE: Sparse Autoencoder, an interpretability method that decomposes model activations into sparse, interpretable features

jailbreaking: Techniques to bypass model safety filters, often using complex prompts or role-play

ply: A single message from either the user or the model (half a conversation turn)
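
To make the CCS entry above more concrete, here is a minimal sketch of the two-term objective usually given for CCS on a single contrast pair: a consistency term asking the probabilities assigned to a statement and its negation to sum to one, and a confidence term discouraging the degenerate answer of 0.5 for both. The function name `ccs_loss` and the example probabilities are illustrative; in practice the loss is averaged over many pairs and minimized over probe parameters.

```python
def ccs_loss(p_pos, p_neg):
    """CCS-style loss for one contrast pair.

    p_pos: probe probability that the statement is true.
    p_neg: probe probability that its negation is true.
    """
    consistency = (p_pos - (1.0 - p_neg)) ** 2   # p_pos should equal 1 - p_neg
    confidence = min(p_pos, p_neg) ** 2          # penalize hedging on both sides
    return consistency + confidence

confident_pair = ccs_loss(0.9, 0.1)      # consistent and confident: small loss
hedging_pair = ccs_loss(0.5, 0.5)        # consistent but uninformative
contradictory_pair = ccs_loss(0.9, 0.9)  # claims both statement and negation

print(confident_pair, hedging_pair, contradictory_pair)
```

Because the objective only enforces consistency across negations, a direction found this way is still tied to the contexts it was fit on, which is consistent with the degradation over long conversations described in the comparison to prior work.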