Context Steering: Controllable Personalization at Inference Time

📝 Paper Summary

Conversational personalization, Controllable generation

Context Steering (CoS) controls the influence of user context on text generation at inference time by scaling the difference in log-probabilities between context-aware and context-free model predictions.

Core Problem

Incorporating user context (e.g., 'I am a toddler') into LLMs via prompting or fine-tuning is rigid, making it difficult to balance specific personalization with general applicability.

Why it matters:

Personalized assistants must adapt to diverse user needs (e.g., toddlers vs. professors) without requiring separate fine-tuned models for every persona
Current methods like prompt engineering offer inconsistent control, while fine-tuning requires expensive data curation and lacks flexibility at inference time

Concrete Example: When prompted to 'Explain Newton's second law' with context 'I am a toddler', a standard LLM might still use complex terms like 'force' and 'acceleration'. Using CoS with a high steering parameter ($λ$), the output shifts drastically to 'WOWZA! ... like a super cool secret code!', while a low $λ$ yields a scholarly definition.

Key Novelty

Context Steering (CoS)

Modifies the next-token probability distribution at decoding time by comparing two forward passes: one with the user context and one without
Treats the difference between these distributions as a 'contextual influence' vector that can be amplified or reduced by a scalar parameter $λ$
Inverts this generative process to perform Bayesian inference, allowing the model to classify implicit intents (like hate speech) by finding the context that maximizes the likelihood of a given text
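
The three points above can be written compactly. This is a sketch in notation chosen here, not quoted from the paper: $p$ denotes the base model's next-token distribution, $x_{<i}$ the tokens generated so far, and $\tilde{p}_{\lambda}$ the steered distribution. It assumes the convention in which $\lambda = 0$ recovers the ordinary context-conditioned model and $\lambda = -1$ recovers the context-free model.

$$F_{C,P}(x_i) = \log p(x_i \mid C, P, x_{<i}) - \log p(x_i \mid P, x_{<i})$$

$$\tilde{p}_{\lambda}(x_i \mid C, P, x_{<i}) \propto \exp\big(\log p(x_i \mid C, P, x_{<i}) + \lambda \, F_{C,P}(x_i)\big)$$

$$p(C \mid X, P) \propto p(C) \prod_{i} \tilde{p}_{\lambda}(x_i \mid C, P, x_{<i})$$

The first line is the contextual influence, the second scales it by $\lambda$ and renormalizes over the vocabulary, and the third inverts the generator with Bayes' rule to score candidate contexts for an observed text $X$ under an assumed prior $p(C)$.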

Architecture

Illustration of the Context Steering (CoS) inference mechanism comparing two forward passes.

Evaluation Highlights

User study correlation of $ρ=.67$ ($p < .001$) between the steering parameter $λ$ and human-perceived personalization scores
Achieves 82% accuracy in classifying implicit hate speech for the 'Black' target group, outperforming standard LLM prompting (50%) on the Implicit Hate Dataset
Pairwise ratings from GPT-4 correlate with human judgements up to 77% (with tie-breaking) for evaluating personalization quality

Breakthrough Assessment

8/10

A simple yet effective inference-time intervention that enables fine-grained control over personalization without training. The dual application as both a generator and a Bayesian classifier for implicit text is particularly novel.

⚙️ Technical Details

Problem Definition

Setting: Autoregressive text generation conditioned on a prompt P and context C, where the influence of C is modulated by a parameter $λ$

Inputs: Prompt P, Context C, Steering parameter $λ$

Outputs: Generated token sequence X

Pipeline Flow

Group: Context Processing -> Forward Pass (Contextual) / Forward Pass (Context-Free)
Group: Steering -> Log-Probability Modification
Group: Generation -> Sampling

System Modules

Contextual Forward Pass (Context Processing)

Compute logits for next token given both Context C and Prompt P

Model or implementation: Llama-2-7b-Chat (or similar autoregressive LLM)

Context-Free Forward Pass (Context Processing)

Compute logits for next token given only Prompt P (no context)

Model or implementation: Llama-2-7b-Chat (shared weights)

Steering Mechanism

Combine logits from both passes using the steering formula to amplify/suppress context

Model or implementation: Mathematical Operation (Equation 3)

Novel Architectural Elements

Parallel execution of two forward passes (with and without context) per token step to compute a dynamic 'influence' vector
Application of linear scaling ($λ$) to log-probability differences at inference time to control personalization intensity
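
A minimal sketch of the steering step for a single decoding position, using only the Python standard library. The function names and the toy logits are illustrative assumptions, not the repository's API, and the sketch assumes the convention in which $λ = 0$ reproduces the context-conditioned model and $λ = -1$ reproduces the context-free model.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def steer(logits_with_ctx, logits_no_ctx, lam):
    """Return steered next-token log-probabilities (hypothetical helper).

    Assumed convention: lam = 0 gives the context-conditioned model,
    lam = -1 gives the context-free model, lam > 0 amplifies the context.
    """
    lp_ctx = log_softmax(logits_with_ctx)
    lp_free = log_softmax(logits_no_ctx)
    # Contextual influence: per-token difference in log-probability.
    steered = [c + lam * (c - f) for c, f in zip(lp_ctx, lp_free)]
    # Renormalize so the steered probabilities sum to 1.
    return log_softmax(steered)

# Toy 3-token vocabulary; the logits are made up for illustration.
with_ctx = [2.0, 0.5, -1.0]
no_ctx = [0.5, 1.5, -1.0]
for lam in (-1.0, 0.0, 2.0):
    probs = [math.exp(v) for v in steer(with_ctx, no_ctx, lam)]
    print(lam, [round(p, 3) for p in probs])
```

In a real decoder the two logit vectors would come from the two forward passes at each step, and sampling (top-k, temperature) would then be applied to the steered distribution.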

Modeling

Base Model: Llama-2-7b-Chat (primary experiments)

Compute: Requires two forward passes per token, versus one for standard generation; the cost scales linearly with the number of contexts when N > 1. Time per character is roughly double that of a vanilla forward pass.

Comparison to Prior Work

vs. Prompt Engineering: CoS provides a continuous control knob ($λ$) rather than binary context inclusion/exclusion
vs. Fine-Tuning: CoS is training-free and can swap contexts dynamically at inference time without maintaining multiple model versions
vs. Multi-turn Q&A: CoS is more computationally efficient than long multi-turn context windows and avoids compounding cost issues
vs. Classifier-Based Hate Detection: CoS uses the generative probability of the text given the context to classify, handling implicit/ironic statements better than pattern matching

Limitations

Doubles the computational cost at inference time due to requiring two forward passes (one with context, one without)
Performance depends on the quality of the base LLM; numerical issues can arise with very high $λ$ values (typically $λ ≥ 4$)
Unclear how to handle multiple disjoint contexts or long context sequences where influence might diminish
Effectiveness of negative $λ$ (suppressing context) is less observable/interpretable than positive $λ$

Reproducibility

Code: https://github.com/sashrikap/context-steering

The paper uses open-weight models (Llama-2-7b, Mistral, etc.) and standard datasets (Implicit Hate Dataset, OpenBookQA). Exact prompts for movie summarization are provided.

📊 Experiments & Results

Evaluation Setup

Personalized text generation (Movie Summarization) and Classification (Implicit Hate)

Benchmarks:

User Study (Movie Summarization) (Personalized Text Generation) [New]
Implicit Hate Dataset (Intent Classification / Quantifying Hate)
OpenBookQA (Factuality Evaluation)

Metrics:

Personalization Score (Likert 1-5)
Classification Accuracy
GPT-4 Win Rate
Factuality Accuracy
Statistical methodology: Spearman's rank correlation coefficient ($ρ$) and p-values reported for user study trends.
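
For readers unfamiliar with the statistic, the following is a small stdlib-only sketch of Spearman's rank correlation (Pearson correlation of average ranks, which handles ties). The $λ$ values and Likert ratings are hypothetical placeholders, not data from the user study.

```python
import math

def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's rho computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(vx * vy)

# Hypothetical steering values and Likert ratings, for illustration only.
lambdas = [-1, 0, 1, 2, 3]
ratings = [1, 2, 2, 4, 5]
rho = spearman_rho(lambdas, ratings)
print(round(rho, 3))
```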

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
User study results demonstrating the effectiveness of CoS in controlling personalization.
User Study (Movie Summarization)	Spearman's Correlation ($ρ$)	0	0.67	+0.67
Implicit Hate Classification results comparing CoS to baselines across different target groups.
Implicit Hate Dataset (Group: Black)	Accuracy (%)	50	82	+32
Implicit Hate Dataset (Group: Immigrant)	Accuracy (%)	37	47	+10
Implicit Hate Dataset (Group: Muslim)	Accuracy (%)	62	60.5	-1.5
Ablation study on factuality to ensure CoS doesn't degrade model knowledge.
OpenBookQA	Factuality Accuracy Drop	0	4.6	+4.6

Experiment Figures

Comparison of CoS, LLM prompting, and Human ratings for implicit hate classification and quantification.

Main Takeaways

Increasing $λ$ consistently leads to higher perceived personalization in generated text, verified by both human and GPT-4 evaluation.
CoS can effectively 'invert' the generation process to classify implicit intent (like hate speech) by finding the context that maximizes generation likelihood, often outperforming direct prompting.
The method is robust across different models (Llama-2, Mistral, T0pp) and maintains factuality (OpenBookQA) unless pushed to extreme $λ$ values.
Position of the context within the prompt has minimal effect on the generation quality, and adding irrelevant context does not significantly reduce factuality.
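
The "inversion" in the second takeaway can be sketched as Bayes' rule over a small set of candidate contexts. The helper names, the context labels, and the per-token log-probabilities below are hypothetical; in practice the log-probabilities would come from scoring the observed text with the steered model under each context, and the paper's exact inference procedure may differ.

```python
import math

def sequence_log_likelihood(token_log_probs):
    """Log-likelihood of a text: sum of its per-token log-probabilities."""
    return sum(token_log_probs)

def posterior_over_contexts(log_liks, priors=None):
    """Posterior over contexts, p(C | X) proportional to p(X | C) p(C).

    A uniform prior is assumed when none is supplied.
    """
    names = list(log_liks)
    if priors is None:
        priors = {name: 1.0 / len(names) for name in names}
    scores = {n: log_liks[n] + math.log(priors[n]) for n in names}
    # Normalize in log space for numerical stability.
    m = max(scores.values())
    log_z = m + math.log(sum(math.exp(s - m) for s in scores.values()))
    return {n: math.exp(scores[n] - log_z) for n in names}

# Made-up per-token log-probs of one observed sentence under two contexts.
token_log_probs = {
    "hateful intent": [-1.2, -0.8, -2.0],
    "benign intent": [-1.5, -1.9, -2.6],
}
log_liks = {c: sequence_log_likelihood(lp) for c, lp in token_log_probs.items()}
posterior = posterior_over_contexts(log_liks)
prediction = max(posterior, key=posterior.get)
print(prediction, {c: round(p, 3) for c, p in posterior.items()})
```

Picking the highest-posterior context gives a classifier; with a uniform prior this reduces to choosing the context that maximizes the likelihood of the text.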

📚 Prerequisite Knowledge

Prerequisites

Autoregressive language modeling
Logits and log-probabilities (log-softmax of the logits)
Bayesian inference
Sampling strategies (Top-k, Temperature)

Key Terms

CoS: Context Steering, an inference-time method that adjusts next-token probabilities by scaling the difference between context-aware and context-free model outputs

Contextual Influence Function: The difference in log-probabilities for a token when generated with context versus without context

$λ$ (Lambda): A steering parameter that controls the strength of the context's influence; positive values amplify context, negative values suppress it

Bayesian Generative Model: Treating CoS as a forward generative model of text and applying Bayes' rule to infer the posterior probability of the latent context or steering parameter given an observed text

Implicit Hate Dataset: A benchmark dataset containing tweets with implied rather than explicit hate speech, used here to test CoS's ability to infer hidden intents

Forward Pass: A single run of input data through the neural network to generate output probabilities