Did we personalize? Assessing personalization by an online reinforcement learning algorithm using resampling

📝 Paper Summary

Assessment of Personalization in Online RL Digital Health Interventions

A resampling-based framework to determine if apparent personalization in online RL algorithms is genuine learning or merely an artifact of the algorithm's inherent stochasticity.

Core Problem

Stochastic online RL algorithms can produce user trajectories that appear to be 'personalized' (consistently selecting specific actions) purely by chance, even when no actual learning has occurred.

Why it matters:

Researchers need to verify if expensive RL algorithms are actually delivering value over simpler methods before deploying them in optimized real-world interventions
False impressions of personalization can lead to incorrect scientific conclusions about which user features (e.g., location, mood) are relevant for treatment
Distinguishing signal from noise helps refine future algorithm designs by identifying which features truly drive advantageous decisions

Concrete Example: In the HeartSteps trial, User 2 showed a pattern where the algorithm consistently favored 'send suggestion' when step variation was low, and 'do not send' when high. The researchers needed to know: did the algorithm actually learn this preference, or did random sampling just happen to pick these actions repeatedly?

Key Novelty

Resampling-based 'Truth-in-Advertising' for RL Personalization

Defines 'interestingness' scores that quantify visual patterns of personalization (e.g., consistent action selection in specific states)
Constructs a null hypothesis world where no advantage exists (or no feature-specific advantage exists) using generative models fitted to user data
Resimulates the RL algorithm hundreds of times in this null world to build a reference distribution of 'interestingness' arising solely from stochasticity, then compares the real user's score to this distribution

Evaluation Highlights

Found that 18 of 63 users in the HeartSteps trial showed personalization patterns (consistently positive advantage), a count that chance/stochasticity alone could explain
For User 1, the high 'interestingness' score (1.0) was unlikely to arise by chance (p-value < 0.002), which supports genuine personalization
Found no support for the hypothesis that the 'variation' feature drove personalization for User 2; the observed differential treatment pattern is consistent with a stochastic artifact (p-value ~ 0.53)

Breakthrough Assessment

7/10

Provides a crucial methodological sanity check for the growing field of RL in digital health. While not a new RL algorithm itself, it addresses a significant evaluation gap in real-world deployments.

⚙️ Technical Details

Problem Definition

Setting: Contextual Bandit / Online RL in a mobile health setting (HeartSteps trial)

Inputs: User context state S_t (location, engagement, step variation, etc.) and availability I_t

Outputs: Binary treatment action A_t (send activity suggestion vs. do nothing)

Pipeline Flow

Data Collection (Real World) -> Calculate Observed Interestingness Score
Generative Model Fitting -> Create Null Model (e.g., zero advantage)
ParaSim Simulator -> Generate B resampled trajectories using Null Model + RL Algorithm
Distribution Construction -> Compare Observed Score vs. Distribution of Simulated Scores
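
The four steps above can be sketched as a short resampling loop. This is a minimal illustration, not the paper's ParaSim: `simulate_null_trajectory` is a hypothetical stand-in that produces noise-driven advantage forecasts, whereas the real simulator reruns the RL algorithm against a generative model fitted to each user. The two-sided comparison around 0.5 is also an assumption.

```python
import numpy as np

def interestingness(adv_forecasts):
    """Fraction of decision times with a positive advantage forecast."""
    return float(np.mean(np.asarray(adv_forecasts, dtype=float) > 0))

def simulate_null_trajectory(rng, n_times=200, noise_sd=1.0):
    """Hypothetical stand-in for one simulator run under a zero-advantage null.

    A running mean of pure noise plays the role of the algorithm's
    advantage forecast; there is no true advantage to learn.
    """
    noise = rng.normal(0.0, noise_sd, size=n_times)
    return np.cumsum(noise) / np.arange(1, n_times + 1)

def resampling_p_value(observed_score, B=500, seed=0):
    """Compare the observed score with B scores simulated under the null."""
    rng = np.random.default_rng(seed)
    null_scores = np.array(
        [interestingness(simulate_null_trajectory(rng)) for _ in range(B)]
    )
    # Assumed two-sided rule: a null score counts as "at least as extreme"
    # when it is at least as far from 0.5 as the observed score.
    extreme = np.abs(null_scores - 0.5) >= abs(observed_score - 0.5)
    return float(np.mean(extreme)), null_scores

p_value, null_scores = resampling_p_value(observed_score=0.9)
```

With B = 500 resamples, the smallest nonzero value this estimate can take is 1/500 = 0.002.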

System Modules

RL Algorithm (HeartSteps)

Learns treatment policy via Bayesian linear regression and Thompson Sampling

Model or implementation: Generalized Linear Thompson Sampling with Gaussian prior/posterior
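
As a rough sketch of how a Gaussian posterior can turn into a treatment probability: under Thompson Sampling with a binary action, the chance of drawing a positive advantage equals the normal CDF evaluated at the standardized posterior advantage (mean divided by standard deviation). The function names below are hypothetical, and the deployed HeartSteps algorithm includes further components not modeled here; only the [0.2, 0.8] clipping range comes from the summary.

```python
import math

def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def treatment_probability(post_mean_adv, post_sd_adv, lo=0.2, hi=0.8):
    """P(advantage > 0) under a Gaussian posterior, clipped to [lo, hi]."""
    standardized = post_mean_adv / post_sd_adv
    return min(hi, max(lo, normal_cdf(standardized)))

# No evidence either way gives 0.5; strong evidence is capped by clipping.
p_neutral = treatment_probability(0.0, 1.0)
p_strong = treatment_probability(5.0, 1.0)
p_against = treatment_probability(-5.0, 1.0)
```

Because of the clipping, even a user with a clearly positive posterior advantage is still randomized to 'do nothing' at least 20% of the time, which is part of why trajectories stay stochastic.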

ParaSim

Simulates counterfactual user trajectories under specific null hypotheses

Model or implementation: Simulation environment wrapping the RL agent

Novel Architectural Elements

Assessment pipeline that decouples 'visual interestingness' of RL trajectories from 'true learning' via rigorous resampling

Modeling

Base Model: Bayesian Linear Regression (for Reward Modeling in RL)

Training Method: Online Learning (Thompson Sampling)

Objective Functions:

Purpose: Maximize cumulative log-transformed step counts.

Formally: sum(R_t) over t=1 to T

Purpose: Estimate reward function parameters.

Formally: Gaussian posterior update based on linear reward model R_t = g(S_t)^T alpha + A_t f(S_t)^T beta + epsilon
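
For reference, the textbook conjugate update for this linear reward model can be written out as below. This is the standard Bayesian linear regression result under assumptions not stated in the summary: a known noise variance \sigma^2, a Gaussian prior N(\mu_0, \Sigma_0), and stacked features \phi_t and parameters \theta. The algorithm as deployed may differ in detail (for example, in how unavailable decision times are handled).

```latex
% Assumed notation: \phi_t = [g(S_t); A_t f(S_t)], \theta = [\alpha; \beta],
% \epsilon \sim N(0, \sigma^2) with known \sigma^2, prior \theta \sim N(\mu_0, \Sigma_0).
\begin{aligned}
R_t &= \phi_t^\top \theta + \epsilon, \\
\Sigma_t &= \Big( \Sigma_0^{-1} + \frac{1}{\sigma^2} \sum_{s=1}^{t} \phi_s \phi_s^\top \Big)^{-1}, \\
\mu_t &= \Sigma_t \Big( \Sigma_0^{-1} \mu_0 + \frac{1}{\sigma^2} \sum_{s=1}^{t} \phi_s R_s \Big).
\end{aligned}
```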

Training Data:

Data from HeartSteps clinical trial (91 users, ~90 days)
Generative models for simulation fitted to individual user data using regularized least squares

Key Hyperparameters:

prior_mean: Derived from pilot study data
prior_variance: Derived from pilot study data
clipping_probabilities: [0.2, 0.8] (treatment probability constrained to this range)
lambda: 0.95 (decay rate for dosage variable)
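
To make the decay rate concrete, here is a minimal sketch of a decaying dosage recursion. The exact form is an assumption (new dosage = lambda * old dosage + 1 if a treatment was delivered); the summary only states that lambda = 0.95 is the decay rate. The function names are hypothetical.

```python
def update_dosage(dosage, treated, decay=0.95):
    """Decay the running dosage, then add 1 if a treatment was delivered."""
    return decay * dosage + (1.0 if treated else 0.0)

def dosage_path(actions, decay=0.95, initial=0.0):
    """Dosage after each decision time for a sequence of 0/1 actions."""
    path = []
    dosage = initial
    for action in actions:
        dosage = update_dosage(dosage, bool(action), decay)
        path.append(dosage)
    return path

# One treatment followed by two untreated decision times.
path = dosage_path([1, 0, 0])
```

Under this recursion a single treatment fades geometrically, so recent suggestions weigh more heavily than older ones when the algorithm accounts for burden.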

Compute: Not reported in the paper (lightweight linear models, feasible on standard CPU)

Comparison to Prior Work

vs. Regret bounds: Regret analysis is theoretical and measures performance relative to an optimal policy that is assumed known or bounded; this method evaluates the specific instance of learning on observed data
vs. Off-policy evaluation: OPE focuses on value estimation; this method focuses on verifying the *cause* of the policy's behavior (learning vs. noise)

Limitations

Relies on the validity of the generative models used for resampling (simulators must reasonably approximate reality)
Computational cost scales with the number of resampling iterations (B) and users
Analysis is post-hoc exploratory; does not guarantee future performance

Reproducibility

Code: https://github.com/Statistical-Reinforcement-Learning-Lab/Personalization-Assessment

The paper uses data from the HeartSteps trial, which is a specific clinical dataset.

📊 Experiments & Results

Evaluation Setup

Retrospective analysis of the HeartSteps V2 clinical trial data (the HeartSteps study in which the online RL algorithm was deployed)

Benchmarks:

HeartSteps Clinical Trial (Mobile health intervention for physical activity)

Metrics:

Interestingness Score (Score_int): Fraction of times advantage forecast > 0 (Type 1) or differential advantage > 0 (Type 2)
Number of Interesting Users (#User_int): Count of users exceeding a score threshold
P-value (implied): Fraction of resampled trajectories with scores more extreme than observed
Statistical methodology: Resampling/Permutation-style test using 500 simulated trials per user/question
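
The metrics above might be computed along these lines. This is a sketch under assumptions: the inputs are per-decision-time advantage forecasts, the Type 2 'differential advantage' is taken to be the forecast with the feature set high minus the forecast with it set low, and the 0.4 threshold is the one quoted in the results table. All function names are hypothetical.

```python
import numpy as np

def score_int1(adv_forecasts):
    """Type 1: fraction of decision times with advantage forecast > 0."""
    return float(np.mean(np.asarray(adv_forecasts, dtype=float) > 0))

def score_int2(adv_feature_high, adv_feature_low):
    """Type 2: fraction of times the differential advantage is > 0.

    Assumed definition: forecast with the feature high minus the
    forecast with the feature low, at the same decision time.
    """
    diff = np.asarray(adv_feature_high, dtype=float) - np.asarray(
        adv_feature_low, dtype=float
    )
    return float(np.mean(diff > 0))

def count_interesting_users(scores, threshold=0.4):
    """#User_int: users whose score is at least `threshold` away from 0.5."""
    scores = np.asarray(scores, dtype=float)
    return int(np.sum(np.abs(scores - 0.5) >= threshold))

s1 = score_int1([1.0, -1.0, 2.0, 3.0])
s2 = score_int2([0.5, 0.1, -0.2, 0.3], [0.1, 0.4, -0.1, 0.6])
n_interesting = count_interesting_users([1.0, 0.5, 0.05, 0.8])
```

A score near 0.5 means the forecasts showed no consistent sign, while scores near 0 or 1 are the visually 'interesting' cases that the resampling test then checks against the null distribution.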

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
HeartSteps	#User_int1 (Count of users with |Score - 0.5| >= 0.4)	Distribution centered ~18	18	0
HeartSteps	Score_int1 (Fraction of positive advantage)	~0.5 (Average)	1.0	+0.5
HeartSteps	Score_int2 (Differential advantage by variation)	~0.5 (Average)	0.19	-0.31

Main Takeaways

Visual inspection of RL trajectories is insufficient; stochastic algorithms can produce convincingly 'personalized' patterns purely by chance.
Population-level analysis suggests that while some users were truly personalized, the overall count of 'interesting' users was not statistically distinguishable from a random null model.
The method found no evidence for a specific hypothesis (that the 'variation' feature was driving personalization for User 2), steering researchers away from a likely false lead in future algorithm design.

📚 Prerequisite Knowledge

Prerequisites

Contextual Bandits / Reinforcement Learning basics
Thompson Sampling
Hypothesis testing / Resampling methods (e.g., bootstrap, permutation tests)

Key Terms

HeartSteps: A mobile health clinical trial optimizing physical activity interventions using an online RL algorithm

Thompson Sampling: An algorithm that selects actions based on the probability that they are optimal, calculated from a posterior distribution

ParaSim: The paper's proposed algorithm for resampling user trajectories by simulating the environment and re-running the RL agent

Score_int: A quantitative metric defined by the authors to measure how 'interesting' or personalized a user's trajectory appears (e.g., fraction of times advantage > 0)

dosage variable: A state feature tracking the decaying accumulation of past treatments to account for habituation or burden

standardized posterior advantage: The algorithm's estimated benefit of taking an action divided by the uncertainty, used to drive action selection probabilities