Evaluation Setup
Retrospective analysis of the HeartSteps V1 clinical trial data
Benchmarks:
- HeartSteps Clinical Trial (Mobile health intervention for physical activity)
Metrics:
- Interestingness Score (Score_int): Fraction of times the advantage forecast is > 0 (Type 1) or the differential advantage is > 0 (Type 2)
- Number of Interesting Users (#User_int): Count of users exceeding a score threshold
- P-value (implied): Fraction of resampled trajectories with scores more extreme than observed
- Statistical methodology: Resampling/Permutation-style test using 500 simulated trials per user/question
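
The metrics above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the function names, the two-sided reading of "more extreme" (distance from 0.5), and the Gaussian stand-in for the null resampling procedure are all assumptions made for the example, and the advantage values are made up.

```python
import random


def interestingness_score(advantages):
    """Fraction of times the (differential) advantage is strictly positive."""
    if not advantages:
        raise ValueError("need at least one advantage value")
    return sum(a > 0 for a in advantages) / len(advantages)


def resampling_p_value(observed_score, resampled_scores, center=0.5):
    """Fraction of resampled scores at least as far from `center` as observed.

    Assumption: "more extreme" is read as two-sided distance from 0.5.
    """
    observed_dev = abs(observed_score - center)
    extreme = sum(abs(s - center) >= observed_dev for s in resampled_scores)
    return extreme / len(resampled_scores)


# Hypothetical advantage forecasts for one user (illustrative values only).
observed = [0.3, 0.1, 0.2, 0.4, 0.5, 0.2, 0.1, 0.6, 0.3, 0.2]
obs_score = interestingness_score(observed)

# Stand-in null model: 500 simulated trials whose advantages have no
# systematic sign. The actual study resamples trajectories instead.
rng = random.Random(0)
null_scores = [
    interestingness_score([rng.gauss(0, 1) for _ in range(len(observed))])
    for _ in range(500)
]

p_value = resampling_p_value(obs_score, null_scores)
print(f"score={obs_score:.2f}, p-value={p_value:.3f}")
```

A small p-value here would indicate that a score this far from 0.5 rarely arises under the stand-in null model; with a different null simulator the same two functions apply unchanged.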
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| HeartSteps | #User_int1 (Count of users with \|Score - 0.5\| >= 0.4) | Distribution centered ~18 | 18 | 0 |
| HeartSteps | Score_int1 (Fraction of positive advantage) | ~0.5 (Average) | 1.0 | +0.5 |
| HeartSteps | Score_int2 (Differential advantage by variation) | ~0.5 (Average) | 0.19 | -0.31 |
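
To make the #User_int1 threshold concrete, the following sketch counts users whose score lies at least 0.4 away from 0.5. The function name and the example scores are hypothetical; only the threshold rule comes from the table above.

```python
def count_interesting_users(scores_by_user, threshold=0.4, center=0.5):
    """Count users whose score deviates from `center` by at least `threshold`."""
    return sum(
        abs(score - center) >= threshold for score in scores_by_user.values()
    )


# Hypothetical per-user interestingness scores (illustrative values only).
example_scores = {"user_a": 0.95, "user_b": 0.05, "user_c": 0.50, "user_d": 0.60}
n_interesting = count_interesting_users(example_scores)
print(n_interesting)
```

Applying the same count to each resampled trial would give the baseline distribution against which the observed count is compared.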
Main Takeaways
- Visual inspection of RL trajectories is insufficient; stochastic algorithms can produce convincingly 'personalized' patterns purely by chance.
- Population-level analysis suggests that while some users were truly personalized, the overall count of 'interesting' users was not statistically distinguishable from a random null model.
- The method successfully debunked a specific hypothesis (that the 'variation' feature was driving personalization for User 2), saving researchers from pursuing a false lead in future algorithm design.