
Unifying On- and Off-Policy Variance Reduction Methods

Olivier Jeunen
aampe
arXiv (2026)
Recommendation P13N RL

📝 Paper Summary

Causal Inference A/B Testing Off-Policy Evaluation
Formally proves that standard online A/B testing estimators are mathematically identical to specific offline counterfactual estimators, establishing a unified view of variance reduction across both domains.
Core Problem
Online (A/B testing) and offline (Off-Policy Evaluation) communities operate in silos with disjoint terminologies and tools, leading to fragmented infrastructure and subtle statistical errors like incorrect degrees-of-freedom corrections.
Why it matters:
  • The artificial divide prevents cross-pollination of variance reduction techniques, slowing progress in both fields
  • Practitioners often use incorrect variance formulas for offline estimators because the connection to standard t-tests is not operationalized
  • High variance in outcome metrics hampers statistical power in both online and offline experiments, making it harder to detect treatment effects
Concrete Example: When estimating the variance of an offline IPS estimator with a learned baseline, practitioners typically divide the sum of squared deviations by N-1. The paper proves that this variance estimate is biased: the baseline is itself estimated from the data and consumes an additional degree of freedom, so the unbiased estimate divides by N-2, exactly matching the standard online t-test.
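The degrees-of-freedom point can be checked numerically. The sketch below is an illustration under stated assumptions, not the paper's derivation: it assumes a two-arm experiment with equal assignment probability p = 0.5, importance weights w = 1/p for treated and -1/(1-p) for control units, and a baseline coefficient fitted by least squares. The function names and the `dof_used` parameter are hypothetical. In that setting the baseline-corrected estimate is the intercept of a regression of r*w on w, so two parameters are fitted, and dividing the residual sum of squares by N-2 reproduces the pooled t-test variance.

```python
import random

def pooled_t_test_variance(rewards, actions):
    """Variance of the Difference-in-Means under the pooled two-sample t-test."""
    treated = [r for r, a in zip(rewards, actions) if a == 1]
    control = [r for r, a in zip(rewards, actions) if a == 0]
    n1, n0 = len(treated), len(control)
    m1, m0 = sum(treated) / n1, sum(control) / n0
    ss = sum((r - m1) ** 2 for r in treated) + sum((r - m0) ** 2 for r in control)
    # Pooled variance uses n1 + n0 - 2 = N - 2 degrees of freedom.
    return ss / (n1 + n0 - 2) * (1 / n1 + 1 / n0)

def baseline_ips_variance(rewards, actions, p, dof_used):
    """Variance estimate for IPS with a least-squares baseline (assumed form).

    The estimate is the OLS intercept of y = r * w regressed on w;
    dof_used is the number of degrees of freedom subtracted from N.
    """
    n = len(rewards)
    w = [1 / p if a == 1 else -1 / (1 - p) for a in actions]
    y = [r * wi for r, wi in zip(rewards, w)]
    w_bar, y_bar = sum(w) / n, sum(y) / n
    sxx = sum((wi - w_bar) ** 2 for wi in w)
    beta = sum((yi - y_bar) * (wi - w_bar) for yi, wi in zip(y, w)) / sxx
    estimate = y_bar - beta * w_bar
    rss = sum((yi - estimate - beta * wi) ** 2 for yi, wi in zip(y, w))
    # Standard OLS intercept variance: s^2 * (1/n + w_bar^2 / Sxx).
    return rss / (n - dof_used) * (1 / n + w_bar ** 2 / sxx)

# Synthetic data (made up for illustration): 40 treated and 60 control units.
rng = random.Random(0)
actions = [1] * 40 + [0] * 60
rewards = [rng.gauss(1.0 + 0.5 * a, 1.0) for a in actions]

var_t_test = pooled_t_test_variance(rewards, actions)
var_n_minus_2 = baseline_ips_variance(rewards, actions, p=0.5, dof_used=2)
var_n_minus_1 = baseline_ips_variance(rewards, actions, p=0.5, dof_used=1)
```

Here `var_n_minus_2` agrees with `var_t_test` up to floating-point error, while `var_n_minus_1` is smaller by the factor (N-2)/(N-1) and therefore understates the uncertainty. The exact match with the pooled t-test relies on the p = 0.5 assumption of this sketch.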
Key Novelty
Formal Equivalence of On- and Off-Policy Estimators
  • Proves the online Difference-in-Means (DiM) estimator is mathematically identical to the offline Inverse Propensity Scoring (IPS) estimator with an optimal additive control variate
  • Demonstrates that online regression adjustments (CUPED, ML-RATE) are structurally equivalent to offline Doubly Robust (DR) estimation when the reward model is action-agnostic
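The first equivalence can be illustrated with a small numerical check. This is a minimal sketch under assumptions that the summary does not spell out: a two-arm experiment with known treatment probability `p`, importance weights w = 1/p for treated and -1/(1-p) for control units, and the control-variate coefficient `beta` estimated as the sample covariance of r*w and w divided by the sample variance of w. The function names are hypothetical, and the paper's exact formulation may differ.

```python
import random

def dim(rewards, actions):
    """Difference-in-Means: mean treated reward minus mean control reward."""
    treated = [r for r, a in zip(rewards, actions) if a == 1]
    control = [r for r, a in zip(rewards, actions) if a == 0]
    return sum(treated) / len(treated) - sum(control) / len(control)

def ips_with_baseline(rewards, actions, p):
    """IPS estimate of the treatment effect with an additive control variate.

    The weights have expectation zero, so mean(w) is a valid control variate.
    beta is fitted by least squares (an assumption of this sketch).
    """
    n = len(rewards)
    w = [1 / p if a == 1 else -1 / (1 - p) for a in actions]
    rw = [r * wi for r, wi in zip(rewards, w)]
    w_bar, rw_bar = sum(w) / n, sum(rw) / n
    cov = sum((x - rw_bar) * (wi - w_bar) for x, wi in zip(rw, w))
    var = sum((wi - w_bar) ** 2 for wi in w)
    beta = cov / var
    # Plain IPS estimate minus beta times the control variate.
    return rw_bar - beta * w_bar

# Synthetic data (made up for illustration). The realised treated share
# (70 / 200 = 0.35) deliberately differs from the design probability p = 0.3.
rng = random.Random(0)
p = 0.3
actions = [1] * 70 + [0] * 130
rewards = [rng.gauss(1.0 + 0.5 * a, 1.0) for a in actions]

dim_estimate = dim(rewards, actions)
ips_estimate = ips_with_baseline(rewards, actions, p)
```

The two estimates agree up to floating-point error. The reason is that w takes only two values, so regressing r*w on w is a saturated fit whose intercept is exactly the difference of the two arm means.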
Breakthrough Assessment
7/10
Significant theoretical contribution that cleans up the landscape of causal inference in tech. While not a new algorithm, the unification and degrees-of-freedom correction are highly valuable for correctness in practice.