What’s wrong with early stopping in hypothesis testing? R simulations show that stopping early and testing naively leads to under-coverage. Let \(\tau\) denote a stopping time (an unfortunate notation clash, since \(\tau\) is also commonly used for the ATE).
\(X_\tau\) is random for two reasons: \(X_n\) is random, and \(\tau\) itself is random. The core issue:
Power calculations \(\Rightarrow\) the sample size must be fixed beforehand.
How do you pick it? Just guess.
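To make “fix sample size beforehand” concrete, here is a minimal sketch of the standard fixed-\(n\) power calculation for a two-sided one-sample \(z\)-test with known \(\sigma = 1\): \(n = \lceil ((z_{1-\alpha/2} + z_{1-\beta})/\delta)^2 \rceil\). The function name and the example effect size \(\delta = 0.5\) are illustrative, not from the notes.

```python
from math import ceil
from statistics import NormalDist

def z_test_sample_size(delta, alpha=0.05, power=0.80):
    """Smallest n so a two-sided one-sample z-test (sigma = 1 known)
    detects a standardized effect `delta` with the given power.
    Illustrative helper; the notes do not name this function."""
    nd = NormalDist()
    z_a = nd.inv_cdf(1 - alpha / 2)   # ~1.96 for alpha = 0.05
    z_b = nd.inv_cdf(power)           # ~0.84 for power = 0.80
    return ceil(((z_a + z_b) / delta) ** 2)

print(z_test_sample_size(0.5))  # -> 32
```

The guesswork lives in \(\delta\): pick a different effect size and \(n\) changes dramatically, which is exactly why fixed-\(n\) planning feels arbitrary.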
Simulation: Repeated \(z\)-Tests Under \(H_0\)
Setup. Under \(H_0\), draw \(X_i \overset{\text{iid}}{\sim} N(0,1)\) and compute the running \(z\)-statistic \(Z_t = \frac{1}{\sqrt{t}} \sum_{i=1}^{t} X_i\) after each new observation. Stop and reject as soon as \(|Z_t| > z_{0.975} \approx 1.96\). Even though \(H_0\) is true, this “peek-and-stop” strategy rejects far more than 5% of the time.
The peek-and-stop strategy rejects \(H_0\) roughly 41% of the time despite the nominal \(\alpha = 0.05\): a large inflation of the Type I error rate. This gets worse as \(n_{\max}\) grows. By the law of the iterated logarithm, \(|Z_t|\) crosses any fixed threshold infinitely often, so with unlimited peeks you eventually reject with probability 1 under the null.
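The simulation above can be sketched as follows. The notes mention R; this is a Python translation, with an assumed horizon `n_max = 200` and 4000 replications (the exact rejection rate depends on these choices, so the printed number will not match 41% exactly).

```python
import numpy as np

def peek_and_stop_rejection_rate(n_max=200, n_sims=4000, seed=0):
    """Estimate the Type I error of a z-test that peeks after every
    observation and rejects at the first crossing of |Z_t| > 1.96.
    Data are drawn under H0, so the nominal rate should be 0.05."""
    rng = np.random.default_rng(seed)
    crit = 1.959963984540054                     # z_{0.975}
    x = rng.standard_normal((n_sims, n_max))     # H0: X_i ~ N(0, 1)
    t = np.arange(1, n_max + 1)
    z = np.cumsum(x, axis=1) / np.sqrt(t)        # running z-statistic Z_t
    return (np.abs(z) > crit).any(axis=1).mean() # rejected at some peek?

print(peek_and_stop_rejection_rate())  # far above the nominal 0.05
```

Increasing `n_max` pushes the estimated rate higher still, matching the claim that unlimited peeking drives the Type I error toward 1.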