DataMarvin

This post walks through a case study problem I worked on recently — an A/B/C test where interim p-values had been checked on each of the first three days. Working through it gave me a clear opportunity to apply group sequential testing in a realistic setting, and the decision logic turned out to be more nuanced than it first appeared.

The Setup

The experiment had three arms: A as the control, and B and C as treatment variants. Both B and C showed negative coefficients relative to A — meaning both treatments appeared to reduce the outcome of interest. But the two variants told very different stories when you looked at the day-by-day p-values.

Variant B was negative and statistically significant on all three days, with p-values below 0.001 on each look. The coefficient itself was also consistently declining: the estimated effect was getting more negative over time, not stabilizing. Variant C was also negative, but far less decisive. Its coefficient moved around across days, and by day 3 the p-value had risen to 0.08, which is above the conventional 0.05 threshold and so not significant even before any adjustment for repeated looks.

The Problem with Naïve Repeated Looks

The core statistical issue is Type I error inflation. If you check a p-value on days 1, 2, and 3 and reserve the right to stop at any point, you're effectively running multiple tests against the same null hypothesis. Each additional look is another opportunity to cross α = 0.05 by chance — even when there's no true effect. The more looks, the higher the probability of a false positive.
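To get a feel for the size of the inflation, here is a minimal simulation sketch. It is not taken from the case study data. It assumes a two-sided z-test on a cumulative statistic, equal amounts of data arriving each day, and no true effect; the function name and parameters are illustrative.

```python
import math
import random
from statistics import NormalDist


def any_look_false_positive_rate(n_looks=3, alpha=0.05, n_sims=20000, seed=0):
    """Estimate P(at least one look has p < alpha) when the null is true.

    Assumption: each day contributes an independent, equally sized block of
    data, so the cumulative z-statistic after k days is S_k / sqrt(k), where
    S_k is a sum of k standard normal increments.
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    false_positives = 0
    for _ in range(n_sims):
        cumulative = 0.0
        for k in range(1, n_looks + 1):
            cumulative += rng.gauss(0.0, 1.0)  # one day's increment under H0
            z = cumulative / math.sqrt(k)
            if abs(z) > z_crit:
                false_positives += 1  # stopped early on a chance crossing
                break
    return false_positives / n_sims


one_look = any_look_false_positive_rate(n_looks=1)
three_looks = any_look_false_positive_rate(n_looks=3)
print(f"one look:    {one_look:.3f}")
print(f"three looks: {three_looks:.3f}")
```

Under these assumptions the single-look rate stays near the nominal 0.05, while allowing a stop at any of three looks pushes it to roughly double that. The looks share data, so the inflation is smaller than it would be for three independent tests, but it is far from negligible.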

The standard fix is to pre-register an alpha spending plan: allocate the total Type I error budget across the planned interim looks before the experiment starts, so no single look consumes more than its pre-specified share.

My Approach: Group Sequential Testing at Days 1, 3, and 7

Rather than evaluating each day at a flat α = 0.05, I applied a group sequential framework with three pre-specified looks: day 1, day 3, and day 7. The spending schedule I used was:

Day 1: α = 0.001
Day 3: α = 0.010
Day 7: α = 0.040

These sum to 0.051, marginally over the standard 0.05 budget, and the shape follows an O'Brien–Fleming-style back-loaded schedule: very little alpha is spent at the early looks and most of it is held for the final one. The logic behind the look schedule is practical. Days 1 and 3 catch fast-moving effects early, while day 7 ensures coverage across a full week, capturing day-of-week variation that shorter windows would miss.
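The arithmetic behind the schedule can be checked in a few lines. This is a small sketch rather than a formal alpha spending function: it treats the three thresholds as nominal per-look cutoffs, in which case their sum is a union (Bonferroni-style) upper bound on the overall Type I error. The names `spending_schedule`, `total_alpha`, and `share_of_budget` are illustrative.

```python
import math

# Nominal per-look thresholds from the schedule (day -> alpha).
spending_schedule = {1: 0.001, 3: 0.010, 7: 0.040}


def total_alpha(schedule):
    """Sum of per-look thresholds.

    By the union bound, the chance of crossing at least one threshold under
    the null is no larger than this sum.
    """
    return math.fsum(schedule.values())


def share_of_budget(schedule):
    """Fraction of the total budget allocated to each look."""
    total = total_alpha(schedule)
    return {day: alpha / total for day, alpha in schedule.items()}


total = total_alpha(spending_schedule)
shares = share_of_budget(spending_schedule)

print(f"total alpha: {total:.3f}")
for day, share in shares.items():
    print(f"day {day}: {share:.1%} of the budget")
```

Run on the schedule above, this shows the day 7 look holding roughly 78% of the budget and day 1 about 2%.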

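If the treatment effect genuinely shifts after day 3 and you need more time to be sure, extending to day 10 is reasonable, as long as that extra look is written into the plan up front with its own share of the alpha budget rather than added once the day 7 result is known. But if a 7-day window is already pushing the limits of what stakeholders will wait for, the schedule above is a defensible stopping point.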

Applying the Framework to B and C

Under this spending plan, Variant B clears the threshold. Its p-value was below 0.001 on both day 1 and day 3, which crosses the day 1 boundary of 0.001 and sits far below the day 3 boundary of 0.010. The consistently declining coefficient reinforces this: the signal is not only significant but directionally stable and strengthening. I would call B a genuine negative effect and recommend stopping that arm early.

Variant C is a different story. The p-value reached 0.08 on day 3, which fails the adjusted threshold of 0.010 at that look — and the coefficient has been moving around rather than converging. That pattern is more consistent with noise than with a real effect. I would not call C significant at this stage. The right call is to continue to the day 7 look before drawing any conclusion, applying the remaining 0.040 of the alpha budget at that point.
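The decision rule for both arms can be written down as a short sketch. The thresholds come from the schedule; the function names are illustrative. The value 0.0009 for Variant B is a stand-in for "below 0.001", since the exact p-values are not reported here, and Variant C's day 1 p-value is left out for the same reason.

```python
# Per-look thresholds from the spending plan (day -> alpha).
SPENDING_SCHEDULE = {1: 0.001, 3: 0.010, 7: 0.040}


def decide(day, p_value, schedule=SPENDING_SCHEDULE):
    """Return 'stop' if the p-value crosses that look's threshold, else 'continue'."""
    if day not in schedule:
        raise ValueError(f"day {day} is not a pre-specified look")
    return "stop" if p_value < schedule[day] else "continue"


def first_crossing(p_values_by_day, schedule=SPENDING_SCHEDULE):
    """Return the first planned look at which the arm crosses, or None."""
    for day in sorted(p_values_by_day):
        if decide(day, p_values_by_day[day], schedule) == "stop":
            return day
    return None


# Assumption: 0.0009 stands in for "p < 0.001"; it is not a reported value.
observed = {
    "B": {1: 0.0009, 3: 0.0009},
    "C": {3: 0.08},  # day 1 p-value for C is not given, so it is omitted
}

for arm, p_values in observed.items():
    day = first_crossing(p_values)
    if day is None:
        print(f"Variant {arm}: no boundary crossed yet, continue to the next look")
    else:
        print(f"Variant {arm}: boundary crossed at day {day}, stop this arm")
```

With these inputs B crosses at its first look, while C has crossed nothing by day 3 and carries on to day 7.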

What This Case Illustrates

The value of the sequential framework here isn't just statistical correctness — it's clarity of decision. Without a pre-specified spending plan, both B and C might get lumped together as "trending negative" with no principled way to distinguish them. With the plan in place, one arm has already crossed its threshold and one hasn't. That distinction matters when you're deciding whether to roll back a feature, continue running an experiment, or escalate a finding to product and engineering.

The full technical background — alpha spending functions, information fractions, and the tradeoffs between O'Brien–Fleming and Pocock boundaries — is covered in the companion post: Sequential A/B Testing and Alpha Spending.