
P-Hacking in Startups


Speed kills rigor. In startups, the pressure to ship fast pushes teams to report anything that looks like an improvement. That’s how p-hacking happens. This piece breaks down three common cases—and how to avoid them.

Example 01: Multiple comparisons without correction

Imagine you're a product manager trying to optimize your website’s dashboard. Your goal is to increase user signups. Your team designs four different layouts: A, B, C, and D.

You run an A/B/n test. Users are randomly assigned to one of the four layouts and you track their activity. Your hypothesis is: layout influences signup behavior.

You plan to ship the winner if the p-value for one of the layouts falls below the conventional threshold of 0.05.
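To make the setup concrete, here is a minimal sketch of what that readout might look like. The signup counts, the use of the current dashboard as a control group, and the per-layout z-tests are illustrative assumptions, not the article's actual data or method:

```python
# Hypothetical A/B/n readout: each layout's signup rate is compared against
# the current dashboard (control). All numbers below are made up.
from statsmodels.stats.proportion import proportions_ztest

control = (500, 10_000)  # (signups, visitors) on the current dashboard
layouts = {
    "A": (495, 10_000),
    "B": (565, 10_000),  # lands just under p = 0.05 with these made-up counts
    "C": (530, 10_000),
    "D": (478, 10_000),
}

for name, (signups, visitors) in layouts.items():
    # Two-sample z-test on signup proportions: layout vs. control
    _, p_value = proportions_ztest(
        count=[signups, control[0]],
        nobs=[visitors, control[1]],
    )
    flag = "ship it?" if p_value < 0.05 else ""
    print(f"Layout {name}: p = {p_value:.3f} {flag}")
```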

Then you check the results:

Option B looks best. p = 0.041. It floats to the top as if inviting action. The team is satisfied and ships it.

But the logic beneath the 0.05 cutoff is more fragile than it appears. That threshold controls the false-positive rate for a single comparison, and you made four. Each extra comparison is another chance for noise to cross the line, which alone inflates the odds of a false positive.
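To put a rough number on it: if the four comparisons were independent and none of the layouts actually moved signups, the chance of at least one p-value dipping below 0.05 would be 1 − 0.95⁴, about 18.5% rather than 5%. A two-line check (the independence assumption is a simplification for illustration):

```python
# Chance of at least one false positive across k independent tests at alpha,
# assuming no real effect anywhere (independence is a simplifying assumption).
alpha, k = 0.05, 4
print(f"Family-wise false-positive rate: {1 - (1 - alpha) ** k:.3f}")  # ~0.185
```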

Let’s look at what that actually means.

Setting a p-value threshold of 0.05 is equivalent to saying: "For any single comparison, I’m willing to accept a 5% chance of shipping something that only looked good by chance."
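A quick way to see how that 5% compounds is to simulate the experiment under the null: every layout converts at exactly the same rate as the control, so any "winner" is pure noise. In the sketch below, the conversion rate, sample size, and layout-vs-control comparison scheme are all assumptions for illustration, yet far more than 5% of simulated experiments still produce at least one p-value below 0.05:

```python
# Null simulation: all four layouts convert at the same 5% rate as the
# control, so every "significant" result is a false positive. Rates, sample
# sizes, and the layout-vs-control comparisons are illustrative assumptions.
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

rng = np.random.default_rng(0)
n_experiments, n_users, true_rate = 2_000, 10_000, 0.05

spurious_winners = 0
for _ in range(n_experiments):
    control = rng.binomial(n_users, true_rate)
    variants = rng.binomial(n_users, true_rate, size=4)
    p_values = [
        proportions_ztest(count=[v, control], nobs=[n_users, n_users])[1]
        for v in variants
    ]
    if min(p_values) < 0.05:  # at least one layout "wins" by chance alone
        spurious_winners += 1

print(f"Experiments with a spurious winner: {spurious_winners / n_experiments:.1%}")
# Close to 1 - 0.95**4 ≈ 18.5%; slightly lower because the four tests share
# one control group and are therefore positively correlated.
```

One common guard, implied by this example's title, is to correct for the number of comparisons before declaring a winner, for instance by only shipping when a p-value falls below 0.05/4 (a Bonferroni-style adjustment).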
