Published on May 16, 2025 8:34 PM GMT
This post is not "the solution to p-hacking" or even "the problem with p-hacking". But I had some recent insights into the thinking behind the p-value approach and wanted to see how they apply to p-hacking as an issue. While p-values are a flawed approach to reporting final results, they do provide a useful heuristic for comparing methodologies.
Basic idea: P-values implicitly punish sloppy methodology.
Suppose there are two different ways of testing a hypothesis that happen to yield the same results. When updating our expectations of future experience, we won't see a difference, because the results are the same. However, if we use p-values to measure our confidence in the result, they will differ depending on the methodology used. This is unhelpful for updating expectations, but valuable if you are trying to compare methods.
The key here is that p-values compare all methods against a uniform yardstick: the chance of being fooled under the null hypothesis. If a method is more likely to be tricked by random variation, it gets a weaker p-value.
Consider, then, a typical example of p-hacking. Hart is a researcher who suspects that in a given situation dependent variable B is correlated with independent variable A. They run an experiment measuring B against A. But at the same time, they measure and record C, D, E, etc. So, if A ~ B ends up at p = 10%, Hart can keep testing pairs of variables until they find, say, C ~ F at p = 4.3% and publish this 'statistically significant' result.
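To make this concrete, here is a minimal Python sketch of Hart's procedure when the null is true for every variable (this is my illustration, not from the original experiment; the sample size and trial count are arbitrary). It generates eight columns of independent noise, tests every pair, and counts how often at least one pair comes out 'significant' at the 5% level.

```python
# Minimal sketch of Hart-style p-hacking under a true null: every variable is
# independent noise, yet scanning all pairwise correlations often turns up at
# least one "significant" result.
from itertools import combinations

import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
n_subjects, n_vars, n_trials = 30, 8, 2000   # illustrative sizes, not from the post

hits = 0
for _ in range(n_trials):
    data = rng.normal(size=(n_subjects, n_vars))   # null is true everywhere
    p_min = 1.0
    for i, j in combinations(range(n_vars), 2):    # all 28 pairwise tests
        _, p = pearsonr(data[:, i], data[:, j])
        p_min = min(p_min, p)
    if p_min < 0.05:
        hits += 1

print(f"Fraction of experiments with a 'significant' pair: {hits / n_trials:.2f}")
```

The 28 tests are not strictly independent of one another, but the fraction should land close to the analytic figure worked out below.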
But here is the rub: the method that Hart used included testing variables until they found something significant. Even the flawed reporting method of the p-value approach has something to say about this. This is yet another motivated stopping method.
Suppose we measure eight variables in total. There are then 28 pairwise interactions. These do not have 28 separate null hypotheses. When considering Hart's method as a whole, they have one null hypothesis: the combination of all 28 default assumptions. The probability of finding a 'statistically significant' relation somewhere in this dataset is 1 - 95%^28 ≈ 76.2%. Better than 3/4 of the time.
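The arithmetic takes only a couple of lines (a quick check of the figures above, nothing more):

```python
# Family-wise false-positive rate for 8 variables tested pairwise at the 5% level.
from math import comb

n_vars, alpha = 8, 0.05
n_pairs = comb(n_vars, 2)                 # 28 pairwise comparisons
p_spurious = 1 - (1 - alpha) ** n_pairs   # chance of at least one false positive
print(n_pairs, round(p_spurious, 3))      # 28 0.762
```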
In fact, at the 5% level it only takes six variables to have better than 50% odds of finding a spurious result. (This counts only pairwise comparisons. It is even more ridiculous if you include things like 'C mediates the relationship between B and A'.) At the 1% level it takes only 13 variables.
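Those counts can be checked with a small search (a sketch; the function name is mine):

```python
# Smallest number of variables whose pairwise comparisons give a better-than-even
# chance of at least one spurious 'significant' result at a given alpha.
from math import comb

def min_vars_for_coin_flip(alpha: float) -> int:
    n = 2
    while 1 - (1 - alpha) ** comb(n, 2) <= 0.5:
        n += 1
    return n

print(min_vars_for_coin_flip(0.05))  # 6
print(min_vars_for_coin_flip(0.01))  # 13
```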
This is a well-understood issue, and proper reporting should include how many variables were available. With eight variables the standard approach (Bonferroni correction) says any single pairwise comparison needs to reach the 0.05/28 ≈ 0.0018 level to count as 'statistically significant' overall.
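As a quick sanity check (again just a sketch), that corrected threshold does pull the family-wise false-positive rate back under 5%:

```python
# Bonferroni-corrected threshold for 28 pairwise comparisons, and the resulting
# family-wise false-positive rate under the null.
from math import comb

n_pairs = comb(8, 2)                                # 28
threshold = 0.05 / n_pairs                          # ~0.0018 per comparison
family_wise = 1 - (1 - threshold) ** n_pairs        # chance of any false positive
print(round(threshold, 4), round(family_wise, 3))   # 0.0018 0.049
```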
P-hacking is often presented as a failure of the p-value framework. But the deeper failure is epistemic: the selective reporting of results, regardless of the statistical tool used. And as we saw above, even the frequentist approach has plenty to say in reproach of p-hacking.