P-Values Know When You're Cheating

This article discusses the use of p-values in research and the problem of p-hacking. Although p-values are a flawed way to report final results, they provide a useful heuristic for comparing different research methodologies. By working through the logic behind p-values, the article shows how p-values penalize sloppy methodology and examines the selective reporting at the heart of p-hacking. The author argues that the core problem with p-hacking is selective reporting rather than the p-value framework itself: even within a frequentist framework, p-values give grounds to object to p-hacking.

🔍 As a tool for comparing methods, p-values penalize sloppy methodology. When two methods arrive at the same result, p-values can reflect a difference in confidence based on the methods themselves.

🧪 The core of p-hacking is the selective reporting of results, not the p-value framework itself. A researcher who keeps testing variables in a multi-variable analysis until a 'significant' result turns up, and then publishes it, is p-hacking.

📊 A probability calculation shows that even experiments with only a handful of variables have a high chance of yielding a spurious result. For example, at the 5% significance level, just 6 variables give better than 50% odds of finding a spurious result.

💡 The key to addressing p-hacking is better reporting: state how many variables were tested and apply an appropriate statistical correction, such as the Bonferroni correction, so that reported results remain reliable.

Published on May 16, 2025 8:34 PM GMT

This post is not "the solution to p-hacking" or even "the problem with p-hacking". But I had some recent insights into the thinking behind the p-value approach and wanted to see how they apply to p-hacking as an issue. While p-values are a flawed approach to reporting final results, they do provide a useful heuristic for comparing methodologies.


Basic idea: P-values implicitly punish sloppy methodology.

Suppose there are two different ways of testing a hypothesis that happen to yield the same results. When updating our expectations of future experience, we won't see a difference, because the results are the same. However, if we use p-values to measure our confidence in the result, the p-values will differ depending on the methodology used. This is unhelpful for updating expectations, but valuable if you are trying to compare methods.
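A textbook stopping-rule example (a hypothetical illustration, not from the post) makes this concrete: two experimenters observe the exact same data, 9 heads and 3 tails, but one fixed the number of flips at 12 in advance while the other flipped until the third tail appeared. The data are identical, yet the p-values against a fair coin differ, because the p-value depends on the procedure that generated the data and not just on the data.

```python
# Same data, different methodologies, different p-values (one-sided test of a
# fair coin against "biased towards heads"). These numbers are a standard
# textbook example, not taken from the post.
from math import comb

# Method 1: the number of flips (12) was fixed in advance.
# p-value = P(>= 9 heads in 12 fair flips)
p_fixed_n = sum(comb(12, k) for k in range(9, 13)) / 2**12

# Method 2: flipping continued until the 3rd tail appeared.
# p-value = P(the 3rd tail takes >= 12 flips) = P(<= 2 tails in the first 11 flips)
p_stop_at_third_tail = sum(comb(11, k) for k in range(0, 3)) / 2**11

print(round(p_fixed_n, 3), round(p_stop_at_third_tail, 3))  # 0.073 0.033
```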

The key here is that p-values compare all methods against a uniform yardstick: the chance of being fooled under the null hypothesis. If a method is more likely to be tricked by random variation, it gets a weaker p-value.


Consider then a typical example of p-hacking. Hart is a researcher who suspects that, in a given situation, dependent variable B is correlated with independent variable A. They run an experiment measuring B against A. But at the same time, they measure and record C, D, E, etc. So, if A ~ B ends up at p = 10%, Hart can keep testing variables until they find, say, C ~ F at p = 4.3% and publish this 'statistically significant' result.
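To see how often this procedure 'finds something' when nothing is there, here is a minimal Monte Carlo sketch (my own illustration, not from the post; the sample size, trial count, and use of Pearson correlations are assumptions):

```python
# Simulate Hart's method: generate several mutually independent variables,
# test every pair, and count how often at least one pair looks 'significant'.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_vars, n_obs, n_trials = 8, 50, 5_000

hits = 0
for _ in range(n_trials):
    # The null is true by construction: no variable is related to any other.
    data = rng.standard_normal((n_obs, n_vars))
    pvals = [stats.pearsonr(data[:, i], data[:, j])[1]
             for i in range(n_vars) for j in range(i + 1, n_vars)]
    hits += min(pvals) < 0.05  # Hart 'finds something' in this dataset

print(f"Fraction of null datasets with a 'significant' pair: {hits / n_trials:.2f}")
# Lands near the ~76% back-of-envelope figure derived below (not exactly,
# since the 28 pairwise tests are not fully independent).
```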

But here is the rub: the method that Hart used included testing variables until they found something significant. Even the flawed reporting method of the p-value approach has something to say about this. This is yet another motivated stopping method.

Suppose we measure eight variables in total. There are then 28 pairwise comparisons. These do not have 28 separate null hypotheses; considering Hart's method as a whole, there is one null hypothesis: the conjunction of all 28 default assumptions. Treating the tests as independent, the probability of finding a 'statistically significant' relation somewhere in this dataset is 1 - 0.95^28 ≈ 76.2%, better than 3 times out of 4.
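The arithmetic, as a quick sketch (treating the 28 tests as independent, which is an approximation):

```python
# Back-of-envelope family-wise false-positive rate for 8 variables at the 5% level.
from math import comb

alpha, n_vars = 0.05, 8
n_pairs = comb(n_vars, 2)                    # 28 pairwise comparisons
p_any_spurious = 1 - (1 - alpha) ** n_pairs  # 1 - 0.95**28
print(n_pairs, round(p_any_spurious, 3))     # 28 0.762
```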

In fact, at the 5% level it only takes 6 variables (15 pairwise comparisons) to have better than 50% odds of finding a spurious result. (This counts only pairwise comparisons; it is even more ridiculous if you include things like 'C mediates the relationship between B and A'.) At the 1% level it takes only 13 variables (78 pairwise comparisons).
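Under the same independence approximation, a short search (again just a sketch) gives the break-even points:

```python
# Smallest number of measured variables at which the chance of at least one
# spurious pairwise 'hit' exceeds 50%, for a given per-test significance level.
from math import comb

def min_vars_for_coin_flip(alpha: float) -> int:
    k = 2
    while 1 - (1 - alpha) ** comb(k, 2) <= 0.5:
        k += 1
    return k

print(min_vars_for_coin_flip(0.05))  # 6  (15 pairs)
print(min_vars_for_coin_flip(0.01))  # 13 (78 pairs)
```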

This is a well-understood issue, and proper reporting should state how many variables were measured. With 8 variables, the standard approach (the Bonferroni correction) says any pairwise comparison needs to reach the 0.05/28 ≈ 0.0018 level to count as 'statistically significant' overall.
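In code the correction is just a division, and Hart's p = 4.3% does not survive it (a sketch under the same 8-variable setup):

```python
# Bonferroni: divide the family-wise alpha by the number of comparisons performed.
from math import comb

family_alpha, n_vars = 0.05, 8
n_pairs = comb(n_vars, 2)                 # 28
per_test_alpha = family_alpha / n_pairs   # ~0.0018
print(round(per_test_alpha, 4))           # 0.0018
print(0.043 < per_test_alpha)             # False: Hart's 'significant' pair fails
```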


P-hacking is often presented as a failure of the p-value framework. But the deeper failure is epistemic: the selective reporting of results, regardless of the statistical tool used. And as we see above, even the frequentist approach has plenty to say in reproach of p-hacking.



