Statistical takes for mech interp research and beyond

This post takes a close look at several core principles of data analysis in scientific research, particularly in mechanistic interpretability. The author stresses that p-values alone do not fully capture a study's reliability: weak p-values (e.g., p > .01) can hint at p-hacking or an insufficient sample size. The post also emphasizes the importance of independence among observations, especially with hierarchically structured data, where ignoring dependence can severely bias an analysis. In addition, the author discusses effect sizes, noting that their interpretation must be grounded in context rather than viewed in isolation. These principles are essential for improving the rigor and credibility of research.

📊 **Interpret p-values with care, and beware "weak" significance**: The post notes that merely reaching statistical significance at p < .05 is not enough to establish that a result is reliable. Studies with smaller p-values (e.g., p ≤ .005) replicate at much higher rates. If most of a study's p-values fall between .01 and .05, that may indicate p-hacking or an inadequate sample size, and the evidence is relatively weak; researchers should lean toward collecting more data in pursuit of smaller p-values (e.g., p < .001).

🔗 **Independence among observations is foundational, and hierarchical structure is easily overlooked**: In statistical analysis and classifier evaluation, independence among observations is critical. Many datasets (e.g., sequence data in mechanistic interpretability, within-subject data in psychology) have hierarchical structure; ignoring this dependence underestimates variability and yields overly optimistic p-values or classification accuracies. Using a Qwen model as an example, the author shows that even a simple logistic regression can reach inflated accuracy when similarity between tokens is ignored, underscoring the need to handle hierarchical data properly, for instance with group-k-fold cross-validation.

⚖️ **Effect sizes must be contextualized, not judged in isolation**: An effect size measures how strongly variables are related, but its meaning depends on the research context. For example, a small association between age and memory performance may be statistically significant in the lab while saying little about its practical importance in everyday life. The author suggests that in mechanistic interpretability research, one should consider how effect sizes change when variables (such as SAE latents) are aggregated, and should build intuition about the data to judge how important a finding really is.

🧮 **Controlling confounds is challenging and calls for careful analysis**: When studying the relationship between two variables, a third, unconsidered variable may influence both and create a spurious association. This matters not only for correlations and regressions, but also when interpreting classifier results or hunting for effects so large that they seem to speak for themselves. In mechanistic interpretability research, especially when exploring associations between different patterns in a model's internals, potential confounds must be carefully considered and, where possible, controlled, to keep the analysis valid.

Published on August 6, 2025 12:45 PM GMT

I am currently a MATS 8.0 scholar studying mechanistic interpretability with Neel Nanda. I’m also a postdoc in psychology/neuroscience. My perhaps most notable paper analyzed the last 20 years of psychology research, searching for trends in what papers do and do not replicate. I have some takes on statistics.

tl;dr

Small p-values are nice.
Unless they're suspiciously small.

Statistical assumptions can be bent.
Except for independence.

Practical significance beats statistical significance.
Although practicality depends on context.

The measure is not the confound.
Sometimes it’s close enough.

Readability often beats rigor.
But fearing rigor means you probably need it.

Simple is better than complex.
Complex is better than wrong.

Complex wrongs are the worst.
But permutation tests can help reveal them.

This post offers advice for frequentist and classifier-based analysis. I try to focus on points that are practical, diverse, and non-obvious. I emphasize relevancy for mechanistic interpretability research, but this post will hopefully be interesting for people working in any area, including non-researchers who just sometimes read scientific papers.[1]

1. If you want to make a claim based on p-values, don’t settle for p = .02

A p-value describes the probability that random data would produce an equally strong or stronger effect than what was observed. P-values are the cornerstone of null hypothesis significance testing, where researchers test a claim by attempting to rule out a null hypothesis, which is typically the assumption that there exists no effect or difference between groups.
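
To make that definition concrete, here is a minimal sketch (my own, in Python with numpy/scipy, not from any particular paper) that checks a t-test's p-value against the fraction of simulated null datasets producing an equally strong or stronger effect:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Two hypothetical groups; group b is shifted slightly upward.
a = rng.normal(0.0, 1.0, size=30)
b = rng.normal(0.4, 1.0, size=30)
observed = stats.ttest_ind(a, b)

# "Random data": simulate many datasets where the null hypothesis is true
# (no group difference) and record how extreme the t-statistic gets.
null_t = np.array([
    stats.ttest_ind(rng.normal(0, 1, 30), rng.normal(0, 1, 30)).statistic
    for _ in range(20_000)
])

# Two-sided p-value: how often null data is at least as extreme as the observed effect.
simulated_p = np.mean(np.abs(null_t) >= abs(observed.statistic))
print(f"analytic p = {observed.pvalue:.3f}, simulated p = {simulated_p:.3f}")
```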

1.1. Some history and culture

When frequentist statistics were first developed at the turn of the 20th century, there were competing mindsets: Fisher proposed that p-values should be interpreted on a continuous scale, where smaller p-values correspond to heightened evidence. Neyman and Pearson would later argue for a more rigid decision-theoretic framework, where p-values are compared to a predetermined significance level to make binary accept/reject decisions.

Social science papers today usually adopt the binary approach, treating any finding at p < .05 as “significant.” Although researchers and laypeople may maintain suspicion of, say, p = .049, people will commonly endorse claims like p < .001 being no better than p < .05 if both cross a predetermined significance threshold. I suspect that this thinking stems from: (i) a desire to cleverly believe intuitive things and (ii) vague knowledge that frequentist tests don’t quantify evidence the same way that Bayesian tests do. However, this mindset is wrong when conducting research or just reading papers.

1.2. For all practical purposes…

Strong p-values provide greater evidence, while weak p-values often indicate p-hacking and non-replicability. Among replication attempts in psychology, studies where the main effects were p ≤ .005 replicated 74% of the time, while those with .005 < p < .05 replicated at just 28%. Looking at papers reporting many statistics, ones where more than a third of significant p-values were p > .01 replicated just 33% of the time, whereas papers where fewer than a third of significant p-values were p > .01 replicated 62% of the time.

The easiest way to predict whether a paper will replicate is by examining its p-values.

You may know to be concerned about p = .049, but p = .02 is also problematic! In a non-p-hacked study with an adequate sample size, you should only expect around a quarter of p-values to be .01 < p < .05. If you’re reading a paper where a majority of p-values are like that, then you should probably treat it as, at best, weak evidence for whatever they are claiming (or more practically, you should avoid spending any further time reading).

True effects produce small p-values. A sentiment among some researchers is that achieving significance at .01 < p < .05 means a study efficiently collected just enough data to cross the threshold of significance. This is wrong. If you are conducting a test that has a >50% chance to yield a statistically significant (p < .05) finding supporting your hypothesis, then a majority of those significant p-values should be p < .01. If you are conducting a study that has an 80% chance of yielding a statistically significant finding supporting your hypothesis, then almost three-quarters of significant p-values will be p < .01. Furthermore, two-fifths should be at p < .001 (!); really strong p-values should be commonplace rather than rare.[2]
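
If you want to sanity-check those proportions yourself, here is a quick simulation sketch (mine, with illustrative numbers: a two-sided one-sample t-test with n = 52 and a true effect of d = 0.4 has roughly 80% power):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Illustrative setup: true effect d = 0.4 with n = 52 gives roughly 80% power
# for a two-sided one-sample t-test at alpha = .05.
n, d, sims = 52, 0.4, 20_000
pvals = np.array([
    stats.ttest_1samp(rng.normal(d, 1.0, size=n), 0.0).pvalue
    for _ in range(sims)
])

sig = pvals[pvals < .05]
print(f"power ≈ {len(sig) / sims:.2f}")
print(f"among significant results: {np.mean(sig < .01):.0%} have p < .01, "
      f"{np.mean(sig < .001):.0%} have p < .001, "
      f"{np.mean(sig >= .01):.0%} land in .01 < p < .05")
```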

This post is, in large part, about mechanistic interpretability (mech interp), and there is little impetus to p-hack in mech interp! If you have a p-value of .01 < p < .05, just go and collect more data. Data collection is usually cheap, so go pursue p < .001.[3]

2. Independence among observations is critical

Statistical independence is foundational when evaluating significance tests or classifier (probe) accuracy. Independence means that knowing one sample's value doesn't provide information about another's.

Independence issues are common: Across many fields, including mech interp, datasets are often hierarchical, where different data points are related to each other in some way.

Ignoring independence can be disastrous: Due to interdependencies, random data can produce microscopic p-values or produce seemingly perfect classifiers via test-set contamination.

2.1. Data is often hierarchical

Hierarchical structures are ubiquitous in interesting datasets. In mech interp, you might look at different residual stream measurements from the same response, different responses from the same prompt, or different prompts from the same task category. In psychology and neuroscience, you will commonly have multiple brain measurements from the same trial, multiple trials from the same participant, and possibly even multiple participants from the same group.

Across these cases, if you ignore hierarchies and treat related measures as independent observations, you will wildly underestimate the extent of variability in the real world. You will think you are generalizing (achieve a tiny p-value or high cross-validation accuracy) when you are actually overfitting.

2.2. Hierarchical interdependencies are large

I’m not trying to convince you to adhere to statistical technicalities. Instead, I want to convey suggestions with major practical implications. If your findings seem way too strong or your p-value is unexpectedly tiny, that’s a strong sign that you are messing up statistical independence somewhere (or you have some artifact in your analysis).

In many neuroscientific studies, statistical effects due to hierarchical interdependencies will tend to be much larger than those from virtually any actual effect of interest.[4] For example, if I show two participants a picture of a cow and a picture of a dead body, then even emotion-processing regions’ brain responses will tend to display much stronger similarities within-subject than within-condition. That is, Alex’s neural response to a cow will be much more similar to Alex’s response to a dead body (think r = .50) than to Bailey’s response to a cow (think r = .02). Ignoring interdependencies can easily invalidate an analysis.

In mech interp, I doubt that effects will be as severe as in human brain data, but my hunch is that great caution regarding independence is still necessary if your data is hierarchical.[5] For both brain and mech-interp analyses, the widespread use of multivariate analytic techniques will magnify the effect of non-independence and create these issues; by “multivariate,” I just mean looking at multiple dimensions, like fitting a classifier based on multiple voxels in an fMRI scan or a probe on the many dimensions of a residual stream vector.

To illustrate this for mech interp, I extracted the residual stream activity of layer 10 at every token position for R1-Qwen-1.5B responding to 50 MMLU questions. The residual stream of neighboring tokens will tend to be highly similar. I performed 50-class classification, while treating each token’s residual-stream vector as an example and pooling every example into one big dataset with over 10,000 examples. Using 5-fold cross-validation, a simple logistic regression classifier can achieve over 90% accuracy (!) (chance accuracy would be just ~2%).[6] 

If I ignored these interdependencies, it could be painful. Suppose I wanted to test whether a model represents its uncertainty at each token position, so I fit a classifier based on the token-level residual stream and attempted to predict whether the model’s final response was correct or incorrect; this hypothetical test presumes that even when a model eventually gives an incorrect response, it would be somewhat aware it may be incorrect. If I followed the above procedures, using every token’s data as an example and performing basic 5-fold cross-validation, the classifier might just learn the 50 responses’ unique intricacies rather than generalizing at all.

One way to make sure classifier accuracy reflects generalization would be to instead use group-k-fold cross-validation. Note that while this makes the evaluation correct, it doesn't change training: the classifier can still learn response-specific patterns, so including more data from the same response could very well lead to a worse classifier. I encountered a scenario like this recently, and because I only needed a quick-and-easy solution, I decided to simply take one data point from each response and discard the rest.
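
Here is a small synthetic sketch (mine, not the Qwen analysis itself) of the evaluation issue and the group-k-fold fix: every token inherits a response-specific direction, the labels are assigned randomly per response (so there is nothing generalizable to learn), and yet naive k-fold cross-validation still looks far above chance while group-k-fold correctly reports chance-level accuracy.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, GroupKFold, cross_val_score

rng = np.random.default_rng(0)
n_responses, tokens_per_response, dim = 50, 200, 64

# Each "response" gets its own random direction in activation space, and
# every token in that response is that direction plus noise.
response_means = rng.normal(0, 1, size=(n_responses, dim))
X = np.repeat(response_means, tokens_per_response, axis=0) \
    + 0.5 * rng.normal(0, 1, size=(n_responses * tokens_per_response, dim))
groups = np.repeat(np.arange(n_responses), tokens_per_response)

# Labels are random per response: a probe has nothing real to generalize from.
y = np.repeat(rng.integers(0, 2, size=n_responses), tokens_per_response)

probe = LogisticRegression(max_iter=1000)
naive = cross_val_score(probe, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0))
grouped = cross_val_score(probe, X, y, groups=groups, cv=GroupKFold(n_splits=5))

print(f"naive 5-fold accuracy:   {naive.mean():.2f}")    # well above chance: pure contamination
print(f"grouped 5-fold accuracy: {grouped.mean():.2f}")  # ~0.50, as it should be
```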

2.3. Categories allow independence to be easily achieved

It’s often pretty straightforward to convert your hierarchical data into largely independent observations when dealing with categorical/binary independent variables: for example, average all of the measurements within each group (or simply keep one measurement per group) so that each group contributes a single, independent observation, and then run a simple test on those group-level values.

2.4. With continuous variables, things are hairier

When working with continuous variables, handling hierarchical structures correctly becomes less obvious. Further, compared to the ubiquity of papers properly dealing with hierarchies in categorical variables (often just implicitly), examples of papers doing so with continuous variables are rarer.

In our paper mentioned above, we performed an analysis where we computed two matrices describing each chain of thought: (i) a matrix describing the causal relationship between sentences based on changes in token logits, and (ii) a matrix describing the causal relationship between sentences based on legible text. A response’s two matrices were the same size, with each matrix containing 1,000-40,000 elements. We hypothesized that the two matrices would be correlated, and we wanted to draw conclusions that aggregate data across our 20 responses.[7]

This data is hierarchical. Each response was associated with numerous data points. We could try a naive approach: pool the elements from all 20 responses into two giant vectors (one per matrix type) and compute a single correlation across the >100k pooled elements.

This approach would be inappropriate, and the correlation would generate a microscopic p-value because it treats the data as having >100k independent degrees of freedom, drastically inflating the apparent strength of the statistical significance.

How might this strategy go wrong? Suppose that for 11/20 responses, the two matrices are indeed positively related, whereas for 9/20 responses, the matrices are negatively correlated. In this case, it is clear that finding an overall positive relationship could be spurious, as random chance could’ve easily flipped the scenario to 9 positive and 11 negative responses. However, when the elements are pooled to form a correlation, we treat this akin to having 110k points supporting a positive association while 90k points support a negative one, which would be decisive evidence favoring the positive link.[8]

We wanted to test the association between each response’s two matrices while respecting the data’s hierarchical properties during aggregation across responses. Hence, we simply computed the element-by-element correlation for each response and reported the average correlation along with the number of responses showing a positive correlation (18/20).[9]
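
Here is a toy sketch (mine, with made-up numbers rather than our actual matrices) of why the pooled approach misleads: within each response the two measures are generated independently, yet response-level baseline shifts give the pooled correlation a large value and a microscopic p-value, while the per-response approach correctly finds no consistent element-level relationship.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_responses, elements = 20, 5_000

pooled_a, pooled_b, per_response_r = [], [], []
for _ in range(n_responses):
    # Each response has its own baseline level for both measures (an
    # across-response effect), but *within* a response the two measures
    # are independent: there is no real element-level relationship.
    baseline = rng.normal(0, 1)
    a = baseline + rng.normal(0, 1, size=elements)
    b = baseline + rng.normal(0, 1, size=elements)
    per_response_r.append(stats.pearsonr(a, b).statistic)
    pooled_a.append(a)
    pooled_b.append(b)

pooled = stats.pearsonr(np.concatenate(pooled_a), np.concatenate(pooled_b))
positive = sum(r > 0 for r in per_response_r)
print(f"pooled: r = {pooled.statistic:.2f}, p = {pooled.pvalue:.1e}")  # strong r, absurdly small p
print(f"per-response: mean r = {np.mean(per_response_r):.2f}, {positive}/{n_responses} positive")
```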

In general, averaging data to achieve independent observations and performing simple statistical tests is a thoughtful and clear solution.

3. Effect sizes are only meaningful in context

An effect size is a measure of how strongly two variables are related that is ostensibly independent of sample size (e.g., a correlation coefficient or a Cohen’s d). When fitting probes, it is useful to also think of classification accuracy as an effect size.

The concept of an effect size underlies the phrase, “Statistical significance is not the same as practical significance.” From what I can tell, this quip became more popular alongside the rise of the scientific replication crisis, which generally led to greater statistical skepticism among academics and tuned-in laypeople. This quip is generally used to encourage dismissing results with small effect sizes.

This view is sometimes productive. Everything is Correlated once you have enough data, and, compared to the social sciences, where participant recruitment can be costly, achieving large sample sizes for mech interp research can be easy. In turn, if you are interested in evaluating the associations between different measures, you may find plenty of tiny effects. You might test whether the strength of one SAE latent at a given token position correlates with the strength of a different SAE latent 32 tokens later, and find a statistically significant but small association (e.g., a mean correlation of r = .1). This type of weak link might not be too informative.

3.1. Contextualizing the quip

However, it is important to consider the context of an effect size and what other factors are at play that are tangential to the mechanisms you are interested in.

Imagine a study where researchers want to assess whether elementary students fed a larger lunch tend to perform better in standardized exams. The researchers look at 40 schools, each containing 100 students. Every student at a given school receives the same lunch, so the researchers randomly assign each school to a different condition (e.g., normal lunch, +100 calorie lunch, +200 calorie lunch).

To analyze the link between extra calories and exam performance, the researchers may measure a correlation on a student-by-student basis (i.e., 4,000 data points). This would likely yield a minuscule correlation (e.g., r = .05). Alternatively, the researchers might average exam performance by school and compute the correlation on a school-by-school basis (i.e., 40 data points), which would yield a stronger correlation (e.g., r = .20). This new correlation is larger, but for reasons orthogonal to the nutrition-performance relationship. A correlation could alternatively be buffed if (i) the exam is changed to a more reliable measure of performance by adding more questions, or (ii) the underlying manipulation could be magnified to +150 and +300 calorie lunches.
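
A quick simulation (mine, with toy numbers loosely matching the ones above) shows how the same underlying manipulation produces very different effect sizes depending on the level of analysis:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_schools, students = 40, 100

calories = rng.choice([0, 100, 200], size=n_schools)   # school-level condition
school_effects = rng.normal(0, 2, size=n_schools)      # other school-level factors

# Student exam scores: a small true calorie effect plus large student-level noise.
student_calories = np.repeat(calories, students)
scores = (0.015 * student_calories
          + np.repeat(school_effects, students)
          + rng.normal(0, 15, size=n_schools * students))

r_students = stats.pearsonr(student_calories, scores).statistic
r_schools = stats.pearsonr(calories, scores.reshape(n_schools, students).mean(axis=1)).statistic
print(f"student-level r = {r_students:.2f}")  # small: swamped by student-level noise
print(f"school-level r  = {r_schools:.2f}")   # several times larger: student noise averages out
```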

In studies less attached to real-world outcomes, these factors make it very difficult to draw claims about “practical” significance based on an effect size. For instance, one psychology study might show a correlation of r = -.20 between participant age and performance on a memory test, while another study shows a correlation of r = -.30 between age and performance on a decision-making test. Laboratory effect sizes will say little about the degree to which older adults should expect their memory or decision-making to deteriorate. Meaningfully contrasting the effect sizes across different studies would be difficult, let alone evaluating the extent to which older adults should actively worry about this in their day-to-day.

3.2. Back to mech interp

You may find yourself in a similar mech interp scenario. You may find stronger correlations when using SAE latents rather than focusing on the activations of neurons one at a time. Likewise, you may find stronger links when using SAE latent data pooled across many tokens rather than SAE latents at single tokens. These same principles apply when fitting classifiers. A probe that achieves moderate accuracy when classifying data based on a single token or sentence may be more impressive than a classifier operating on data aggregated across a whole response.

It’s hard to come up with a clean takeaway message here. Think and work to build intuition about data so that you can better assess how impressive different findings are.

4. Perfectly controlling confounds is hard

Sometimes, you’ll be interested in two variables’ relationship but a third non-targeted variable may be linked to both and contribute to their association in an undesired way. This concept is usually thought of in terms of correlations and regressions. However, it is also worth considering confounds when drawing conclusions from classifiers or when searching for effects so huge that statistics are seemingly unnecessary.

For mech interp, confounds are important to think about when looking at associations between different patterns within a model’s internals, as confounds can be very large. Here is a realistic example: suppose you find that pairs of tokens sharing an active SAE latent have more similar residual streams than random pairs of tokens. Token proximity (neighboring tokens tend to have highly similar residual streams) and token position within the response are confounds that could each drive much of that apparent association.

4.1. Challenging confounds

When controlling for major confounds, you should think about reliability and the relationship structure (e.g., linearity). This is particularly important when a confound’s impact is large.

For the purposes of controlling a confound, you can think of reliability as the extent to which a given measure captures an underlying construct. You may be conducting a study where you want to control for participant intelligence. To do this, it is common to use a participant’s educational attainment (years of school completed) as a proxy for intelligence. However, the measure is not the construct. If educational attainment explains 25% of the variability in intelligence (r = .5, R² = .25), then when you control for education, you’ll still leave the majority of intelligence’s effects intact.
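
A toy simulation (mine, with a hypothetical exposure and outcome that are linked only through intelligence) makes the problem concrete: "controlling" for a noisy proxy barely dents the confounded association.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 100_000

intelligence = rng.normal(0, 1, n)
# Education is only a proxy: correlated ~.5 with intelligence (R^2 = .25).
education = 0.5 * intelligence + np.sqrt(0.75) * rng.normal(0, 1, n)

# Hypothetical exposure and outcome that are related *only* via intelligence.
exposure = intelligence + rng.normal(0, 1, n)
outcome = intelligence + rng.normal(0, 1, n)

def residualize(y, x):
    """Remove the linear effect of x from y (i.e., 'control for' x)."""
    slope = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
    return y - slope * x

raw = stats.pearsonr(exposure, outcome).statistic
controlled = stats.pearsonr(residualize(exposure, education),
                            residualize(outcome, education)).statistic
print(f"confounded r = {raw:.2f}, r after controlling for education = {controlled:.2f}")
```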

For mech interp, your measures may be highly reliable, but controlling for their relationship with other variables may be difficult. The typical strategy for controlling a confound is to include it as a variable in a linear regression. However, this assumes the confound’s relationship with other variables is linear. Yet, most relationships are not linear. The position and proximity effects mentioned above are certainly not. If you simply control for them as linear terms in a regression, you will fail to actually scrub their effects out of the data.[10][11]
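
And here is a sketch (mine, with a made-up symmetric confound standing in for something like token proximity) of how a purely linear control can remove essentially none of a strongly non-linear confound's effect:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 50_000

# A confound that affects both measures non-linearly; because its effect is
# symmetric around zero, it has no *linear* component for a regression to find.
c = rng.uniform(-1, 1, n)
a = c**2 + 0.3 * rng.normal(0, 1, n)
b = c**2 + 0.3 * rng.normal(0, 1, n)

def residualize(y, predictors):
    """Remove the least-squares fit of the given predictor columns from y."""
    X = np.column_stack([np.ones(len(y))] + list(predictors))
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta

linear = stats.pearsonr(residualize(a, [c]), residualize(b, [c])).statistic
flexible = stats.pearsonr(residualize(a, [c, c**2]), residualize(b, [c, c**2])).statistic
print(f"r after linear control for c:      {linear:.2f}")    # barely moves
print(f"r after also controlling for c^2:  {flexible:.2f}")  # ~0
```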

4.2. Bright possibilities

Confounds being hard to perfectly control doesn’t mean everything is doomed, nor do you always need to implement obscure methods for controlling confounds: often, a lazy control is informative on its own. If including a rough version of the confound in your analysis barely changes your relationship of interest, then the confound probably isn’t what’s driving your finding.

Conversely, if your lazy control leads to your relationship of interest dropping in size by, say, 33% or 50%, then that is a strong sign that your relationship might truly be entirely confounded, and a more rigorous method could find that. In that case, good luck… maybe try to reorganize your experiment or hypothesis to avoid the confound? Alternatively, you can maybe get an LLM to help you find a statistical approach that is more thorough, although you might always live in fear that you aren’t fully scrubbing out a confound.

5. Simplicity is valuable, and you can usually ignore statistical tests’ assumptions

Most statistical tests come with many assumptions. Pearson correlations technically assume: (1) continuous variables, (2) linear relationships, (3) bivariate normality, meaning that the distribution forms an elliptical cloud, (4) homoscedasticity, meaning that the variance of each variable is stable as the other variable changes, and (5) no extreme outliers. Evaluation of statistical significance more generally assumes (6) independence among observations.

In practice, you can violate most of these without practical consequences, except for the importance of independence (discussed above) and considering outliers (which can entirely determine your result). Aside from these two, addressing violations with more obscure methods will generally just lead to your analyses becoming harder to read.[12] 

These claims are not just me being a heterodox cowboy! The general claims have been Peer Reviewed™.

5.1. Readability matters

Your goal should be to communicate something to an audience. Weird statistics can be an unnecessary obstacle, decreasing accessibility and reading speed. Machine learning researchers, SWEs, and others might have strong technical and mathematical skills but not necessarily broad familiarity with statistics.

If you’re going to be complex, being complex in an expected way (using a method that readers or LLMs can look into) is better than being complex in an original way.

5.2. Slight complexity that I still encourage

One easy way to make sure that a (Pearson) correlation result doesn’t come down to an outlier is to simply also test a Spearman rank correlation. A Spearman correlation simply involves transforming every value into a ranking, separately for each of your two variables, and then measuring the Pearson correlation with respect to the rankings. For analysis, I usually test Spearman correlations by default. However, when communicating results, I will generally report a normal Pearson correlation for simplicity if the Pearson correlation’s value is close to it, given the slight clarity advantages.
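
A minimal illustration (my own toy data) of why this cheap check is worth running:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Two unrelated variables...
x = rng.normal(0, 1, size=100)
y = rng.normal(0, 1, size=100)
# ...plus one extreme point that dominates the Pearson correlation.
x[-1], y[-1] = 10.0, 10.0

print(f"Pearson  r = {stats.pearsonr(x, y).statistic:.2f}")   # sizable, driven by one point
print(f"Spearman r = {stats.spearmanr(x, y).statistic:.2f}")  # near zero
```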

If you have a weird hunch from your data that you should be afraid of something (e.g., non-normality is important), then you definitely should explore the data more to get rid of that fear! This is good even if you don’t eventually end up reporting complex analyses in your results. Stuff complex statistics in an appendix if you feel they’re important.

If you’re doing weird analyses, then you should probably learn about permutation tests…

6. Complexity breeds statistical demons, but careful permutation testing can slay them

Lots of papers introduce elaborate boutique analyses to test novel hypotheses. Generally, simplicity is better, but this is not always possible.

It is not too hard to develop analyses that, at the surface level, appear to generate findings supporting a hypothesis but, in reality, yield pure mathematical artifacts.

In fMRI neuroscience, novel analyses might mix correlations, transformations, normalizations, classifications, and aggregations in new ways. Each new operation adds to the possibility of mathematical artifacts and makes them harder to spot. It’s hard to come up with a concise mech interp example of this, but here is an attempt to generalize one from neuroscience: suppose you collect residual stream activations for a set of physics and chemistry problems, z-score each dimension of the activations across examples, compute an average within-class similarity score for the physics problems and another for the chemistry problems, and then read high within-class similarity as evidence that the model represents each problem’s subject.

Can you spot why this opens the door to artifacts? Answer in footnote.[13] 

In many cases, it will be infeasible for readers or even technically competent and careful reviewers to spot these issues. In some cases, even you might not have spotted them. When you’re fitting classifiers, you need to be wary of exotic ways for test set contamination to happen.

Permutation testing is the gold standard for checking whether your weird analysis has a problem. This applies when either evaluating statistical significance or fitting a classifier.

Permutation testing essentially involves shuffling your dependent variable early in the pipeline and then seeing what results come out. For the physics and chemistry example, you would want to: (i) take all of the physics/chemistry labels and assign them to examples randomly, (ii) redo the analysis, (iii) repeat this >100 times to get a sense of what your result distribution looks like when your dependent variable is random. For example, if 20% of the shuffled runs produce a result at least as strong as your original finding, then your p-value is only p = .20.[14] 
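
In code, the recipe is only a few lines. Here is a generic sketch (mine), where `my_weird_analysis` is a stand-in for whatever elaborate pipeline you have built:

```python
import numpy as np

rng = np.random.default_rng(0)

def my_weird_analysis(X, labels):
    """Stand-in for your elaborate pipeline; it just needs to return one
    number summarizing the apparent effect."""
    return X[labels == 1].mean() - X[labels == 0].mean()

# Hypothetical per-example data and class labels (e.g., physics vs. chemistry).
X = rng.normal(0, 1, size=500)
labels = rng.integers(0, 2, size=500)

observed = my_weird_analysis(X, labels)

# (i) shuffle the labels, (ii) redo the analysis, (iii) repeat many times.
null = np.array([
    my_weird_analysis(X, rng.permutation(labels)) for _ in range(1_000)
])

# p-value: how often shuffled labels do at least as well as the real ones.
p = np.mean(np.abs(null) >= abs(observed))
print(f"observed effect = {observed:.3f}, permutation p = {p:.3f}")
```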

Permutation testing is really nice for sanity checking, among other uses.

7. Conclusion

That’s all I’ve got. Feel free to read my (hopefully friendly) opening poem again for a concluding summary.

  1. ^

     Mentioning frequentism, some readers may react: Wait, aren’t Bayesian methods just better? I say: No! Unnecessary Bayesian analyses will damage clarity, and in practice, many papers end up just treating Bayesian methods like frequentist ones anyhow (e.g., using Bayes Factor thresholds of >10 for "weak" or >100 for "strong" evidence). There are cases where Bayesian techniques are helpful, but from what I can tell, they are a small minority, and frequentist statistics are great for communicating self-contained research outputs.

  2. ^

     These “>50%” and “80%” percentages refer to “statistical power”. Statistical power is the likelihood of finding evidence significantly supporting your hypothesis, given your hypothesized effect size, your collected sample size, and your significance threshold. For example, if I hypothesize that two variables are correlated at r = .3 and I collect 40 data points, then I have an X% chance of finding a significant (p < .05) correlation (i.e., I have X% power). Power calculations are only as trustworthy as the input variables, but social science studies are generally recommended to have at least 80% power. This footnote is just meant to be fun facts, and you almost certainly don’t need to worry about formal calculations of power for mech interp work, although you may find this concept useful to think about informally. Of note, if your final p-value is .01 < p < .05, that means your experiment was likely underpowered and you got lucky in finding an effect… so you should probably go collect more data.

  3. ^

     It is generally considered a “questionable research practice” to test your hypothesis after each data point has been collected and then cease collection once you’ve hit statistical significance. It is true that if your significance threshold is p < .05 for a one-sample t-test, you are willing to collect up to 100 data points, and you test for significance after each point, then you will achieve p < .05 about 12% of the time rather than the desired 5% of the time. However, if you set the significance threshold lower, then you can maintain a suitable false positive rate. For example, if you repeatedly test your hypothesis and keep collecting data until you reach p < .01, then over 100 data points your overall false positive rate will remain only about 5%, and thus you’ve effectively achieved p < .05. Some papers have talked about the math behind this. Repeated hypothesis testing in combination with an optimally stricter threshold (e.g., p < .01) requires, on average, fewer participants to find a significant effect than simply collecting a predetermined sample size and only testing the hypothesis once at p < .05. Much like the prior footnote, this info should be treated as mostly fun facts, and hopefully it helps you build intuition. Main takeaway: Don’t feel bad about repeatedly testing a hypothesis after each data point has been collected, so long as you use a strict threshold.
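
A rough simulation sketch (mine; the exact rates depend on when you start peeking and how many looks you allow) of the false-positive inflation described here:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def sequential_fpr(threshold, max_n=100, first_look=5, sims=1_000):
    """Null data, one-sample t-test after every new point, stop at 'significance'."""
    hits = 0
    for _ in range(sims):
        data = rng.normal(0, 1, size=max_n)  # the null is true: the mean really is 0
        for n in range(first_look, max_n + 1):
            if stats.ttest_1samp(data[:n], 0.0).pvalue < threshold:
                hits += 1
                break
    return hits / sims

# Peeking after every point inflates the false-positive rate at p < .05,
# while a stricter stopping threshold keeps the overall rate far lower.
print(f"stop at p < .05: overall FPR ≈ {sequential_fpr(.05):.2f}")
print(f"stop at p < .01: overall FPR ≈ {sequential_fpr(.01):.2f}")
```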

  4. ^

     That is, the proportion of variance attributable to between-subject differences will be far greater than the variance attributable to within-subject differences. The effective sample size - the N for which the central limit theorem applies - thus predominantly depends on the number of subjects.

  5. ^

     This is less of a problem if you have a massive sample size (e.g., you have a hundred observations per group but millions of groups).

  6. ^

     This number is so high that I’m even a bit scared I may have an error somewhere… a random forest classifier hits 65% accuracy, which feels more believable and still conveys the same point.

  7. ^

     These matrices were computationally expensive to produce or else we would’ve had much more data.

  8. ^

     Another downside to this mass-pooling strategy is that it may lead to finding effects based on across-response differences, which will usually be less interesting. That is, the different responses may vary in the average value of the two matrices (i.e., uniform shifts upward/downward), and this alone would drive the mass-pooling correlation. Even if you were interested in this, you would be better off simply averaging the data by response and then computing the correlation among the twenty averages.

  9. ^

     The technically most proper way to do this is with a multilevel linear regression. Multilevel regressions are becoming somewhat common in psychology and neuroscience research. However, you probably don’t want to do this for mech interp research because most readers will not have heard of them. The strategy we describe here is technically inappropriate for some statistical reasons, but it should be acceptable in practice.

  10. ^

     Lots of times, there are intuitive non-linear tricks you can do. For instance, in the example above where you are comparing same-SAE-latent tokens vs. random-tokens, you can do a paired analysis where for each token-pair in the former group you select one random-token-pair that is the same distance apart.

  11. ^

     Dynomight has a post touching on related issues: “Why controlling for a variable doesn't (usually) work”... although see this Contra. Both posts are consistent with the message here about how this can be a particularly big deal when confounds have massive impacts rather than being subtle.

  12. ^

     If a p-value is p = .04 vs. p = .06 based on slight variations in how your correlation or t-test is performed, then your evidence is weak anyhow and you should go collect more data or treat the finding as fake.

  13. ^

     The z-scoring will cause the similarity measure to be shifted downwards - e.g., if each token’s residual stream data was simulated randomly, then the mean similarity between tokens would be below zero. This is because one token’s vector being high in one dimension now forces other tokens’ vectors to be lower in that dimension. Furthermore, this effect will differentially impact the physics or chemistry similarity measurements depending on how many examples you have in each class (e.g., 40 physics problems but 60 chemistry ones). At least in neuroscience, this type of artifact can easily be larger than any hypothesized effect.
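
A tiny numerical check (mine) of the downward shift described here: z-score pure noise across examples, and the average between-example similarity lands below zero, by an amount that depends on how many examples you z-scored over.

```python
import numpy as np

rng = np.random.default_rng(0)
n_examples, dim = 40, 512            # e.g., 40 problems' residual stream vectors
X = rng.normal(0, 1, size=(n_examples, dim))

# z-score each dimension *across examples* (a common preprocessing step).
Xz = (X - X.mean(axis=0)) / X.std(axis=0)

# Average similarity (normalized dot product) between different examples.
sims = (Xz @ Xz.T) / dim
off_diag = sims[~np.eye(n_examples, dtype=bool)]
print(f"mean between-example similarity of pure noise: {off_diag.mean():.3f}")
# ≈ -1/(n_examples - 1): systematically below zero, not the ~0 you might expect.
```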

  14. ^

     For hierarchical data, permutation testing can be done while maintaining the hierarchical structure (e.g., only shuffling data within a given hierarchy group), which is useful for evaluating whether you are somehow violating independence assumptions.


