Statistical suggestions for mech interp research and beyond

This post offers practical statistical advice for researchers, particularly those working in mechanistic interpretability. It stresses the importance of p-values, notes that suspiciously small p-values can signal hidden problems, and recommends aiming for smaller p-values (p < .01 or even p < .001) to obtain more reliable results. The author also examines the critical role of independence among observations, warning that ignoring hierarchical structure in data can badly bias results, and offers strategies for handling such data. Finally, the post discusses how effect sizes take on different meanings in different settings, emphasizes interpreting them in context, and reminds researchers to watch for confounding variables so that their conclusions remain valid and reliable.

📊 **Interpreting and using p-values:** The post argues that p-values should be read as a continuous scale of evidence rather than a binary significant/non-significant judgment. Researchers should aim for smaller p-values (p < .01 or even p < .001), since with adequate data and no p-hacking, small p-values should be the norm. Citing replication data from psychology, the post notes that effects with p ≤ .005 replicated 74% of the time, while those with .005 < p < .05 replicated only 28% of the time.

🔗 **Why independence matters:** The author stresses that statistical independence is foundational for evaluating significance tests and classifier accuracy. In fields like mechanistic interpretability, data is often hierarchical (different tokens from the same prompt, different trials from the same participant); ignoring this structure and treating related data points as independent observations severely underestimates variability and inflates results. The author illustrates how ignoring hierarchical dependence can inflate classifier accuracy, and recommends methods such as group-k-fold cross-validation to keep evaluation valid, or averaging to obtain independent observations when working with categorical variables.

⚖️ **Effect sizes only make sense in context:** The post argues that effect sizes (such as correlation coefficients or Cohen's d) are not meaningful independent of context. In mechanistic interpretability, small but significant effects may turn up, and judging their practical relevance requires further analysis and intuition. Using an example from education research, the author shows that the size of an effect can depend on how data is aggregated (by student versus by school), so effect sizes should not be treated as a direct measure of "practical significance" without understanding the underlying data structure and study design.

⚠️ **Watch out for confounds:** The post reminds researchers that when analyzing the relationship between two variables, they must consider third variables that may affect both (confounds). Confounds can distort the apparent relationship and lead to incorrect conclusions. This applies not only to traditional regression analyses but also to classifier-based analyses and to studies looking for effects so large that statistics seem unnecessary. In mechanistic interpretability, it is especially important to consider whether associations between different patterns in a model's internals are driven by other, unconsidered factors.

Published on August 6, 2025 12:45 PM GMT

I am currently a MATS 8.0 scholar studying mechanistic interpretability with Neel Nanda. I’m also a postdoc in psychology/neuroscience. My perhaps most notable paper analyzed the last 20 years of psychology research, searching for trends in what papers do and do not replicate. I have some takes on statistics.

tl;dr

Small p-values are nice.
Unless they're suspiciously small.

Statistical assumptions can be bent.
Except for independence.

Practical significance beats statistical significance.
Although practicality depends on context.

The measure is not the confound.
Sometimes it’s close enough.

Readability often beats rigor.
But fearing rigor means you probably need it.

Simple is better than complex.
Complex is better than wrong.

Complex wrongs are the worst.
But permutation tests can help reveal them.

This post offers advice for frequentist and classifier-based analysis. I try to focus on points that are practical, diverse, and non-obvious. I emphasize relevancy for mechanistic interpretability research, but this post will hopefully be interesting for people working in any area, including non-researchers who just sometimes read scientific papers.[1]

1. If you want to make a claim based on p-values, don’t settle for p = .02

A p-value describes the probability that random data would produce an equally strong or stronger effect than what was observed. P-values are the cornerstone of null hypothesis significance testing, where researchers test a claim by attempting to rule out a null hypothesis, which is typically the assumption that there exists no effect or difference between groups.

1.1. Some history and culture

When frequentist statistics were first developed at the turn of the 20th century, there were competing mindsets: Fisher proposed that p-values should be interpreted on a continuous scale, where smaller p-values correspond to heightened evidence. Neyman and Pearson would later argue for a more rigid decision-theoretic framework, where p-values are compared to a predetermined significance level to make binary accept/reject decisions.

Social science papers today usually adopt the binary approach, treating any finding at p < .05 as “significant.” Although researchers and laypeople may maintain suspicion of, say, p = .049, people will commonly endorse claims like p < .001 being no better than p < .05 if both cross a predetermined significance threshold. I suspect that this thinking stems from: (i) a desire to cleverly believe intuitive things and (ii) vague knowledge that frequentist tests don’t quantify evidence the same way that Bayesian tests do. However, this mindset is wrong when conducting research or just reading papers.

1.2. For all practical purposes…

Strong p-values provide greater evidence, while weak p-values often indicate p-hacking and non-replicability. Among replication attempts in psychology, studies where the main effects were p ≤ .005 replicated 74% of the time, while those between .005 < p < .05 replicated at just 28%. Looking at papers reporting many statistics, ones where more than a third of significant p-values were p > .01 replicated just 33% of the time, whereas papers where fewer than a third of p-values were p > .01 replicated 62% of the time.

The easiest way to predict whether a paper will replicate is by examining its p-values.

You may know to be concerned about p = .049, but p = .02 is also problematic! In a non-p-hacked study with an adequate sample size, you should only expect around a quarter of p-values to be .01 < p < .05. If you’re reading a paper where a majority of p-values are like that, then you should probably treat it as, at best, weak evidence for whatever they are claiming (or more practically, you should avoid spending any further time reading).

True effects produce small p-values. A sentiment among some researchers is that achieving significance at .01 < p < .05 means a study efficiently collected just enough data to cross the threshold of significance. This is wrong. If you are conducting a test that has a >50% chance to yield a statistically significant (p < .05) finding supporting your hypothesis, then a majority of those significant p-values should be p < .01. If you are conducting a study that has an 80% chance of yielding a statistically significant finding supporting your hypothesis, then almost three-quarters of significant p-values will be p < .01. Furthermore, two-fifths should be at p < .001 (!); really strong p-values should be commonplace rather than rare.[2]
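
These percentages are easy to check with a quick simulation. Below is a minimal sketch; the sample size and effect size are my own illustrative choices, tuned to give roughly 80% power for a two-sample t-test at p < .05:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Two groups of n = 50 with a true effect of d = 0.57 gives roughly 80% power at p < .05.
n, d, n_sims = 50, 0.57, 20_000
pvals = np.empty(n_sims)
for i in range(n_sims):
    a = rng.normal(0.0, 1.0, n)
    b = rng.normal(d, 1.0, n)
    pvals[i] = stats.ttest_ind(a, b).pvalue

sig = pvals[pvals < .05]
print(f"power:                     {sig.size / n_sims:.2f}")   # ~0.80
print(f"P(p < .01  | significant): {np.mean(sig < .01):.2f}")   # ~0.73
print(f"P(p < .001 | significant): {np.mean(sig < .001):.2f}")  # ~0.41
```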

This post is, in large part, about mechanistic interpretability (mech interp), and there is little impetus to p-hack in mech interp! If you have a p-value of .01 < p < .05, just go and collect more data. Data collection is usually cheap, so go pursue p < .001.[3]

2. Independence among observations is critical

Statistical independence is foundational when evaluating significance tests or classifier (probe) accuracy. Independence means that knowing one sample's value doesn't provide information about another's.

Independence issues are common: Across many fields, including mech interp, datasets are often hierarchical, where different data points are related to each other in some way.

Ignoring independence can be disastrous: Due to interdependencies, random data can produce microscopic p-values or produce seemingly perfect classifiers via test-set contamination.

2.1. Data is often hierarchical

Hierarchical structures are ubiquitous in interesting datasets. In mech interp, you might look at different residual stream measurements from the same response, different responses from the same prompt, or different prompts from the same task category. In psychology and neuroscience, you will commonly have multiple brain measurements from the same trial, multiple trials from the same participant, and possibly even multiple participants from the same group.

Across these cases, if you ignore hierarchies and treat related measures as independent observations, you will wildly underestimate the extent of variability in the real world. You will think you are generalizing (achieving a tiny p-value or high cross-validation accuracy) when you are actually overfitting.

2.2. Hierarchical interdependencies are large

I’m not trying to convince you to adhere to statistical technicalities. Instead, I want to convey suggestions with major practical implications. If your findings seem way too strong or your p-value is unexpectedly tiny, that’s a strong sign that you are messing up statistical independence somewhere (or you have some artifact in your analysis).

In many neuroscientific studies, statistical effects due to hierarchical interdependencies will tend to be much larger than those from virtually any actual effect of interest.[4] For example, if I show two participants a picture of a cow and a picture of a dead body, then even emotion-processing regions’ brain responses will tend to display much stronger similarities within-subject than within-condition. That is, Alex’s neural response to a cow will be much more similar to Alex’s response to a dead body (think r = .50) than to Bailey’s response to a cow (think r = .02). Ignoring interdependencies can easily invalidate an analysis.

In mech interp, I doubt that effects will be as severe as in human brain data, but my hunch is that great caution regarding independence is still necessary if your data is hierarchical.[5] For both brain and mech-interp analyses, the widespread use of multivariate analytic techniques will magnify the effect of non-independence and create these issues; by “multivariate,” I just mean looking at multiple dimensions, like fitting a classifier based on multiple voxels in an fMRI scan or a probe on the many dimensions of a residual stream vector.

To illustrate this for mech interp, I extracted the residual stream activity of layer 10 at every token position for R1-Qwen-1.5B responding to 50 MMLU questions. The residual stream of neighboring tokens will tend to be highly similar. I performed 50-class classification, while treating each token’s residual-stream vector as an example and pooling every example into one big dataset with over 10,000 examples. Using 5-fold cross-validation, a simple logistic regression classifier can achieve over 90% accuracy (!) (chance accuracy would be just ~2%).[6] 

If I ignored these interdependencies, it could be painful. Suppose I wanted to test whether a model represents its uncertainty at each token position, so I fit a classifier based on the token-level residual stream and attempted to predict whether the model’s final response was correct or incorrect; this hypothetical test presumes that even when a model eventually gives an incorrect response, it would be somewhat aware it may be incorrect. If I followed the above procedures, using every token’s data as an example and performing basic 5-fold cross-validation, the classifier might just learn the 50 responses’ unique intricacies rather than generalizing at all.

One way to make sure classifier accuracy reflects generalization would be to instead use group-k-fold cross-validation. Note that while this yields a correct evaluation, it doesn’t change training: the classifier could still learn response-specific patterns, and including more data from the same response could very well lead to a worse classifier. I encountered a scenario like this recently, and because I only needed a quick-and-easy solution, I decided to simply take one data point from each response and discard the rest.
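
Here is a minimal toy sketch of the naive-versus-grouped contrast; the data-generating numbers are mine and just stand in for "many highly correlated tokens per response, with a label assigned at the response level":

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, GroupKFold, cross_val_score

rng = np.random.default_rng(0)

# Toy stand-in: 50 "responses", each contributing 200 highly correlated token vectors
# (a shared response-specific offset plus small noise), with a label fixed per response.
n_resp, tokens_per_resp, dim = 50, 200, 64
centers = rng.normal(0, 1, (n_resp, dim))
X = np.repeat(centers, tokens_per_resp, axis=0) + rng.normal(0, 0.3, (n_resp * tokens_per_resp, dim))
groups = np.repeat(np.arange(n_resp), tokens_per_resp)
y = groups % 2   # an arbitrary binary label assigned per response (no real signal)

clf = LogisticRegression(max_iter=2000)

# Naive k-fold: tokens from the same response land in both train and test folds,
# so the classifier can exploit response-specific quirks.
naive = cross_val_score(clf, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0))

# Group k-fold: every token from a response stays in one fold,
# so accuracy reflects generalization to unseen responses.
grouped = cross_val_score(clf, X, y, cv=GroupKFold(n_splits=5), groups=groups)

print(f"naive 5-fold accuracy: {naive.mean():.2f}")   # spuriously high
print(f"group 5-fold accuracy: {grouped.mean():.2f}")  # ~chance, since the label carries no signal
```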

2.3. Categories allow independence to be easily achieved

It’s often pretty straightforward to convert your hierarchical data into largely independent observations when dealing with categorical/binary independent variables. For example, you can average all of the measurements within each higher-level unit (e.g., within each response), separately for each category, so that every unit contributes a single observation per category, and then run a simple statistical test on those per-unit values (see the sketch below).
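
As a concrete toy sketch of that averaging strategy (the setup and numbers are invented for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Toy setup: 40 responses, each with many token-level measurements under
# condition A and condition B (e.g., a latent's activation on two token types).
n_resp = 40
a_means, b_means = [], []
for _ in range(n_resp):
    offset = rng.normal(0, 1)                    # response-level idiosyncrasy shared by all tokens
    a_tokens = offset + rng.normal(0.2, 1, 300)  # condition A tokens for this response
    b_tokens = offset + rng.normal(0.0, 1, 300)  # condition B tokens for this response
    a_means.append(a_tokens.mean())              # collapse to one number per response per condition
    b_means.append(b_tokens.mean())

# Each response now contributes a single, independent paired observation.
result = stats.ttest_rel(a_means, b_means)
print(f"paired t-test across responses: t = {result.statistic:.2f}, p = {result.pvalue:.3g}")
```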

2.4. With continuous variables, things are hairier

When working with continuous variables, handling hierarchical structures correctly becomes less obvious. Further, compared to the ubiquity of papers properly dealing with hierarchies in categorical variables (often just implicitly), examples of papers doing so with continuous variables are rarer.

In our paper mentioned above, we performed an analysis where we computed two matrices describing each chain of thought: (i) a matrix describing the causal relationship between sentences based on changes in token logits, and (ii) a matrix describing the causal relationship between sentences based on legible text. A response’s two matrices were the same size, with each matrix containing 1,000-40,000 elements. We hypothesized that the two matrices would be correlated, and we wanted to draw conclusions that aggregate data across our 20 responses.[7]

This data is hierarchical. Each response was associated with numerous data points. We could try a naive approach: pool the elements of every response’s matrices together and compute one big correlation across all >100k paired values.

This approach would be inappropriate, and the correlation would generate a microscopic p-value because it treats the data as having >100k independent degrees of freedom, drastically inflating the apparent strength of the statistical significance.

How might this strategy go wrong? Suppose that for 11/20 responses, the two matrices are indeed positively related, whereas for 9/20 responses, the matrices are negatively correlated. In this case, it is clear that finding an overall positive relationship could be spurious, as random chance could’ve easily flipped the scenario to 9 positive and 11 negative responses. However, when the elements are pooled to form a correlation, we treat this akin to having 110k points supporting a positive association while 90k points support a negative one, which would be decisive evidence favoring the positive link.[8]

We wanted to test the association between each response’s two matrices while respecting the data’s hierarchical properties during aggregation across responses. Hence, we simply computed the element-by-element correlation for each response and reported the average correlation along with the number of responses showing a positive correlation (18/20).[9]
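
Here is a minimal sketch of that aggregation pattern, with simulated stand-in matrices (the real ones come from our pipeline); the sign test at the end is my own addition for readers who want a p-value rather than just the count of positive responses:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Toy stand-in: 20 responses, each with two flattened matrices of a few thousand
# elements that are weakly positively related within a response.
pairs = []
for _ in range(20):
    m_logits = rng.normal(0, 1, 3000)
    m_text = 0.1 * m_logits + rng.normal(0, 1, 3000)
    pairs.append((m_logits, m_text))

# One correlation per response, so each response contributes a single observation.
per_response_r = [stats.spearmanr(a, b)[0] for a, b in pairs]

mean_r = float(np.mean(per_response_r))
n_pos = int(np.sum(np.array(per_response_r) > 0))
# A simple sign test across responses respects the hierarchical structure.
sign_p = stats.binomtest(n_pos, len(per_response_r), 0.5).pvalue

print(f"mean r = {mean_r:.2f}; positive in {n_pos}/{len(per_response_r)} responses; sign-test p = {sign_p:.3g}")
```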

In general, averaging data to achieve independent observations and performing simple statistical tests is a thoughtful and clear solution.

3. Effect sizes are only meaningful in context

An effect size is a measure of how strongly two variables are related that is ostensibly independent of sample size (e.g., a correlation coefficient or a Cohen’s d). When fitting probes, it is useful to also think of classification accuracy as an effect size.

The concept of an effect size underlies the phrase, “Statistical significance is not the same as practical significance.” From what I can tell, this quip became more popular alongside the rise of the scientific replication crisis, which generally led to greater statistical skepticism among academics and tuned-in laypeople. This quip is generally used to encourage dismissing results with small effect sizes.

This view is sometimes productive. Everything is Correlated once you have enough data, and, compared to the social sciences, where participant recruitment can be costly, achieving large sample sizes for mech interp research can be easy. In turn, if you are interested in evaluating the associations between different measures, you may find plenty of tiny effects. You might test whether the strength of one SAE latent at a given token position correlates with the strength of a different SAE latent 32 tokens later, and find a statistically significant but small association (e.g., a mean correlation of r = .1). This type of weak link might not be too informative.

3.1. Contextualizing the quip

However, it is important to consider the context of an effect size and what other factors are at play that are tangential to the mechanisms you are interested in.

Imagine a study where researchers want to assess whether elementary students fed a larger lunch tend to perform better in standardized exams. The researchers look at 40 schools, each containing 100 students. Every student at a given school receives the same lunch, so the researchers randomly assign each school to a different condition (e.g., normal lunch, +100 calorie lunch, +200 calorie lunch).

To analyze the link between extra calories and exam performance, the researchers may measure a correlation on a student-by-student basis (i.e., 4,000 data points). This would likely yield a minuscule correlation (e.g., r = .05). Alternatively, the researchers might average exam performance by school and compute the correlation on a school-by-school basis (i.e., 40 data points), which would yield a stronger correlation (e.g., r = .20). This new correlation is larger, but for reasons orthogonal to the nutrition-performance relationship. A correlation could alternatively be buffed if (i) the exam were changed to a more reliable measure of performance by adding more questions, or (ii) the underlying manipulation were magnified to +150 and +300 calorie lunches.
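
A quick simulation makes the aggregation effect concrete (all numbers here are invented for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

n_schools, n_students = 40, 100
extra_calories = rng.choice([0, 100, 200], size=n_schools)   # condition assigned per school

# Invented data-generating story: a small true calorie effect, plus school-level
# and (much larger) student-level variation unrelated to lunch.
scores = (0.01 * extra_calories[:, None]                 # true effect of the manipulation
          + rng.normal(0, 3, (n_schools, 1))             # school-level noise
          + rng.normal(0, 10, (n_schools, n_students)))  # student-level noise

student_r = stats.pearsonr(np.repeat(extra_calories, n_students), scores.ravel())[0]
school_r = stats.pearsonr(extra_calories, scores.mean(axis=1))[0]

print(f"student-level r (4,000 points): {student_r:.2f}")  # small, swamped by student noise
print(f"school-level r (40 points):     {school_r:.2f}")   # typically larger, since averaging removes that noise
```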

In studies less attached to real-world outcomes, these factors make it very difficult to draw claims about “practical” significance based on an effect size. For instance, one psychology study might show a correlation of r = -.20 between participant age and performance on a memory test, while another study shows a correlation of r = -.30 between age and performance on a decision-making test. Laboratory effect sizes will say little about the degree to which older adults should expect their memory or decision-making to deteriorate. Meaningfully contrasting the effect sizes across different studies would be difficult, let alone evaluating the extent to which older adults should actively worry about this in their day-to-day.

3.2. Back to mech interp

You may discover yourself in a similar mech interp scenario. You may find stronger correlations when using SAE latents rather than focusing on the activations of neurons one at a time. Likewise, you may find stronger links when using SAE latent data pooled across many tokens rather than SAE latents at singular tokens. These same principles apply when fitting classifiers. A probe that achieves moderate accuracy when classifying data based on a single token or sentence may be more impressive than a classifier operating on data aggregated across a whole response.

It’s hard to come up with a clean takeaway message here. Think and work to build intuition about data so that you can better assess how impressive different findings are.

4. Perfectly controlling confounds is hard

Sometimes, you’ll be interested in two variables’ relationship but a third non-targeted variable may be linked to both and contribute to their association in an undesired way. This concept is usually thought of in terms of correlations and regressions. However, it is also worth considering confounds when drawing conclusions from classifiers or when searching for effects so huge that statistics are seemingly unnecessary.

For mech interp, confounds are important to think about when looking at associations between different patterns within a model’s internals, as confounds can be very large. Token position and token proximity are two realistic examples: early and late tokens differ systematically on many measures, and neighboring tokens have highly similar residual streams. For instance, if you compare pairs of tokens on which the same SAE latent fires against random token pairs, proximity acts as a confound, because same-latent token pairs will often sit closer together than random pairs.

4.1. Challenging confounds

When controlling for major confounds, you should think about reliability and the relationship structure (e.g., linearity). This is particularly important when a confound’s impact is large.

For the purposes of controlling a confound, you can think of reliability as the extent to which a given measure captures an underlying construct. You may be conducting a study where you want to control for participant intelligence. To do this, it is common to use a participant’s educational attainment (years of school completed) as a proxy for intelligence. However, the measure is not the construct. If educational attainment explains 25% of the variability in intelligence (r = .5, R² = .25), then when you control for education, you’ll still leave a majority of intelligence’s effects intact.

For mech interp, your measures may be highly reliable, but controlling for their relationship with other variables may be difficult. The typical strategy for controlling a confound is to include it as a variable in a linear regression. However, this assumes the confound’s relationship with the other variables is linear. Yet, most relationships are not linear, and the position and proximity effects mentioned above are certainly not. If you simply control for them with a linear term in a regression, you will fail to actually scrub their effects out of the data.[10][11]
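
To make this concrete, here is a toy sketch (all numbers invented) where two measures are related only through a shared, non-linear proximity effect; linearly regressing out token distance leaves most of the spurious correlation in place:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

# Hypothetical confound: token distance, which affects both measures nonlinearly
# (a sharp proximity effect that decays quickly).
n = 5000
distance = rng.integers(1, 200, n)
proximity_effect = np.exp(-distance / 10.0)

x = proximity_effect + rng.normal(0, 0.1, n)   # e.g., co-activation strength of latent A
y = proximity_effect + rng.normal(0, 0.1, n)   # e.g., co-activation strength of latent B

def residualize_linearly(v, confound):
    """Remove only the *linear* component of the confound from v."""
    slope, intercept = np.polyfit(confound, v, 1)
    return v - (slope * confound + intercept)

x_res = residualize_linearly(x, distance)
y_res = residualize_linearly(y, distance)

print(f"raw correlation:                   r = {stats.pearsonr(x, y)[0]:.2f}")
print(f"after linear control for distance: r = {stats.pearsonr(x_res, y_res)[0]:.2f}")
# The residual correlation stays well above zero: the linear adjustment
# fails to scrub out the nonlinear proximity confound.
```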

4.2. Bright possibilities

Confounds being hard to perfectly control doesn’t mean everything is doomed, nor do you always need to implement obscure methods for controlling confounds. If a quick, imperfect control (e.g., adding the confound as a linear term) barely changes your relationship of interest, that is reassuring evidence that the confound is not what’s driving your result.

Conversely, if your lazy control leads to your relationship of interest dropping in size by, say, 33% or 50%, then that is a strong sign that your relationship might truly be entirely confounded, and a more rigorous method could find that. In that case, good luck… maybe try to reorganize your experiment or hypothesis to avoid the confound? Alternatively, you can maybe get an LLM to help you find a statistical approach that is more thorough, although you might always live in fear that you aren’t fully scrubbing out a confound.

5. Simplicity is valuable, and you can usually ignore statistical tests’ assumptions

Most statistical tests come with many assumptions. Pearson correlations technically assume: (1) continuous variables, (2) linear relationships, (3) bivariate normality, meaning that the distribution forms an elliptical cloud, (4) homoscedasticity, meaning that the variance of each variable is stable as the other variable changes, and (5) no extreme outliers. Evaluation of statistical significance more generally assumes (6) independence among observations.

In practice, you can violate most of these without practical consequences, except for the importance of independence (discussed above) and considering outliers (which can entirely determine your result). Aside from these two, addressing violations with more obscure methods will generally just lead to your analyses becoming harder to read.[12] 

These claims are not just me being a heterodox cowboy! The general claims have been  Peer Reviewed™.

5.1. Readability matters

Your goal should be to communicate something to an audience. Weird statistics can be an unnecessary obstacle, decreasing accessibility and reading speed. Machine learning researchers, SWEs, and others might have strong technical and mathematical skills but not necessarily broad familiarity with statistics.

If you’re going to be complex, being complex in an expected way (using a method that readers or LLMs can look into) is better than being complex in an original way.

5.2. Slight complexity that I still encourage

One easy way to make sure that a (Pearson) correlation result doesn’t come down to an outlier is to simply also test a Spearman rank correlation. A Spearman correlation simply involves transforming every value into a ranking, separately for each of your two variables, and then measuring the Pearson correlation with respect to the rankings. For analysis, I usually test Spearman correlations by default. However, when communicating results, I will generally report a normal Pearson correlation for simplicity if its value is close to the Spearman one, given the slight clarity advantages.
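
For example (toy data), a single extreme point can manufacture a large Pearson correlation while the Spearman correlation stays near zero:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

# 49 unrelated points plus one extreme outlier.
x = rng.normal(0, 1, 50)
y = rng.normal(0, 1, 50)
x[0], y[0] = 10.0, 10.0   # a single extreme point

print(f"Pearson:  r = {stats.pearsonr(x, y)[0]:.2f}")    # large, driven entirely by the outlier
print(f"Spearman: r = {stats.spearmanr(x, y)[0]:.2f}")   # near zero, since ranks tame the outlier
```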

If you have a weird hunch from your data that you should be afraid of something (e.g., non-normality is important), then you definitely should explore the data more to get rid of that fear! This is good even if you don’t eventually end up reporting complex analyses in your results. Stuff complex statistics in an appendix if you feel they’re important.

If you’re doing weird analyses, then you should probably learn about permutation tests…

6. Complexity breeds statistical demons, but careful permutation testing can slay them

Lots of papers introduce elaborate boutique analyses to test novel hypotheses. Generally, simplicity is better, but this is not always possible.

It is not too hard to develop analyses that, at the surface level, appear to generate findings supporting a hypothesis while, in reality, yielding pure mathematical artifacts.

In fMRI neuroscience, novel analyses might mix correlations, transformations, normalizations, classifications, and aggregations in new ways. Each new operation adds to the possibility of mathematical artifacts and makes them harder to spot. It’s hard to come up with a concise mech interp example of this, but here is an attempt to generalize one from neuroscience:

Suppose you gather residual stream activations for a set of physics questions and a set of chemistry questions (say, 40 physics and 60 chemistry). Within each subject, you z-score each residual-stream dimension across that subject’s examples, compute the pairwise similarity between examples, and then test whether examples of one subject are more similar to each other than examples of the other subject are.

Can you spot why this opens the door to artifacts? Answer in footnote.[13] 

In many cases, it will be infeasible for readers or even technically competent and careful reviewers to spot these issues. In some cases, even you might not have spotted them. When you’re fitting classifiers, you need to be wary of exotic ways for test set contamination to happen.

Permutation testing is the gold standard for checking whether your weird analysis has a problem. This applies when either evaluating statistical significance or fitting a classifier.

Permutation testing involves essentially shuffling your dependent variable early on and then seeing what results would happen in that case. For the physics and chemistry example, you would want to: (i) take all of the physics/chemistry labels and assign them to examples randomly, (ii) redo the analysis, (iii) repeatedly do this >100 times to get a sense for what your result distribution would be like if your dependent variable was random. For example, if your original finding only corresponds to the top 20% in the distribution, your p-value is only p = .20.[14] 
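
Here is a minimal generic sketch of this recipe; the analysis function and labels are placeholders for whatever your actual pipeline is:

```python
import numpy as np

def permutation_p_value(statistic_fn, labels, n_perms=1000, seed=0):
    """Shuffle the labels, rerun the whole analysis each time, and report how often
    a shuffled run produces a result at least as strong as the observed one."""
    rng = np.random.default_rng(seed)
    observed = statistic_fn(labels)
    null = np.array([statistic_fn(rng.permutation(labels)) for _ in range(n_perms)])
    p = (np.sum(null >= observed) + 1) / (n_perms + 1)   # standard permutation p-value
    return observed, p

# Hypothetical usage: `labels` marks each example as physics or chemistry, and
# `my_weird_analysis(labels)` reruns the full pipeline (z-scoring, similarities,
# classifier fitting, etc.) and returns the headline statistic.
# observed, p = permutation_p_value(my_weird_analysis, labels)
```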

Permutation testing is really nice for sanity checking, and it is useful more broadly: for example, it can put an honest p-value on classifier accuracy or on a bespoke statistic that has no off-the-shelf test.

7. Conclusion

That’s all I’ve got. Feel free to read my (hopefully friendly) opening poem again for a concluding summary.

  1. ^

     Mentioning frequentism, some readers may react: Wait, aren’t Bayesian methods just better? I say: No! Unnecessary Bayesian analyses will damage clarity, and in practice, many papers end up just treating Bayesian methods like frequentist ones anyhow (e.g., using Bayes Factor thresholds of >10 for "weak" or >100 for "strong" evidence). There are cases where Bayesian techniques are helpful, but from what I can tell, they are a small minority, and frequentist statistics are great for communicating self-contained research outputs.

  2. ^

     These “>50%” and “80%” percentages refer to “statistical power”. Statistical power is the likelihood of finding evidence significantly supporting your hypothesis, given your hypothesized effect size, your collected sample size, and your significance threshold. For example, if I hypothesize that two variables are correlated at r = .3 and I collect 40 data points, then I have an X% chance of finding a significant (p < .05) correlation (i.e., I have X% power). Power calculations are only as trustworthy as the input variables, but social science studies are generally recommended to have at least 80% power. This footnote is just meant to be fun facts, and you almost certainly don’t need to worry about formal calculations of power for mech interp work, although you may find the concept useful to think about informally. Of note, if your final p-value is .01 < p < .05, that means your experiment was likely underpowered and you got lucky in finding an effect… so you should probably go collect more data.

  3. ^

     It is generally considered a “questionable research practice” to test your hypothesis after each data point has been collected and then cease collection once you’ve hit statistical significance. It is true that if your significance threshold is p < .05 for a one-sample t-test, you are willing to collect up to 100 data points, and you test for significance after each point, then you will achieve p < .05 about 12% of the time rather than the desired 5% of the time. However, if you set the significance threshold lower, then you can maintain a suitable false positive rate. For example, if you repeatedly test your hypothesis and keep collecting data until you reach p < .01, then over 100 data points your overall false positive rate will remain only about 5%, and thus you’ve effectively achieved p < .05. Some papers have talked about the math behind this. Repeated hypothesis testing in combination with an optimally stricter threshold (e.g., p < .01) requires fewer participants on average to find a significant effect than simply collecting a predetermined sample size and only testing the hypothesis once at p < .05. Much like the prior footnote, this info should be treated as mostly fun facts, and hopefully it just helps you build intuition. Main takeaway: Don’t feel bad about repeatedly testing a hypothesis after each data point has been collected, so long as you use a strict threshold.

  4. ^

     That is, the proportion of variance attributable to between-subject differences will be far greater than the variance attributable to within-subject differences. The effective sample size - the N for which the central limit theorem applies - thus predominantly depends on the number of subjects.

  5. ^

     This is less of a problem if you have a massive sample size (e.g., you have a hundred observations per group but millions of groups).

  6. ^

     This number is so high that I’m even a bit scared I may have an error somewhere… a random forest classifier hits 65% accuracy, which feels more believable and still conveys the same point.

  7. ^

     These matrices were computationally expensive to produce or else we would’ve had much more data.

  8. ^

     Another downside to this mass-pooling strategy is that it may lead to finding effects based on across-response differences, which will usually be less interesting. That is, the different responses may vary in the average value of the two matrices (i.e., uniform shifts upward/downward), and this would drive the mass-pooling correlation. Even if you were interested in this, you would be better off simply averaging the data by response and then computing the correlation among the twenty averages.

  9. ^

     The technically most proper way to do this is with a multilevel linear regression. Multilevel regressions are becoming somewhat common in psychology and neuroscience research. However, you probably don’t want to do this for mech interp research because most readers will not have heard of them. The strategy we describe here is technically inappropriate for some statistical reasons, but it should be acceptable in practice.

  10. ^

     Lots of times, there are intuitive non-linear tricks you can do. For instance, in the example above where you are comparing same-SAE-latent token pairs vs. random token pairs, you can do a paired analysis where, for each token pair in the former group, you select one random token pair that is the same distance apart.

  11. ^

     Dynomight has a post touching on related issues: “Why controlling for a variable doesn't (usually) work”... although see this Contra. Both posts are consistent with the message here about how this can be a particularly big deal when confounds have massive impacts rather than being subtle.

  12. ^

     If a p-value is p = .04 vs. p = .06 based on slight variations in how your correlation or t-test is performed, then your evidence is weak anyhow and you should go collect more data or treat the finding as fake.

  13. ^

     The z-scoring will cause the similarity measure to be shifted downwards - e.g., if each token’s residual stream data was simulated randomly, then the mean similarity between tokens would be below zero. This is because one token’s vector being high in one dimension now forces other tokens’ vectors to be lower in that dimension. Furthermore, this effect will differentially impact the physics or chemistry similarity measurements depending on how many examples you have in each class (e.g., 40 physics problems but 60 chemistry ones). At least in neuroscience, this type of artifact can easily be larger than any hypothesized effect.
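
     A quick simulation of this shift (my own toy numbers): purely random data, z-scored within each group before computing example-by-example correlations, yields mean similarities below zero, by amounts that depend on group size.

```python
import numpy as np

rng = np.random.default_rng(8)

def mean_within_group_similarity(n_examples, dim=64):
    """Random data, z-scored per dimension across the group's examples;
    returns the mean correlation between different examples."""
    X = rng.normal(0, 1, (n_examples, dim))
    Xz = (X - X.mean(axis=0)) / X.std(axis=0)   # z-score each dimension across examples
    corr = np.corrcoef(Xz)                      # example-by-example similarity matrix
    return corr[~np.eye(n_examples, dtype=bool)].mean()

# Pure noise, yet the mean similarity is pushed below zero, and by different
# amounts depending on group size (roughly -1/(n - 1)).
print(f"group of 40 examples: {mean_within_group_similarity(40):.3f}")   # ~ -0.026
print(f"group of 60 examples: {mean_within_group_similarity(60):.3f}")   # ~ -0.017
```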

  14. ^

     For hierarchical data, permutation testing can be done while maintaining the hierarchical structure (e.g., only shuffling data within a given hierarchy group), which is useful for evaluating whether you are somehow violating independence assumptions.


