Published on August 6, 2025 12:45 PM GMT
I am currently a MATS 8.0 scholar studying mechanistic interpretability with Neel Nanda. I’m also a postdoc in psychology/neuroscience. Perhaps my most notable paper analyzed the last 20 years of psychology research, searching for trends in what papers do and do not replicate. I have some takes on statistics.
tl;dr
Small p-values are nice.
Unless they're suspiciously small.
Statistical assumptions can be bent.
Except for independence.
Practical significance beats statistical significance.
Although practicality depends on context.
The measure is not the confound.
Sometimes it’s close enough.
Readability often beats rigor.
But fearing rigor means you probably need it.
Simple is better than complex.
Complex is better than wrong.
Complex wrongs are the worst.
But permutation tests can help reveal them.
This post offers advice for frequentist and classifier-based analysis. I try to focus on points that are practical, diverse, and non-obvious. I emphasize relevancy for mechanistic interpretability research, but this post will hopefully be interesting for people working in any area, including non-researchers who just sometimes read scientific papers.[1]
1. If you want to make a claim based on p-values, don’t settle for p = .02
A p-value describes the probability that random data (i.e., data generated under the null hypothesis) would produce an effect as strong as or stronger than the one observed. P-values are the cornerstone of null hypothesis significance testing, where researchers test a claim by attempting to rule out a null hypothesis, which is typically the assumption that there exists no effect or difference between groups.
1.1. Some history and culture
When frequentist statistics were first developed at the turn of the 20th century, there were competing mindsets: Fisher proposed that p-values should be interpreted on a continuous scale, where smaller p-values correspond to heightened evidence. Neyman and Pearson would later argue for a more rigid decision-theoretic framework, where p-values are compared to a predetermined significance level to make binary accept/reject decisions.
Social science papers today usually adopt the binary approach, treating any finding at p < .05 as “significant.” Although researchers and laypeople may maintain suspicion of, say, p = .049, people will commonly endorse claims like p < .001 being no better than p < .05 if both cross a predetermined significance threshold. I suspect that this thinking stems from: (i) a desire to cleverly believe intuitive things and (ii) vague knowledge that frequentist tests don’t quantify evidence the same way that Bayesian tests do. However, this mindset is wrong when conducting research or just reading papers.
1.2. For all practical purposes…
Strong p-values provide greater evidence, while weak p-values often indicate p-hacking and non-replicability. Among replication attempts in psychology, studies where the main effects were p ≤ .005 replicated 74% of the time, while those with .005 < p < .05 replicated at just 28%. Looking at papers reporting many statistics, ones where more than a third of significant p-values were p > .01 replicated just 33% of the time, whereas papers where fewer than a third of significant p-values were p > .01 replicated 62% of the time.
The easiest way to predict whether a paper will replicate is by examining its p-values.
You may know to be concerned about p = .049, but p = .02 is also problematic! In a non-p-hacked study with an adequate sample size, you should only expect around a quarter of significant p-values to fall in the range .01 < p < .05. If you’re reading a paper where a majority of p-values are like that, then you should probably treat it as, at best, weak evidence for whatever they are claiming (or more practically, you should avoid spending any further time reading).
True effects produce small p-values. A sentiment among some researchers is that achieving significance at .01 < p < .05 means a study efficiently collected just enough data to cross the threshold of significance. This is wrong. If you are conducting a test that has a >50% chance to yield a statistically significant (p < .05) finding supporting your hypothesis, then a majority of those significant p-values should be p < .01. If you are conducting a study that has an 80% chance of yielding a statistically significant finding supporting your hypothesis, then almost three-quarters of significant p-values will be p < .01. Furthermore, two-fifths should be at p < .001 (!); really strong p-values should be commonplace rather than rare.[2]
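If you want to see this pattern for yourself, here is a minimal simulation sketch. The per-group sample size and effect size below are invented choices that give roughly 80% power for a two-sample t-test; nothing here comes from a real study.

```python
# Minimal simulation: at roughly 80% power, most significant p-values should be
# well below .01 (sample size and effect size here are invented, not from any study).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_per_group, true_d = 26, 0.8   # ~80% power for a two-sided two-sample t-test at alpha = .05
pvals = np.array([
    stats.ttest_ind(rng.normal(0.0, 1.0, n_per_group),
                    rng.normal(true_d, 1.0, n_per_group)).pvalue
    for _ in range(10_000)
])

sig = pvals[pvals < .05]
print(f"power ~ {np.mean(pvals < .05):.2f}")
print(f"share of significant p-values below .01:  {np.mean(sig < .01):.2f}")
print(f"share of significant p-values below .001: {np.mean(sig < .001):.2f}")
```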
This post is, in large part, about mechanistic interpretability (mech interp), and there is little impetus to p-hack in mech interp! If you have a p-value of .01 < p < .05, just go and collect more data. Data collection is usually cheap, so go pursue p < .001.[3]
2. Independence among observations is critical
Statistical independence is foundational when evaluating significance tests or classifier (probe) accuracy. Independence means that knowing one sample's value doesn't provide information about another's.
Independence issues are common: Across many fields, including mech interp, datasets are often hierarchical, where different data points are related to each other in some way.
Ignoring independence can be disastrous: Due to interdependencies, random data can produce microscopic p-values or produce seemingly perfect classifiers via test-set contamination.
2.1. Data is often hierarchical
Hierarchical structures are ubiquitous in interesting datasets. In mech interp, you might look at different residual stream measurements from the same response, different responses from the same prompt, or different prompts from the same task category. In psychology and neuroscience, you will commonly have multiple brain measurements from the same trial, multiple trials from the same participant, and possibly even multiple participants from the same group.
Across these cases, if you ignore hierarchies and treat related measures as independent observations, you will wildly underestimate the extent of variability in the real world. You will think you are generalizing (achieving a tiny p-value or high cross-validation accuracy) when you are actually overfitting.
2.2. Hierarchical interdependencies are large
I’m not trying to convince you to adhere to statistical technicalities. Instead, I want to convey suggestions with major practical implications. If your findings seem way too strong or your p-value is unexpectedly tiny, that’s a strong sign that you are messing up statistical independence somewhere (or you have some artifact in your analysis).
In many neuroscientific studies, statistical effects due to hierarchical interdependencies will tend to be much larger than those from virtually any actual effect of interest.[4] For example, if I show two participants a picture of a cow and a picture of a dead body, then even emotion-processing regions’ brain responses will tend to display much stronger similarities within-subject than within-condition. That is, Alex’s neural response to a cow will be much more similar to Alex’s response to a dead body (think r = .50) than to Bailey’s response to a cow (think r = .02). Ignoring interdependencies can easily invalidate an analysis.
In mech interp, I doubt that effects will be as severe as in human brain data, but my hunch is that great caution regarding independence is still necessary if your data is hierarchical.[5] For both brain and mech-interp analyses, the widespread use of multivariate analytic techniques will magnify the effect of non-independence and create these issues; by “multivariate,” I just mean looking at multiple dimensions, like fitting a classifier based on multiple voxels in an fMRI scan or a probe on the many dimensions of a residual stream vector.
To illustrate this for mech interp, I extracted the residual stream activity of layer 10 at every token position for R1-Qwen-1.5B responding to 50 MMLU questions. The residual stream of neighboring tokens will tend to be highly similar. I performed 50-class classification (predicting which question each token came from), treating each token’s residual-stream vector as an example and pooling every example into one big dataset with over 10,000 examples. Using 5-fold cross-validation, a simple logistic regression classifier can achieve over 90% accuracy (!) (chance accuracy would be just ~2%).[6]
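You can reproduce this phenomenon without any model at all. Here is a synthetic sketch (all data randomly generated, sizes arbitrary): tokens within a “response” share a response-specific component, and a pooled 5-fold classifier identifies which response a token came from far above chance.

```python
# Synthetic sketch: tokens from the same response share a response-specific component,
# so a pooled 5-fold classifier can identify a token's response far above chance.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
n_responses, tokens_per_response, dim = 50, 100, 64
response_means = rng.normal(0, 1, (n_responses, dim))               # shared within-response structure
X = (np.repeat(response_means, tokens_per_response, axis=0)
     + rng.normal(0, 1, (n_responses * tokens_per_response, dim)))  # token = response mean + noise
y = np.repeat(np.arange(n_responses), tokens_per_response)          # label = response identity

acc = cross_val_score(LogisticRegression(max_iter=2000), X, y,
                      cv=KFold(n_splits=5, shuffle=True, random_state=0)).mean()
print(f"pooled 5-fold accuracy: {acc:.2f} (chance = {1 / n_responses:.2f})")
```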
If I ignored these interdependencies, it could be painful. Suppose I wanted to test whether a model represents its uncertainty at each token position, so I fit a classifier based on the token-level residual stream and attempted to predict whether the model’s final response was correct or incorrect; this hypothetical test presumes that even when a model eventually gives an incorrect response, it would be somewhat aware it may be incorrect. If I followed the above procedures, using every token’s data as an example and performing basic 5-fold cross-validation, the classifier might just learn the 50 responses’ unique intricacies rather than generalizing at all.
One way to make sure classifier accuracy reflects generalization would be to instead use group-k-fold cross-validation. Note that while this makes the evaluation correct, it doesn’t change training: the classifier could still learn response-specific patterns, and including more data from the same response could very well lead to a worse classifier. I encountered a scenario like this recently, and because I only needed a quick-and-easy solution, I decided to simply take one data point from each response and discard the rest.
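Here is a minimal sketch of the contrast, again with synthetic data (sizes arbitrary): a label assigned randomly per response carries no real signal, yet naive k-fold “finds” it anyway, while group-k-fold does not.

```python
# Synthetic contrast: a label assigned randomly at the response level has no real
# signal, yet naive KFold reports above-chance accuracy because tokens from the same
# response leak across folds. GroupKFold (splitting by response) removes the leak.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, KFold, cross_val_score

rng = np.random.default_rng(0)
n_responses, tokens_per_response, dim = 50, 100, 64
response_means = rng.normal(0, 1, (n_responses, dim))
X = (np.repeat(response_means, tokens_per_response, axis=0)
     + rng.normal(0, 1, (n_responses * tokens_per_response, dim)))
groups = np.repeat(np.arange(n_responses), tokens_per_response)      # which response each token belongs to
y = np.repeat(rng.integers(0, 2, n_responses), tokens_per_response)  # random label, constant within a response

clf = LogisticRegression(max_iter=2000)
naive = cross_val_score(clf, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0)).mean()
grouped = cross_val_score(clf, X, y, groups=groups, cv=GroupKFold(n_splits=5)).mean()
print(f"naive KFold accuracy: {naive:.2f} (spuriously above 0.5)")
print(f"GroupKFold accuracy:  {grouped:.2f} (roughly chance)")
```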
2.3. Categories allow independence to be easily achieved
It’s often pretty straightforward to convert your hierarchical data into largely independent observations when dealing with categorical/binary independent variables:
- In a recent paper, we assigned sentences from a model’s chain-of-thought to different categories based on their text content (e.g., planning sentences or active-computation sentences). We linked these categories to internal mechanistic measures (how much attention each sentence received), and we wanted to contrast this measure across different sentences. Because each chain-of-thought may include multiple sentences from the same category, to ensure that we were dealing with independent observations, we simply averaged the category data by response: If a response had 5 planning sentences, we produced a single mean planning measure for that response (a minimal sketch of this appears below).
- In other fields, you’ll likewise often find researchers doing similar types of averaging, like in psychology, averaging a participant’s response across all their trials.
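For concreteness, here is a minimal pandas sketch of that per-response averaging. The column names and values are made up for illustration.

```python
# Minimal sketch of per-response averaging (column names and values are made up).
import pandas as pd

df = pd.DataFrame({
    "response_id": [0, 0, 0, 1, 1, 2],
    "category":    ["planning", "planning", "compute", "planning", "compute", "planning"],
    "attention":   [0.12, 0.18, 0.40, 0.22, 0.35, 0.15],
})

# Collapse to one value per (response, category), so each response contributes a
# single observation per category to any downstream test.
per_response = df.groupby(["response_id", "category"], as_index=False)["attention"].mean()
print(per_response)
```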
2.4. With continuous variables, things are hairier
When working with continuous variables, handling hierarchical structures correctly becomes less obvious. Further, compared to the ubiquity of papers properly dealing with hierarchies in categorical variables (often just implicitly), examples of papers doing so with continuous variables are rarer.
In our paper mentioned above, we performed an analysis where we computed two matrices describing each chain of thought: (i) a matrix describing the causal relationship between sentences based on changes in token logits, and (ii) a matrix describing the causal relationship between sentences based on legible text. A response’s two matrices were the same size, with each matrix containing 1,000-40,000 elements. We hypothesized that the two matrices would be correlated, and we wanted to draw conclusions that aggregate data across our 20 responses.[7]
This data is hierarchical. Each response was associated with numerous data points. We could try a naive approach:
- Flatten the bottom triangle of each matrix,
- Concatenate the resulting vectors across all responses,
- Produce a single dataset containing >100k data points in two columns, one for each measure,
- Finally, correlate the two columns.
This approach would be inappropriate, and the correlation would generate a microscopic p-value because it treats the data as having >100k independent degrees of freedom, drastically inflating the apparent strength of the statistical significance.
How might this strategy go wrong? Suppose that for 11/20 responses, the two matrices are indeed positively related, whereas for 9/20 responses, the matrices are negatively correlated. In this case, it is clear that finding an overall positive relationship could be spurious, as random chance could’ve easily flipped the scenario to 9 positive and 11 negative responses. However, when the elements are pooled to form a correlation, we treat this akin to having 110k points supporting a positive association while 90k points support a negative one, which would be decisive evidence favoring the positive link.[8]
We wanted to test the association between each response’s two matrices while respecting the data’s hierarchical properties during aggregation across responses. Hence, we simply computed the element-by-element correlation for each response and reported the average correlation along with the number of responses showing a positive correlation (18/20).[9]
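Here is a small sketch of that per-response approach. This is not our actual analysis code: the matrices below are random stand-ins, and the noisy-copy relationship is only there so that the correlations aren’t all zero.

```python
# Sketch of the per-response approach (random stand-in data, not our actual analysis):
# correlate the two flattened lower triangles within each response, then summarize
# the per-response correlations across responses.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
per_response_r = []
for _ in range(20):                       # 20 responses
    n = rng.integers(40, 120)             # number of sentences in this response (arbitrary)
    logit_matrix = rng.normal(size=(n, n))                 # stand-in for the logit-based matrix
    text_matrix = logit_matrix + rng.normal(size=(n, n))   # noisy copy, so the two are related
    lower = np.tril_indices(n, k=-1)                       # lower triangle, excluding the diagonal
    r, _ = stats.pearsonr(logit_matrix[lower], text_matrix[lower])
    per_response_r.append(r)

per_response_r = np.array(per_response_r)
print(f"mean r = {per_response_r.mean():.2f}; "
      f"{np.sum(per_response_r > 0)}/20 responses positive")
# One summary number per response means a simple test across responses is valid,
# e.g., stats.wilcoxon(per_response_r) or a sign test on the positive count.
```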
In general, averaging data to achieve independent observations and performing simple statistical tests is a thoughtful and clear solution.
3. Effect sizes are only meaningful in context
An effect size is a measure of how strongly two variables are related that is ostensibly independent of sample size (e.g., a correlation coefficient or a Cohen’s d). When fitting probes, it is useful to also think of classification accuracy as an effect size.
The concept of an effect size underlies the phrase, “Statistical significance is not the same as practical significance.” From what I can tell, this quip became more popular alongside the rise of the scientific replication crisis, which generally led to greater statistical skepticism among academics and tuned-in laypeople. This quip is generally used to encourage dismissing results with small effect sizes.
This view is sometimes productive. Everything is Correlated once you have enough data, and, compared to the social sciences, where participant recruitment can be costly, achieving large sample sizes for mech interp research can be easy. In turn, if you are interested in evaluating the associations between different measures, you may find plenty of tiny effects. You might test whether the strength of one SAE latent at a given token position correlates with the strength of a different SAE latent 32 tokens later, and find a statistically significant but small association (e.g., a mean correlation of r = .1). This type of weak link might not be too informative.
3.1. Contextualizing the quip
However, it is important to consider the context of an effect size and what other factors are at play that are tangential to the mechanisms you are interested in.
Imagine a study where researchers want to assess whether elementary students fed a larger lunch tend to perform better in standardized exams. The researchers look at 40 schools, each containing 100 students. Every student at a given school receives the same lunch, so the researchers randomly assign each school to a different condition (e.g., normal lunch, +100 calorie lunch, +200 calorie lunch).
To analyze the link between extra calories and exam performance, the researchers may measure a correlation on a student-by-student basis (i.e., 4,000 data points). This would likely yield a minuscule correlation (e.g., r = .05). Alternatively, the researchers might average exam performance by school and compute the correlation on a school-by-school basis (i.e., 40 data points), which would yield a stronger correlation (e.g., r = .20). This new correlation is larger, but for reasons orthogonal to the nutrition-performance relationship. A correlation could alternatively be buffed if (i) the exam is changed to a more reliable measure of performance by adding more questions, or (ii) the underlying manipulation is magnified to +150 and +300 calorie lunches.
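A quick simulation makes the aggregation point concrete. Every number below is invented, and the school-level intercept stands in for school quality differences that have nothing to do with lunch.

```python
# Quick simulation (all numbers invented): the same calorie effect produces a smaller
# correlation student-by-student than school-by-school, because school-level averaging
# washes out student-level noise that has nothing to do with nutrition.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_schools, students_per_school = 40, 100
calories = rng.choice([0, 100, 200], size=n_schools)              # condition assigned per school
school_quality = rng.normal(0, 0.5, n_schools)                    # school-level variation unrelated to lunch
true_effect = 0.001 * calories                                    # small true nutrition effect
student_noise = rng.normal(0, 2.0, n_schools * students_per_school)
scores = np.repeat(true_effect + school_quality, students_per_school) + student_noise

student_r, _ = stats.pearsonr(np.repeat(calories, students_per_school), scores)
school_means = scores.reshape(n_schools, students_per_school).mean(axis=1)
school_r, _ = stats.pearsonr(calories, school_means)
print(f"student-level r = {student_r:.2f}   school-level r = {school_r:.2f}")
```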
In studies less attached to real-world outcomes, these factors make it very difficult to draw claims about “practical” significance based on an effect size. For instance, one psychology study might show a correlation of r = -.20 between participant age and performance on a memory test, while another study shows a correlation of r = -.30 between age and performance on a decision-making test. Laboratory effect sizes will say little about the degree to which older adults should expect their memory or decision-making to deteriorate. Meaningfully contrasting the effect sizes across different studies would be difficult, let alone evaluating the extent to which older adults should actively worry about this in their day-to-day.
3.2. Back to mech interp
You may discover yourself in a similar mech interp scenario. You may find stronger correlations when using SAE latents rather than focusing on the activations of neurons one at a time. Likewise, you may find stronger links when using SAE latent data pooled across many tokens rather than SAE latents at singular tokens. These same principles apply when fitting classifiers. A probe that achieves moderate accuracy when classifying data based on a single token or sentence may be more impressive than a classifier operating on data aggregated across a whole response.
It’s hard to come up with a clean takeaway message here. Think and work to build intuition about data so that you can better assess how impressive different findings are.
4. Perfectly controlling confounds is hard
Sometimes, you’ll be interested in two variables’ relationship but a third non-targeted variable may be linked to both and contribute to their association in an undesired way. This concept is usually thought of in terms of correlations and regressions. However, it is also worth considering confounds when drawing conclusions from classifiers or when searching for effects so huge that statistics are seemingly unnecessary.
For mech interp, confounds are important to think about when looking at associations between different patterns within a model’s internals, as confounds can be very large. Here are a few examples of large realistic confounds:
- You may want to test the extent to which a reasoning (chain-of-thought) model represents its confidence in an answer for math questions. You extract the model’s residual stream in the last token before </think>, and fit a classifier attempting to predict whether the final answer is correct or incorrect. You presume that for cases where the model gives an incorrect answer, it might represent uncertainty near the end. However, chains-of-thought leading to incorrect answers tend to be longer. Thus, you might find that you have a strong classifier. But, in reality, the classifier is heavily relying on a representation of response length, which may be seen as a less interesting mechanism than what you were hoping for.
- As another real example from the above paper, we were looking at attention-weight matrices. On average, attention weights between tokens will be higher earlier in a model’s response because the total attention weight in a row must sum to 1 and, early on, there are fewer tokens competing for their share. This is a massive effect that would cause the average early attention weights in a response to be tens or hundreds of times greater than late ones. This huge confound would likely be larger than anything of interest.
- Alternatively, you might wonder whether tokens expressing the same SAE latents tend to have strong causal effects on one another. To test this, you may find all pairs of tokens displaying the same latents (i.e., surpassing some threshold) and then measure the average effect of ablating the former token on the latter token’s logits (e.g., as the KL divergence). If you compare this to the average effect among pairs of tokens randomly selected from the model response, then you will likely find large differences. However, this would be heavily biased by proximity effects: (i) tokens nearby are more likely to express the same latents, and (ii) tokens nearby will tend to have large causal relationships. Proximity effects can be huge and, like above, be larger than anything of interest.
4.1. Challenging confounds
When controlling for major confounds, you should think about reliability and the relationship structure (e.g., linearity). This is particularly important when a confound’s impact is large.
For the purposes of controlling a confound, you can think of reliability as the extent to which a given measure captures an underlying construct. You may be conducting a study where you want to control for participant intelligence. To do this, it is common to use a participant’s educational attainment (years of school completed) as a proxy for intelligence. However, the measure is not the construct. If educational attainment explains 25% of the variability in intelligence (r = .5, R² = .25), then when you control for education, you’ll still leave a majority of intelligence effects intact.
For mech interp, your measures may be highly reliable, but controlling for their relationship with other variables may be difficult. The typical strategy for controlling a confound is to include it as a variable in a linear regression. However, this assumes the confound’s relationship with other variables is linear. Yet, most relationships are not linear. The position and proximity effects mentioned above are certainly not. If you simply control for them with a linear term in a regression, you will fail to actually scrub their effects out of the data.[10][11]
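As a toy demonstration of the linearity problem, here is a sketch where two variables are related only through a shared non-linear function of a confound (think of the proximity effects above); “controlling” for the confound with a linear term barely dents the spurious correlation.

```python
# Sketch: when a confound acts non-linearly, controlling for it with a linear
# regression leaves much of its influence behind. Here x and y are related only
# through a shared non-linear function of the confound z (e.g., token distance).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 5000
z = rng.uniform(0, 10, n)                 # the confound (e.g., distance between tokens)
shared = np.exp(-z)                       # strongly non-linear effect of the confound
x = shared + rng.normal(0, 0.05, n)
y = shared + rng.normal(0, 0.05, n)

def residualize_linear(v, z):
    # Remove only the linear component of z from v.
    slope, intercept, *_ = stats.linregress(z, v)
    return v - (intercept + slope * z)

raw_r, _ = stats.pearsonr(x, y)
lin_r, _ = stats.pearsonr(residualize_linear(x, z), residualize_linear(y, z))
print(f"raw r = {raw_r:.2f}, after linear control r = {lin_r:.2f} (still far from 0)")
```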
4.2. Bright possibilities
Confounds being hard to perfectly control doesn’t mean everything is doomed, nor do you always need to implement obscure methods for controlling confounds:
- In the social sciences and cases where confounding effects won’t be too large compared to your effect of interest, lazy linear solutions are often okay.
- For mech interp and cases where confounds might be very impactful but you’re unsure, you should still initially test lazy methods. A lazy method can show you whether a confound majorly predicts a variable of interest at all. If not, or if lazily controlling for a confound only has a tiny effect on your relationship of interest, then you’re fine.
Conversely, if your lazy control leads to your relationship of interest dropping in size by, say, 33% or 50%, then that is a strong sign that your relationship might truly be entirely confounded, and a more rigorous method could find that. In that case, good luck… maybe try to reorganize your experiment or hypothesis to avoid the confound? Alternatively, you can maybe get an LLM to help you find a statistical approach that is more thorough, although you might always live in fear that you aren’t fully scrubbing out a confound.
5. Simplicity is valuable, and you can usually ignore statistical tests’ assumptions
Most statistical tests come with many assumptions. Pearson correlations technically assume: (1) continuous variables, (2) linear relationships, (3) bivariate normality, meaning that the distribution forms an elliptical cloud, (4) homoscedasticity, meaning that the variance of each variable is stable as the other variable changes, and (5) no extreme outliers. Evaluation of statistical significance more generally assumes (6) independence among observations.
In practice, you can violate most of these without meaningful consequences, except for the importance of independence (discussed above) and considering outliers (which can entirely determine your result). Aside from these two, addressing violations with more obscure methods will generally just lead to your analyses becoming harder to read.[12]
These claims are not just me being a heterodox cowboy! The general claims have been Peer Reviewed™.
5.1. Readability matters
Your goal should be to communicate something to an audience. Weird statistics can be an unnecessary obstacle, decreasing accessibility and reading speed. Machine learning researchers, SWEs, and others might have strong technical and mathematical skills but not necessarily broad familiarity with statistics.
If you’re going to be complex, being complex in an expected way (using a method that readers or LLMs can look into) is better than being complex in an original way.
5.2. Slight complexity that I still encourage
One easy way to make sure that a (Pearson) correlation result doesn’t come down to an outlier is to simply also test a Spearman rank correlation. A Spearman correlation simply involves transforming every value into a ranking, separately for each of your two variables, and then measuring the Pearson correlation with respect to the rankings. For analysis, I usually test Spearman correlations by default. However, when communicating results, I will generally report a normal Pearson correlation for simplicity if its value is close to the Spearman correlation, given the slight clarity advantages.
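A minimal example of the kind of check I mean, using scipy (the data is random apart from one planted outlier):

```python
# Quick check that a correlation isn't riding on one outlier: compare Pearson vs Spearman.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = rng.normal(size=50)          # no true relationship...
x[0], y[0] = 8.0, 8.0            # ...plus one extreme point

print(f"Pearson:  r = {stats.pearsonr(x, y)[0]:.2f}")   # inflated by the outlier
print(f"Spearman: r = {stats.spearmanr(x, y)[0]:.2f}")  # rank-based, barely affected
```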
If you have a weird hunch from your data that you should be afraid of something (e.g., non-normality is important), then you definitely should explore the data more to get rid of that fear! This is good even if you don’t eventually end up reporting complex analyses in your results. Stuff complex statistics in an appendix if you feel they’re important.
If you’re doing weird analyses, then you should probably learn about permutation tests…
6. Complexity breeds statistical demons, but careful permutation testing can slay them
Lots of papers introduce elaborate boutique analyses to test novel hypotheses. Generally, simplicity is better, but this is not always possible.
It is not too hard to develop analyses that, at the surface level, appear to generate findings supporting a hypothesis while, in reality, yield pure mathematical artifacts.
In fMRI neuroscience, novel analyses might mix correlations, transformations, normalizations, classifications, and aggregations in new ways. Each new operation adds to the possibility of mathematical artifacts and makes them harder to spot. It’s hard to come up with a concise mech interp example of this, but here is an attempt to generalize one from neuroscience:
- Suppose you hypothesize that, when a model answers physics questions, its journey through latent space is less diverse than when the model answers chemistry questions.
- To test this, you decide to compare pairs of tokens from different questions and measure how similar their residual-stream vectors are.
- Suppose you don’t want the correlations to be driven by any singular dimension in the residual stream, so beforehand you decide to z-score each dimension across tokens (e.g., if you have 50 tokens and 1024 dimensions, then you z-score each dimension across the tokens 1024 times).
- After normalization, you then measure the similarity between different tokens’ residual streams via cosine similarity or Pearson similarity.
Can you spot why this opens the door to artifacts? Answer in footnote.[13]
In many cases, it will be infeasible for readers or even technically competent and careful reviewers to spot these issues. In some cases, even you might not have spotted them. When you’re fitting classifiers, you need to be wary of exotic ways for test set contamination to happen.
Permutation testing is the gold standard for checking whether your weird analysis has a problem. This applies when either evaluating statistical significance or fitting a classifier.
Permutation testing involves essentially shuffling your dependent variable early on and then seeing what results would happen in that case. For the physics and chemistry example, you would want to: (i) take all of the physics/chemistry labels and assign them to examples randomly, (ii) redo the analysis, (iii) repeatedly do this >100 times to get a sense for what your result distribution would be like if your dependent variable was random. For example, if your original finding only corresponds to the top 20% in the distribution, your p-value is only p = .20.[14]
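In code, the skeleton looks like this. The run_analysis function below is a hypothetical placeholder for whatever boutique pipeline you have built; the point is just the shuffle-and-redo loop around it.

```python
# Generic permutation-test skeleton: run_analysis is a hypothetical placeholder for
# whatever custom pipeline produced your statistic (not a real library call).
import numpy as np

def run_analysis(features, labels):
    # Placeholder "boutique" analysis: difference in feature means between label groups.
    return features[labels == 1].mean() - features[labels == 0].mean()

rng = np.random.default_rng(0)
features = rng.normal(size=200)
labels = rng.integers(0, 2, size=200)     # e.g., physics vs chemistry

observed = run_analysis(features, labels)
null = np.array([run_analysis(features, rng.permutation(labels)) for _ in range(1000)])
p_value = (np.sum(null >= observed) + 1) / (len(null) + 1)   # one-sided, with the standard +1 correction
print(f"observed = {observed:.3f}, permutation p = {p_value:.3f}")
```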
Permutation testing is really nice for sanity checking. Other uses include:
- Even if your analysis is simple, you might have a bug causing test set contamination.
- If you’re worried whether your weak classification accuracy (e.g., 55% relative to 50% chance accuracy) is capturing nothing or is just a weak predictor, permutation testing will give you a p-value describing the chance that random data would lead to an equally or more accurate classifier (see the sketch below).
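For the classifier case specifically, scikit-learn has a built-in helper that handles the label shuffling and refitting. Here is a minimal sketch on synthetic data with a deliberately weak signal.

```python
# Minimal sketch of a permutation test for a weak classifier, using scikit-learn's
# built-in helper (synthetic data with a deliberately weak real signal).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, permutation_test_score

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 20))
y = (X[:, 0] * 0.3 + rng.normal(size=400) > 0).astype(int)   # weak real signal in one feature

score, perm_scores, p_value = permutation_test_score(
    LogisticRegression(max_iter=1000), X, y,
    cv=StratifiedKFold(5), n_permutations=200, random_state=0)
print(f"accuracy = {score:.2f}, permutation p = {p_value:.3f}")
```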
7. Conclusion
That’s all I’ve got. Feel free to read my (hopefully friendly) opening poem again for a concluding summary.
- ^
Mentioning frequentism, some readers may react: Wait, aren’t Bayesian methods just better? I say: No! Unnecessary Bayesian analyses will damage clarity, and in practice, many papers end up just treating Bayesian methods like frequentist ones anyhow (e.g., using Bayes Factor thresholds of >10 for "weak" or >100 for "strong" evidence). There are cases where Bayesian techniques are helpful, but from what I can tell, they are a small minority, and frequentist statistics are great for communicating self-contained research outputs.
- ^
These “>50%” and “80%” percentages refer to “statistical power”. Statistical power is the likelihood of finding evidence significantly supporting your hypothesis, given your hypothesized effect size, your collected sample size, and your significance threshold. For example, if I hypothesize that two variables are correlated at r = .3 and I collect 40 data points, then I have an X% chance of finding a significant (p < .05) correlation (i.e., I have X% power). Power calculations are only as trustworthy as the input variables, but social science studies are generally recommended to have at least 80% power. This footnote is just meant to be fun facts, and you almost certainly don’t need to worry about formal calculations of power for mech interp work, although you may find this concept useful to think about informally. Of note, if your final p-value is .01 < p < .05, that means your experiment was likely underpowered and you got lucky in finding an effect… so you should probably go collect more data.
- ^
It is generally considered a “questionable research practice” to test your hypothesis after each data point has been collected and then cease collection once you’ve hit statistical significance. It is true that if your significance threshold is p < .05 for a one-sample t-test, you are willing to collect up to 100 data points, and you test for significance after each point, then you will achieve p < .05 about 12% of the time rather than the desired 5% of the time. However, if you set the significance threshold lower, then you can maintain a suitable false positive rate. For example, if you repeatedly test your hypothesis and keep collecting data until you reach p < .01, then over 100 data points your overall false positive rate will remain only about 5%, and thus you’ve effectively achieved p < .05. Some papers have talked about the math behind this. On average, repeated hypothesis testing in combination with an optimally stricter threshold (e.g., p < .01) requires fewer participants to find a significant effect than simply collecting a predetermined sample size and only testing the hypothesis once at p < .05. Much like the prior footnote, this info should be treated as mostly fun facts, and hopefully this just helps you build intuition. Main takeaway: Don’t feel bad about repeatedly testing a hypothesis after each data point has been collected, so long as you use a strict threshold.
- ^
That is, the proportion of variance attributable to between-subject differences will be far greater than the variance attributable to within-subject differences. The effective sample size - the N for which the central limit theorem applies - thus predominantly depends on the number of subjects.
- ^
This is less of a problem if you have a massive sample size (e.g., you have a hundred observations per group but millions of groups).
- ^
This number is so high that I’m even a bit scared I may have an error somewhere… a random forest classifier hits 65% accuracy, which feels more believable and still conveys the same point.
- ^
These matrices were computationally expensive to produce or else we would’ve had much more data.
- ^
Another downside to this mass-pooling strategy is that it may lead to finding effects based on across-response differences, which will usually be less interesting. That is, the different responses may vary in the average value of the two matrices (i.e., uniform shifts upward/downward). This would drive the mass-pooling correlation. Even if you were interested in this, you would be better off simply averaging the data by response and then computing the correlation among the twenty averages.
- ^
The technically most proper way to do this is with a multilevel linear regression. Multilevel regressions are becoming somewhat common in psychology and neuroscience research. However, you probably don’t want to do this for mech interp research because most readers will not have heard of them. The strategy we describe here is technically inappropriate for some statistical reasons, but it should be acceptable in practice.
- ^
Lots of times, there are intuitive non-linear tricks you can do. For instance, in the example above where you are comparing same-SAE-latent tokens vs. random-tokens, you can do a paired analysis where for each token-pair in the former group you select one random-token-pair that is the same distance apart.
- ^
Dynomight has a post touching on related issues: “Why controlling for a variable doesn't (usually) work”... although see this Contra. Both posts are consistent with the message here about how this can be a particularly big deal when confounds have massive impacts rather than being subtle.
- ^
If a p-value is p = .04 vs. p = .06 based on slight variations in how your correlation or t-test is performed, then your evidence is weak anyhow and you should go collect more data or treat the finding as fake.
- ^
The z-scoring will cause the similarity measure to be shifted downwards - e.g., if each token’s residual stream data was simulated randomly, then the mean similarity between tokens would be below zero. This is because one token’s vector being high in one dimension now forces other tokens’ vectors to be lower in that dimension. Furthermore, this effect will differentially impact the physics or chemistry similarity measurements depending on how many examples you have in each class (e.g., 40 physics problems but 60 chemistry ones). At least in neuroscience, this type of artifact can easily be larger than any hypothesized effect.
- ^
For hierarchical data, permutation testing can be done while maintaining the hierarchical structure (e.g., only shuffling data within a given hierarchy group), which is useful for evaluating whether you are somehow violating independence assumptions.