LessWrong, August 11, 2024
[LDSL#3] Information-orientation is in tension with magnitude-orientation

 

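This article examines how common statistical methods can produce large biases when predicting a variable, builds a mathematical model to analyze why, and suggests a direction for a solution.

Measurement error is proportional to the magnitude of the variable: the reliable part of the variable gets mixed with unpredictable noise. A fluctuation must have at least a proportional influence on the variable, otherwise it is negligible, while noise that is too large makes the original variable hard to make out.

Information-orientation introduces large biases: applying simple statistics to the variable mostly yields useless results, so more care is needed. A simplified model turns multiplicative fluctuations into additive ones, and standard normal approximations show how statistics distort things.

Information overload is a real problem: there are a great many subgroups, and statistical methods find the unimportant ones first, so a bias toward cases where X is large is needed. The author will discuss the main way to use a magnitude-based bias to infer things in a later post.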

Published on August 10, 2024 9:58 PM GMT

Followup to: Latent variable models, network models, and linear diffusion of sparse lognormals. This post is also available on my Substack.

Let’s say you are trying to make predictions about a variable. For example, maybe you are an engineer trying to keep track of how well the servers are running. It turns out that if you use the obvious approach advocated by e.g. frequentist statistics [1], you will have huge biases in what you pay attention to, compared to what you should pay attention to, because you will disregard big things in favor of common things. Let’s make a mathematical model of this.

Measurement error is proportional to magnitude

Because of background factors that fluctuate at a greater frequency than you can observe/model, and because of noise that enters through your indirect means of observing said variable, the reliable portion of the variable will be mixed together with some unpredictable noise. For example, as the engineer observes the server logs, they will see some quite significant fluctuations that are basically unattributable without a lot of care and work.

There are two fundamentally different kinds of fluctuations: additive ones and multiplicative ones.

In either case, in order for the fluctuation to be relevant, it has to have an influence that is at least proportional to the ordinary level for the variable; if it was much smaller, it would be negligible. We’d expect some multiplicative fluctuations to exist simply because things tend to work through multiplicative mechanisms, so as a lower bound on the fluctuations, it’s typically the case that there will be linearly proportional/non-negligible noise.

Conversely, if the noise is much larger than the variable you are tracking, it would be hard to make out the original variable, and you would pick a different set of indicators with less noise, so linearly proportional noise also works as an upper bound.

Information-orientation introduces massive biases

If we try to apply simple statistics to the variable, it will mostly yield garbage. For instance, if we take the average over a short period of time, it can fluctuate wildly due to the outliers, and if we try to apply a regression analysis, it will be strongly confounded by outliers. If we had enough sample size, it’s conceivable we could just average ridiculous amounts of data together to ensure that we only consider reliable results. However, if instead we are non-omniscient, have limited data, and want to extract as much information as efficiently as possible from the observations, we need to be more careful.

A natural way to think about this is that averages and least squares linear regression try to minimize absolute errors[2], but if we want to be robust to noise, we need to instead minimize relative errors. That is, if we predicted that the variable is 10, but it really is 20, that is a much worse prediction than if we predicted that the variable is 110 but it really is 120.
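The 10-versus-20 and 110-versus-120 comparison can be checked numerically. The sketch below is only an illustration: the function names are made up here, and the absolute log-ratio is one common way to express a relative error, not necessarily the measure the post has in mind.

```python
import math

def absolute_error(predicted, actual):
    """Plain difference between prediction and truth."""
    return abs(predicted - actual)

def log_ratio_error(predicted, actual):
    """Relative error, expressed as the absolute log of the ratio (an assumed choice)."""
    return abs(math.log(predicted / actual))

# The two predictions from the text: both are off by 10 in absolute terms.
for predicted, actual in [(10, 20), (110, 120)]:
    print(
        f"predicted={predicted}, actual={actual}, "
        f"absolute={absolute_error(predicted, actual)}, "
        f"log-ratio={log_ratio_error(predicted, actual):.3f}"
    )
```

The absolute error is identical in both cases, while the log-ratio error is several times larger for the 10-versus-20 prediction, which matches the intuition that being off by a factor of two is worse than being off by under ten percent.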

There are multiple basically isomorphic ways we can formalize this, for instance by fitting a lognormal distribution to the data. However, let’s use a more arithmetically convenient simplified model: if X is the variable and P is the observed proxy, then we can say that log(P) ≈ log(X) + log(E). Here, E is fluctuation/measurement error, expressed multiplicatively as a number to be multiplied by X to get the observation. By taking the logarithms, we make multiplicative fluctuations drop out into additive fluctuations, and we bring outliers in the distribution back into the bulk of the distribution.
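A small simulation can make the effect of the log transform concrete. This is a sketch under assumptions that are not in the post: X is held fixed at 100, E is drawn from a lognormal distribution with log-scale standard deviation 1, and the seed and sample size are arbitrary.

```python
import math
import random
import statistics

rng = random.Random(0)   # fixed seed so the run is repeatable
X = 100.0                # assumed true level of the variable
SIGMA = 1.0              # assumed std(log(E))

def observe(n):
    """Draw n proxy observations P = X * E with multiplicative lognormal noise."""
    return [X * rng.lognormvariate(0.0, SIGMA) for _ in range(n)]

def geometric_mean(values):
    """Average in log space, then map back: exp(mean(log(P)))."""
    return math.exp(statistics.fmean([math.log(v) for v in values]))

sample = observe(10_000)
print("arithmetic mean of P:", round(statistics.fmean(sample), 1))
print("exp(mean(log(P))):   ", round(geometric_mean(sample), 1))
print("largest observation: ", round(max(sample), 1))
```

In log space the noise is additive and roughly symmetric, so averaging log(P) lands close to log(X). The arithmetic mean of P is pulled upward by the right tail instead.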

Now we can use some standard normal distribution approximations to get at how statistics twist things. Let’s say that there’s a subgroup s of units (e.g. patients with a particular disease) for whom X is different, e.g. on average X_subgroup = X_norm + β. For this group, we have log(P_subgroup) ≈ log(X_norm + β) + log(E) = log(X_norm) + log(1 + β/X_norm) + log(E).

Thus, log(P) is log(1 + β/X_norm) higher in the subgroup than in the overall group. But we can’t measure this exactly, since there will be some noise depending on log(E). This noise will shrink with sqrt(N), where N is the sample size of the subgroup. So overall our statistical signal for how much is going on with the group will be something like log(1 + β/X_norm) sqrt(N) / std(log(E)).
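The signal expression can be reached in three short steps. The derivation below is a sketch under assumptions not stated in the post: the baseline mean of log(P) is known precisely, the log-noise is independent across the N units of the subgroup, and σ is shorthand for std(log(E)).

```latex
\begin{align*}
\Delta &= \log(X_{\mathrm{norm}} + \beta) - \log(X_{\mathrm{norm}})
        = \log\!\left(1 + \frac{\beta}{X_{\mathrm{norm}}}\right) \\
\mathrm{SE} &= \frac{\sigma}{\sqrt{N}}, \qquad \sigma = \operatorname{std}(\log E) \\
z &= \frac{\Delta}{\mathrm{SE}}
   = \frac{\log\!\left(1 + \beta / X_{\mathrm{norm}}\right)\sqrt{N}}{\sigma}
\end{align*}
```

Here Δ is the shift in mean log(P), SE is the standard error of the subgroup's mean log(P), and z is their ratio, i.e. the approximate signal.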

Obviously if you want to have as much information as possible, you are going to prefer less noisy data, as expressed by the 1/std(log(E)) term. But the actual importance of the subgroup ought to be given by something like βN, yet for statistical purposes it appears their detectability is more like log(β) sqrt(N), which not only massively underrates the importance of long-tailed effects, but also favors commonality over effect size, to the point where it becomes exponentially more important for a group to be common than for them to have a large difference from the norm, in order for it to be detectable.
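To see the mismatch between importance and detectability with numbers, here is a minimal sketch comparing two hypothetical subgroups. The baseline level, the noise level, and both subgroups are invented for illustration. Importance is taken to be βN and detectability the signal expression from the text.

```python
import math

X_NORM = 100.0        # hypothetical baseline level of X
LOG_NOISE_STD = 1.0   # hypothetical std(log(E))

def importance(beta, n):
    """Total effect of the subgroup, beta * N."""
    return beta * n

def detectability(beta, n):
    """Approximate signal: log(1 + beta/X_norm) * sqrt(N) / std(log(E))."""
    return math.log(1 + beta / X_NORM) * math.sqrt(n) / LOG_NOISE_STD

# Two made-up subgroups: one rare with a huge shift, one common with a modest shift.
rare_big = {"beta": 1_000_000.0, "n": 1}
common_small = {"beta": 50.0, "n": 1000}

for name, group in [("rare_big", rare_big), ("common_small", common_small)]:
    print(
        f"{name}: importance={importance(**group):.0f}, "
        f"detectability={detectability(**group):.2f}"
    )
```

With these numbers the rare subgroup carries twenty times the total effect of the common one, yet its signal is lower (about 9.2 against about 12.8), because the logarithm compresses the size of the shift while sqrt(N) still rewards commonality.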

Information overload

If you were just dealing with a single subgroup, this would probably be fine. At some point you get enough sample size to detect the subgroup, at which point you can use more direct means to estimate the correct β.

The issue is that the real world often contains tons of subgroups. If you are dealing with hundreds of thousands of subgroups, you will find many subgroups of infinitesimal importance before you find the ones that really matter. You can then either let yourself be distracted by all these subgroups, or you can conclude that the statistical approach to identifying them doesn’t really work.
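The ordering problem can be sketched with a toy population of subgroups. Everything below is hypothetical: 200 common subgroups with a modest shift, one rare subgroup with an enormous shift, and the same baseline and noise assumptions as a simple lognormal model would use. Subgroups are "discovered" in order of decreasing statistical signal.

```python
import math

X_NORM = 100.0        # hypothetical baseline level of X
LOG_NOISE_STD = 1.0   # hypothetical std(log(E))

def signal(beta, n):
    """Approximate signal: log(1 + beta/X_norm) * sqrt(N) / std(log(E))."""
    return math.log(1 + beta / X_NORM) * math.sqrt(n) / LOG_NOISE_STD

# (name, beta, N) for each made-up subgroup.
subgroups = [(f"common_{k}", 20.0, 10_000 + k) for k in range(200)]
subgroups.append(("rare_huge", 1e9, 1))

# Order in which a purely signal-driven search would surface the subgroups.
by_signal = sorted(subgroups, key=lambda g: signal(g[1], g[2]), reverse=True)
discovery_rank = [name for name, _, _ in by_signal].index("rare_huge") + 1

# Share of the total effect (sum of beta * N) carried by the rare subgroup.
total_importance = sum(beta * n for _, beta, n in subgroups)
rare_share = (1e9 * 1) / total_importance

print("rare_huge is discovered at position", discovery_rank, "of", len(subgroups))
print(f"rare_huge share of total importance: {rare_share:.1%}")
```

In this toy setup the rare subgroup accounts for most of the total effect but is surfaced last, after every one of the common subgroups.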

But if raw statistics don’t work, then what does work? It seems to me that there’s a need to be biased towards cases where X is big. Maybe this bias can be derived purely from maximizing probabilistic fit, and if so I have some ideas how that could be achieved, which I will get into later in the series. But I suspect ultimately we simply have to bring this bias with us, as a deviation from raw probability theory. In the next post, I will discuss the main way we can use a magnitude-based bias to infer things.

[1] Bayesian probability kind of has a similar problem, in that the statistics work out the same. However, if you are willing to let yourself get close to getting Pascal-mugged, I could imagine Bayesians could dodge it by chasing things with small probabilities of being really big.

[2] Well, absolute squared errors. Point is, not relative errors.


