Detecting Anomalies in a SMART Way (Part Two)

This article describes how Fortscale uses a Bayesian inference framework to build personalized adaptive user thresholds for detecting anomalous user behavior and potential insider threats. It first recaps the previous post's approach of triggering alerts with personalized adaptive thresholds, then dives into the principles and implementation details of Bayesian inference. By modeling each user's historical SMART values and handling uncertainty in the model parameters with Bayesian methods, Fortscale builds a personalized anomaly model per user, identifying truly anomalous behavior more precisely, reducing false positives, and making security analysis more efficient.

🤔 **Bayesian inference for the probability of anomalous behavior:** Bayesian inference is a method for calculating probabilities. Fortscale uses it to compute the probability that a given user performs an anomalous activity (one with a high SMART value); estimating this probability makes it possible to assign each activity an appropriate risk score.

📊 **An exponential distribution models user SMART values:** The article models user SMART values with an exponential distribution, which captures the observation that high SMART values occur rarely and that their frequency decays quickly. This model choice helps reduce false positives.

🧮 **Bayesian inference estimates the model parameter λ:** Calculating the probability requires estimating the model parameter λ. The article chooses Bayesian inference because it accounts for uncertainty in the parameter: the Bayesian approach treats the parameter as a random variable and expresses initial beliefs about it with a prior distribution.

🔄 **Prior and posterior updates:** A Gamma distribution is used as the prior because of its conjugacy: the posterior is also Gamma, and its parameters can be updated with a simple formula. Observing a user's SMART values updates the prior into a posterior, giving a sharper estimate of the model parameter.

🛡️ **Personalized adaptive thresholds:** The probabilities obtained via Bayesian inference yield a personalized adaptive threshold for each user; an alert is triggered only when the user's SMART value crosses it. This effectively reduces false positives and makes security analysis more efficient.

This post was originally published by me at the Fortscale blog.
Fortscale's product helps organizations eliminate insider threats by detecting anomalous user behavior.


In the previous post of this series I described how we at Fortscale use a personalized adaptive threshold for triggering alerts. Each user’s activity is assigned a risk score (known as the SMART value) that triggers a SMART Alert when it crosses the user’s threshold. We explained how the more anomalous activities a user performs over time, the higher his threshold gets. This creates the desired effect that the analyst is notified about a user only when that user performs an activity that is truly anomalous, even for him.

Moreover, we explained why and how we incorporate the whole organization’s anomalous activity into a user’s threshold so the analyst won’t be bothered by alerts from users who are anomalous, but not anomalous enough compared to other users.

In this post I’ll dive into the technical details of how we’ve implemented the SMART approach. More specifically, I’ll describe the Bayesian inference framework, and will elaborate on how we harnessed it for creating the personalized adaptive user threshold.


Bayesian inference is a method for calculating probabilities. The probability of interest in our case is the probability that a given user will perform an anomalous activity with a relatively high SMART value. If we can estimate this probability, we will be able to use it to assign the appropriate score to each activity.

Let’s take Joe for example. Joe has never performed any drastically anomalous activities. He has generated SMART values mostly in the range of 0.1 to 0.2. Generating a SMART value of 0.6 is very unlikely for him. Let’s say we estimate that the probability of Joe generating a SMART value of 0.6 is as low as 1%. If Joe generates a SMART value of 0.6, we’ll be able to use this probability and give this activity a risk score of 99 (which is 100 minus the probability expressed as a percentage). We could decide that only risk scores above some threshold (e.g., 95) should trigger an alert. For this post we will assume some pre-configured threshold, but be warned: choosing the right threshold is a difficult task as well. We will describe the methods we use to calculate this threshold in a separate post.
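
As a toy illustration, here is a minimal sketch (my own code, not Fortscale's; `ALERT_THRESHOLD` and `risk_score` are hypothetical names) of mapping such a probability to a risk score and checking it against a pre-configured threshold:

```python
# A minimal sketch, assuming a pre-configured alert threshold of 95.
ALERT_THRESHOLD = 95  # hypothetical value from the example above

def risk_score(tail_probability: float) -> float:
    """Map the probability of seeing such a SMART value (or higher)
    to a 0-100 risk score: rarer values get higher scores."""
    return 100 * (1 - tail_probability)

# Joe's example: a SMART value of 0.6 has an estimated probability of 1%.
score = risk_score(0.01)
print(score, score > ALERT_THRESHOLD)  # 99.0 True -> an alert is triggered
```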

In order to calculate the probability we first need to model our data. Let’s visualize SMART values collected across all of the organization’s users at one of our customers:

Since the majority of the SMART values are zero, which makes sense as most of the time users don’t perform anomalous activities, this histogram is hard to interpret... Let’s inspect the same data after discarding all the zeros:

That’s better. Now we can see two patterns in the data:

1. High SMART values tend to happen less often than low ones.
2. The frequency seems to decay aggressively (it decreases faster than at a linear pace).

Let \(V\) be a random variable representing a SMART value a user might have. A natural model choice would be the exponential distribution (for some \(\lambda\) which is a parameter of the model):

$$Prob(\text{V = v}) = \lambda e^{-\lambda v}$$

We could have verified the goodness-of-fit of this model, but we don’t really have to; we need to remember that our goal is to give high scores to users who perform activities that are extremely anomalous relative to their own behavioral model. We could have chosen other models that describe the data, e.g., one with an inverse-proportional decay, but compared to the exponential model such a choice would give lower scores to exceptionally high SMART values. We chose the more aggressive exponential model because it better reduces false positives.
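
To make this concrete, here is a hedged sketch (my own illustration; the parameter values are made up, and a Lomax distribution stands in for an "inverse proportional"-style heavy tail) comparing how the two model families score the same extreme SMART value:

```python
# Compare tail probabilities of a fast-decaying exponential model and a
# heavier-tailed alternative at an exceptionally high SMART value.
from scipy import stats

v = 0.6                                            # an extreme SMART value
exp_tail = stats.expon(scale=1 / 10).sf(v)         # exponential, lambda = 10
heavy_tail = stats.lomax(c=1.5, scale=0.1).sf(v)   # heavy-tailed stand-in

# The exponential tail decays much faster, so the same extreme value gets
# a higher risk score, while moderate values still score low.
print(f"exponential: tail={exp_tail:.4f} -> score {100 * (1 - exp_tail):.1f}")
print(f"heavy tail:  tail={heavy_tail:.4f} -> score {100 * (1 - heavy_tail):.1f}")
```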

Since we model each user separately, we should investigate and make sure that each user’s data follows the same pattern of more or less exponential decay; and it does.


In order to be able to calculate probabilities for Joe, we first need to estimate his model’s \(\lambda\) from his past SMART values. One frequentist approach would be to use the Maximum Likelihood Estimation (MLE) method in order to estimate \(\lambda\). The problem with this approach is that we get a point estimate: if we had to guess the best \(\lambda\) and we had only one shot, then MLE would probably be our best choice. But we can do better by using other methods which account for uncertainties.
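
For reference (a standard result, not spelled out in the post), the MLE for the exponential model has a simple closed form: the reciprocal of the mean of the observed SMART values,

$$\hat{\lambda}_{MLE} = \dfrac{n}{\sum_{i=1}^{n} v_i}$$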

To better illustrate the problem, let’s consider the simpler and more straightforward problem of estimating how biased a coin is. Let \(p\) be the probability of tossing HEADS. The goal then becomes to estimate \(p\). If we toss the coin 10 times and 3 of the flips come up HEADS, we can estimate \(p\) to be 0.3 using MLE. We obviously have some doubts about this estimate, since there is not enough data; it could be that \(p\) is 0.5, and we got only 3 HEADS due to chance. If we tossed a second coin 1000 times and got 300 HEADS, the MLE would still be 0.3, but our certainty would be greater this time.
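
Here is a small numeric sketch of this intuition (my own illustration, using the standard conjugate Beta prior for a coin's bias):

```python
# With a uniform Beta(1, 1) prior, the posterior after h heads in n tosses
# is Beta(1 + h, 1 + n - h); its spread reflects our remaining uncertainty.
from scipy import stats

for n, heads in [(10, 3), (1000, 300)]:
    posterior = stats.beta(1 + heads, 1 + (n - heads))
    print(f"n={n:4d}: mean={posterior.mean():.3f}, std={posterior.std():.3f}")

# n=  10: mean=0.333, std=0.131  <- same ratio of heads, but
# n=1000: mean=0.300, std=0.014  <- far more certainty with more data
```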

There are numerous mathematical approaches we could deploy here to account for uncertainties. We chose the Bayesian inference approach. Bayesianism and frequentism are two key approaches to estimating probabilities, representing two different philosophies. The Bayesian method treats the model’s parameter as a random variable: it can take any value, and a distribution function, called the Prior, tells us which values are more likely than others. The frequentist approach, on the other hand, argues that the model’s parameter is a fixed quantity; there’s nothing random about it, and one can’t state his prior belief about it. The Bayesian’s Prior will come in handy, so we chose the Bayesian approach.


Let's examine the math. Let \(v\) be some hypothetical SMART value which Joe might generate, and \(Data\) be his past SMART values. We can use some basic probability properties to derive the following:

$$\begin{split}Prob(v | Data) &= \int_{0}^{\infty} Prob(v, \lambda | Data) d\lambda \quad&\text{(1.)} \\&= \int_{0}^{\infty} Prob(v | \lambda, Data) Prob(\lambda | Data) d\lambda \quad&\text{(2.)} \\&= \int_{0}^{\infty} Prob(v | \lambda) Prob(\lambda | Data) d\lambda \quad&\text{(3.)} \\&= \int_{0}^{\infty} \lambda e^{-\lambda v} * Prob(\lambda | Data) d\lambda \quad&\text{(4.)}\end{split}$$

The equation's steps are

1. Use probability marginalization: integrate over all of \(\lambda\)'s possible values.
2. Use the chain rule.
3. Use the assumption that SMART values are independent of each other, so \(v\) is independent of \(Data\).
4. Substitute the probability with the evaluation of the exponential distribution: \(v \sim Exp(\lambda)\).

The only piece of the puzzle missing is how to estimate \(Prob(\lambda | Data)\). The Bayesian inference approach uses Bayes' rule in the following way:

$$Prob(\lambda | Data) = \dfrac{Prob(Data | \lambda) Prob(\lambda)}{Prob(Data)}$$

Now we have to calculate the probability of the \(Data\) given \(\lambda\), and the probability of \(\lambda\).

The first part is easy; it is the likelihood function. Since we assume SMART values are independent, we get

$$\begin{split}Prob(Data | \lambda) &= \prod_{v_i \in Data} Prob(v_i | \lambda) \\&= \prod_{v_i \in Data} \lambda e^{-\lambda v_i} \\&= \lambda^n e^{-\lambda \sum_{v_i \in Data}v_i}\end{split}$$

where \(n\) is the number of SMART values in \(Data\).

The second part is the Prior, which expresses our beliefs about \(\lambda\) before any evidence is taken into account. It turns out that if we choose a specific Prior function, we gain the nice property that \(Prob(\lambda | Data)\) can be calculated analytically. Implementation-wise, it means that we can write it in a nice closed form without having to do tedious calculations or employ numerical methods at runtime. I won’t bother you with the details, but in the exponential model’s case, if you choose the Prior to have the Gamma distribution, then \(Prob(\lambda | Data)\) (also known as the Posterior) is also Gamma, with updated parameters. This kind of Prior is known as a conjugate prior.

The Gamma distribution has two parameters which control its shape, \(\alpha\) and \(\beta\). Let’s say we choose \(\alpha_{prior}\) and \(\beta_{prior}\) for the Prior. These parameters state our belief about the values \(\lambda\) can take. Once we observe some SMART values of the user, we update our Prior belief and get a Posterior which is a Gamma with the parameters \(\alpha_{prior} + n\) (where \(n\) is the number of SMART values) and \(\beta_{prior} + \sum_{v_i \in \{\text{user's SMART values}\}}v_i\).
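
A minimal sketch of this update (my own code; the prior values and the user's history are placeholders):

```python
# Exponential likelihood + Gamma prior -> Gamma posterior.
def update_gamma_posterior(alpha_prior, beta_prior, smart_values):
    """alpha grows with the number of observations, beta with their sum."""
    n = len(smart_values)
    return alpha_prior + n, beta_prior + sum(smart_values)

history = [0.1, 0.15, 0.2, 0.1, 0.12]          # made-up past SMART values
alpha_post, beta_post = update_gamma_posterior(2.0, 1.0, history)
print(alpha_post, beta_post)                   # 7.0 1.67
```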

If we plug this into our calculation of \(Prob(v | Data)\), after some tedious derivations that we will spare the reader, we get

$$\begin{split}Prob(v | Data) &= \int_{0}^{\infty} \lambda e^{-\lambda v} * Prob(\lambda | Data) d\lambda \\&= \quad ... \quad \\&= \dfrac{\alpha_{prior} + n}{\beta_{prior} + \sum_{i=1}^{n} v_i + v} \left(\dfrac{\beta_{prior} + \sum_{i=1}^{n} v_i}{\beta_{prior} + \sum_{i=1}^{n} v_i + v}\right)^{\alpha_{prior} + n}\end{split}$$
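
As a quick sanity check (my own, reusing the placeholder posterior from the sketch above), the closed form agrees with a direct numerical integration of the exponential density against the Gamma posterior:

```python
# Verify the closed-form posterior predictive density numerically.
import numpy as np
from scipy import integrate, stats

alpha_post, beta_post, v = 7.0, 1.67, 0.6      # placeholder values

numeric, _ = integrate.quad(
    lambda lam: lam * np.exp(-lam * v)
    * stats.gamma(alpha_post, scale=1 / beta_post).pdf(lam),
    0, np.inf)
closed_form = (alpha_post / (beta_post + v)) \
    * (beta_post / (beta_post + v)) ** alpha_post
print(numeric, closed_form)                    # the two values agree
```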

Since SMART values are continuous, we should not compute \(Prob(V = v | Data)\). Instead, we should calculate \(Prob(V \geq v | Data)\), which is (after integrating)

$$\left(\dfrac{\beta_{prior} + \sum_{i=1}^{n} v_i}{\beta_{prior} + \sum_{i=1}^{n} v_i + v}\right)^{\alpha_{prior} + n}$$
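
Putting it all together, here is a minimal implementation sketch of this formula (my own code with placeholder prior parameters, not the product's implementation):

```python
# Prob(V >= v | Data) under the Gamma-exponential posterior predictive,
# and the corresponding 0-100 risk score.
def tail_probability(v, smart_values, alpha_prior=2.0, beta_prior=1.0):
    alpha_post = alpha_prior + len(smart_values)
    beta_post = beta_prior + sum(smart_values)
    return (beta_post / (beta_post + v)) ** alpha_post

def risk_score(v, smart_values):
    return 100 * (1 - tail_probability(v, smart_values))

# The adaptive threshold in action: the same v = 0.6 scores noticeably
# lower for a user who has already generated many high SMART values.
quiet_user = [0.1, 0.15, 0.2, 0.1, 0.12]
noisy_user = [0.5, 0.6, 0.55, 0.7, 0.65]
print(risk_score(0.6, quiet_user))   # higher score
print(risk_score(0.6, noisy_user))   # lower score -> less likely to alert
```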

At this point, we have a nice formula for the probability of seeing a SMART value of at least \(v\). This formula has two inputs: the user’s personal data and the Prior. The higher the user’s past SMART values, the higher the probability of seeing \(v\) gets, which consequently means a lower risk score will be given. Once the user has accumulated too many anomalies, the risk score that \(v\) triggers falls below the threshold we set, and an alert won’t be created. We can see the adaptive threshold in action! Nice…

Here’s an example of a concrete user’s past SMART values histogram:

The user had many near-zero SMART values, and a few high ones. The highest one is quite exceptional, at about 0.28. The red curve is the scoring function (one minus the aforementioned probability, multiplied by 100). The light red area is where scores are greater than 95; in this area alerts are triggered. We can see that the 0.28 SMART value indeed triggers an alert.

In the next post I will elaborate on how to choose good prior parameters \(\alpha_{prior}\) and \(\beta_{prior}\), so stay tuned.
