Linear regression in the wild

In one of my job interviews for a data scientist position, I encountered a home assignment I'd like to share.

The interviewer sent me a CSV file containing samples of two measured quantities, $x$ and $y$, where $y$ is a response variable that can be written as an explicit function of $x$. It is known that the technique used for measuring $x$ is twice as precise as the one used for measuring $y$, in the sense of standard deviation.

The task: model $y$ as a function of $x$.

Here are all the imports I'll need:

In [1]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from scipy.stats import probplot
import matplotlib.pyplot as plt
%matplotlib inline
In [2]:
data = pd.read_csv('data.csv', names=['x', 'y'])
data.head()
Out[2]:
            x           y
0   12.209516   36.021575
1   62.623142  162.746072
2  -14.712353  166.459738
3  130.624579  551.308062
4  121.652246  208.207596

Let's visualize the data, to see if it's easy to capture the pattern by eye:

In [3]:
data.plot.scatter('x', 'y', title='data')
Out[3]:
<matplotlib.axes._subplots.AxesSubplot at 0xa1b0320>

It clearly looks like a linear regression case. First, I'll manually remove the outliers:

In [4]:
data = data[data['x'] < 600]
data.plot.scatter('x', 'y', title='data without outliers')
Out[4]:
<matplotlib.axes._subplots.AxesSubplot at 0xa2256a0>

I'll use LinearRegression to fit the best line:

In [5]:
lr = LinearRegression().fit(data[['x']], data['y'])
data.plot.scatter('x', 'y', title='linear regression')
lr_predicted_y = lr.predict(data[['x']])
plt.plot(data['x'], lr_predicted_y)
Out[5]:
[<matplotlib.lines.Line2D at 0xb251320>]

Visually it looks compelling, but I'll validate the linear regression assumptions to make sure I'm using the right model.

If you're not familiar with the linear regression assumptions, you can read about them in the article Going Deeper into Regression Analysis with Assumptions, Plots & Solutions.

Let's plot the residual errors:

In [6]:
residuals = lr_predicted_y - data['y']
plt.scatter(x=lr_predicted_y, y=residuals)
plt.title('residuals')
Out[6]:
<matplotlib.text.Text at 0xb4e6080>

It seems like there's no autocorrelation in the residuals.

Heteroskedasticity also doesn't look like a problem here, since the variance looks pretty much constant (except for the left part of the plot, but there isn't much data there, so I'll ignore it).

Multicollinearity is not relevant here, since there's only one explanatory variable.

Residuals should be normally distributed; I'll verify that using a QQ plot:
In [7]:
_ = probplot(residuals, plot=plt)

It looks reasonably normal...
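To back up the visual checks with numbers, here's a small sketch of my own (not part of the original assignment) that computes the Durbin-Watson statistic for first-order autocorrelation and runs the Shapiro-Wilk normality test, both on the residuals already in scope:

from scipy.stats import shapiro

# Durbin-Watson statistic: values near 2 suggest no first-order autocorrelation.
dw = np.sum(np.diff(residuals) ** 2) / np.sum(residuals ** 2)

# Shapiro-Wilk test: a large p-value means we can't reject normality.
shapiro_stat, shapiro_p = shapiro(residuals)

print('Durbin-Watson: %.3f, Shapiro-Wilk p-value: %.3f' % (dw, shapiro_p))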

Assuming a linear relationship, I'll conclude that the relationship between $x$ and $y$ is best modeled as:

In [8]:
print('y = %f + %f*x' % (lr.intercept_, lr.coef_[0]))
y = 70.023655 + 2.973585*x

We've obtained a consistent estimator of the parameters required for calculating $y$ given a measured $x$ (where both have measurement errors), or in other words, the line's coefficients.

Up until now all I've done is plain old linear regression. The interesting thing about this task is that $x$ has measurement error (which is typical in real world use cases).

If we want to estimate the parameters required for calculating $y$ given the exact $x$ value (without measurement error), we need to use a different approach. Using simple linear regression without accounting for $x$ being random with noise results in a fitted slope slightly smaller than the true slope (that of the line relating $y$ to the error-free $x$); this shrinkage is known as attenuation bias. You can read this wiki page to learn why.
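To see the attenuation effect concretely, here's a minimal simulation sketch of my own (the noise levels are made up): OLS on a noisy $x$ recovers a slope shrunk by roughly the reliability ratio $\sigma_x^2 / (\sigma_x^2 + \sigma_\eta^2)$, where $\sigma_\eta^2$ is the variance of the $x$ measurement error:

rng = np.random.RandomState(0)
true_x = rng.uniform(0, 100, size=2000)          # error-free x
x_noise = rng.normal(0, 10, size=2000)           # measurement error in x
observed_x = true_x + x_noise
y = 3.0 * true_x + rng.normal(0, 5, size=2000)   # true slope is 3

# OLS on the observed (noisy) x underestimates the true slope.
ols_slope = LinearRegression().fit(observed_x.reshape(-1, 1), y).coef_[0]
reliability = true_x.var() / (true_x.var() + x_noise.var())
print('OLS slope: %.2f, true slope * reliability: %.2f' % (ols_slope, 3.0 * reliability))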

I'll use Deming regression, a method that can be used when the errors of the two variables $x$ and $y$ are assumed to be independent and normally distributed, and the ratio of their variances, denoted $\delta$, is known. This approach is a perfect fit for our setting, where we have:

The technique used for measuring $x$ is twice as precise as the one used for measuring $y$, in the sense of standard deviation.

So in our setting, the standard deviation of the $y$ errors is twice that of the $x$ errors, which gives $\delta = \sigma_\varepsilon^2 / \sigma_\eta^2 = 2^2 = 4$.

Using the formulas found in the wiki page, we get
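in the wiki's notation, with sample covariances $s_{xx}$, $s_{yy}$, $s_{xy}$ and sample means $\bar{x}$, $\bar{y}$ (these are exactly the estimators implemented in the cell below):

$$\hat{\beta}_1 = \frac{s_{yy} - \delta s_{xx} + \sqrt{(s_{yy} - \delta s_{xx})^2 + 4 \delta s_{xy}^2}}{2 s_{xy}}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$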

In [9]:
cov = data.cov()
mean_x = data['x'].mean()
mean_y = data['y'].mean()
s_xx = cov['x']['x']
s_yy = cov['y']['y']
s_xy = cov['x']['y']
delta = 2 ** 2
slope = (s_yy - delta * s_xx + np.sqrt((s_yy - delta * s_xx) ** 2 + 4 * delta * s_xy ** 2)) / (2 * s_xy)
intercept = mean_y - slope * mean_x

Using Deming regression, the relationship between $x$ and $y$ is modeled as

In [10]:
print('y = %f + %f*x' % (intercept, slope))
y = 19.575797 + 3.391855*x
In [11]:
data.plot.scatter('x', 'y', title='linear regression with & without accounting for $x$ error measurements')
plt.plot(data['x'], lr_predicted_y, label='ignoring errors in $x$')
X = [data['x'].min(), data['x'].max()]
plt.plot(X, [intercept + slope * x for x in X], label='accounting for errors in $x$')
plt.legend(loc='best')
Out[11]:
<matplotlib.legend.Legend at 0xb961278>

We fit two models: one is a simple linear regression model, and the other is a linear regression model accounting for the $x$ measurement errors.

The simpler one may be enough if our purpose is to calculate $y$ given a new $x$ that also has measurement error (drawn from the same distribution as the measurement errors present when training the model).

If we want to state the true relationship for $y$ as a function of $x$ in a world without measurement errors, we should go with the second model.
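To make the distinction concrete, here's a small hypothetical usage sketch (the value of x_new is made up): the first prediction is appropriate for a new noisy measurement of $x$, while the second evaluates the estimated error-free relationship:

x_new = 50.0  # hypothetical new measurement

y_simple = lr.predict([[x_new]])[0]    # prediction for a new *measured* x
y_deming = intercept + slope * x_new   # error-free relationship at x_new

print('simple: %.2f, Deming: %.2f' % (y_simple, y_deming))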


This was a great interview question, since I got to learn a new kind of model, which is pretty nice :)

Although this isn't really an example of linear regression in the wild as the title suggests (ok, I lied), this post does demonstrate an important concept many people don't deal with: in many cases the explanatory variables are measured inaccurately, and this might need to be accounted for (depending on the application).

Regress with care!
