Blog on Text Analytics - Provalis Research (November 27, 2024)
What is K-Nearest Neighbors?!

This article introduces a simple machine learning technique, the K-nearest neighbors (KNN) algorithm. KNN is a "lazy learner": it stores the entire dataset instead of building a model, so most of the computation happens at prediction time. A simple example shows how KNN works: given the labels of known data points, find the K neighbors closest to a new data point and assign it the label that occurs most often among them. KNN can be used for both classification and regression problems, and its main strengths are its ease of interpretation and its robustness to noisy data. The article also briefly discusses how the parameter K affects the algorithm's performance and previews a follow-up post on how to choose K.

🤔 **KNN is a simple machine learning technique that stores the entire dataset instead of building a model, which is why it is called a "lazy learner."** Its core idea is to predict the label of a new data point from the labels of its K nearest neighbors, which makes it easy to understand and apply.

📊 **KNN can be applied to both classification and regression problems.** For classification, KNN assigns the new data point the label that occurs most often among its K nearest neighbors; for regression, it predicts the average of the K nearest neighbors' values.

🔎 **KNN's main strengths are its ease of interpretation and its relative robustness to noisy data.** Since there is no model-building step, the principle behind the algorithm is simple and easy to understand and apply.

⚙️ **The parameter K drives the algorithm's performance.** The choice of K affects the predictions: too small a K makes the algorithm sensitive to noisy data, while too large a K makes the predictions overly smooth.

⏳ **KNN's computational cost is concentrated at prediction time.** Because the algorithm stores the entire dataset and must compute the distance from the new data point to every stored point, prediction can be expensive, especially on large datasets, as the short sketch below illustrates.
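
To make the lazy-learner and prediction-cost points concrete, here is a minimal Python sketch; it is not from the original article, and the class name and numbers are invented for illustration. "Fitting" does nothing but store the dataset, so all distance computations are deferred to prediction time.

```python
import numpy as np

class LazyKNN:
    def fit(self, X, y):
        # No model is built: the whole dataset is simply kept in memory.
        self.X = np.asarray(X, dtype=float)
        self.y = list(y)
        return self

    def distances_to(self, x_new):
        # Euclidean distance from the new point to every stored point.
        # This is where the prediction-time cost comes from on large datasets.
        return np.sqrt(((self.X - np.asarray(x_new, dtype=float)) ** 2).sum(axis=1))

# "Training" is instantaneous because it is only a copy of the data.
knn = LazyKNN().fit([[62], [70], [90]], ["20-30", "20-30", "40-50"])
print(knn.distances_to([72]))  # one distance per stored point: 10, 2 and 18
```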

In our previous post on machine learning, we explained that there are supervised and unsupervised beasts lurking in machine learning land! One of the simplest machine learning techniques is the K-nearest neighbors (KNN). In this post, we will briefly review the way KNN works.

We’ve all heard of lazy guys! With technology becoming more accessible and easier to use, it’s even easier to get lazy. What do you do if you are on the couch and the remote is on the other side of the room? Just relax and download an app to change the TV channel right from your smartphone! You are going to love KNN. It is very lazy!

KNN is the couch potato of algorithms: it basically stores the entire dataset rather than modeling it. You can use some clever data structures to improve efficiency, but there is still no model! So the first question that comes to mind with respect to KNN is: “is it really machine learning!?” Hmmm, yeah, it can be, but it is definitely a “lazy learner.” KNN can be applied to both classification (i.e., predicting discrete values) and regression (i.e., predicting continuous values) problems. Oddly enough, being lazy can come in handy; there are some advantages to slothfulness! The main strength of KNN lies in its ease of interpretation, and it is a relatively robust algorithm for noisy data. Since there is no model, there is no training. But all the work happens at prediction time, so the computational cost can be significant in some cases.

Okay, we have a lazy tool, but how does it really work? Let’s assume you hired a sneaky ninja and obtained data on your colleagues’ weight and age group. For simplicity, assume there is some sort of positive relationship between age group and weight, and you have a plot like the following:

Now, a new handsome guy is joining your company and somehow you already know his weight and rather than asking him directly, you want to predict his age group! How? Let’s put him on the plot first…

Technically speaking, you would like to find the label (age group) of the new data point (the new guy). Suppose the new data point can belong to either the red diamond or the blue circle group and nothing else. Now it’s time to find out the role of “K” in the KNN algorithm! As a parameter given to the algorithm, K tells it how many neighboring data points it should consider when making the decision about the new data point. For example, if we set K=3, the algorithm checks the 3 nearest neighbors of the new data point and, in the case of classification, assigns it the majority label among those 3 nearest neighbors. Hmmm, sounds easy! KNN can also be used for regression problems. There, instead of returning the label with the most votes, the algorithm returns a continuous value, for example by averaging the outcomes of the nearest neighbors.
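
As a rough illustration of this prediction step (not from the original post; the function name, toy weights, and age-group labels are made up), the sketch below measures the distance from the new point to every stored point, keeps the K closest, and either takes the majority label for classification or averages the neighbors’ outcomes for regression:

```python
from collections import Counter
import numpy as np

def knn_predict(X, y, x_new, k=3, regression=False):
    """Predict the label (or value) of x_new from its k nearest neighbors in (X, y)."""
    X = np.asarray(X, dtype=float)
    # Euclidean distance from the new point to every stored point.
    distances = np.sqrt(((X - np.asarray(x_new, dtype=float)) ** 2).sum(axis=1))
    nearest = np.argsort(distances)[:k]            # indices of the k closest points
    neighbor_outcomes = [y[i] for i in nearest]
    if regression:
        # Regression: return the average outcome of the k nearest neighbors.
        return float(np.mean(neighbor_outcomes))
    # Classification: return the majority label among the k nearest neighbors.
    return Counter(neighbor_outcomes).most_common(1)[0][0]

# Hypothetical data: weight (kg) as the single feature, age group as the label.
weights = [[62], [65], [70], [85], [90], [95]]
age_groups = ["20-30", "20-30", "20-30", "40-50", "40-50", "40-50"]
print(knn_predict(weights, age_groups, [72], k=3))  # -> "20-30" by majority vote
```

Euclidean distance is used here only because it is a common default; other distance measures can be plugged in the same way.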

Although KNN is very simple, it can perform nicely in many cases. As we have seen, the most important parameter to tune is K, and setting different values for K can affect the performance of the algorithm. But how do you tune it?!? Wait for the next post!
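
As a small illustration of that sensitivity, the snippet below reuses the hypothetical `knn_predict` helper from the sketch above on the same kind of toy data, with one deliberately mislabeled point close to the query, and prints the prediction for a few values of K:

```python
# Assumes knn_predict from the previous sketch is already defined.
weights = [[62], [65], [70], [73], [85], [90], [95]]
age_groups = ["20-30", "20-30", "20-30", "40-50",  # the 73 kg point carries the noisy label
              "40-50", "40-50", "40-50"]

for k in (1, 3, 5):
    print(k, knn_predict(weights, age_groups, [72], k=k))
# k=1 copies the single noisy neighbor ("40-50");
# k=3 and k=5 let the majority vote pull the prediction back to "20-30".
```

With K=1 the prediction simply follows the one noisy neighbor, while K=3 and K=5 average that noise away; how to choose K in a principled way is the topic of the next post.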

