Data Science Interview Question: Creating ROC & Precision-Recall Curves From Scratch

Last Updated on June 3, 2024 by Editorial Team
Author(s): Varun Nakra
Originally published on Towards AI.

This is a popular data science interview question: create the ROC and similar curves from scratch, i.e., with no data on hand. For the purposes of this story, I will assume that readers know what these metrics mean, how they are calculated, and how they are interpreted. I will therefore focus on the implementation.

We start by importing the necessary libraries (math is imported as well because it is used in the calculations):

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import math

The first step is to generate the ‘actual’ data of 1s (bads) and 0s (goods), because this will be used to calculate and compare the model accuracy via the aforementioned metrics. For this article, we will create the “actual vector” from a discrete Uniform distribution. For the subsequent and related article, we will use the Binomial distribution.

    actual = np.random.randint(0, 2, 10000)

The above code generates 10,000 random integers, each either 0 or 1 (the upper bound of 2 is exclusive), which form our vector of actual binary classes.

Now, of course, we need another vector of probabilities for these actual classes. Normally, these probabilities are the output of a machine learning model. Here, however, we will generate them randomly under some useful assumptions. Let’s assume the underlying model is a ‘logistic regression model’, so the link function is the logistic (logit) function. The figure below describes the standard logistic function.

For a logistic regression model, the term k(x - x_0) in the exponent is replaced by a ‘score’, so that p = 1/(1 + e^(-score)). The ‘score’ is a weighted sum of model features and model parameters. Thus, when the ‘score’ = 0, the logistic function must pass through 0.5 on the Y-axis. This is because logit(p) = log-odds(p) = log(p/(1-p)) = 0 => p = 1-p => p = 0.5.
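The relationship between score and probability can be checked numerically. The sketch below is illustrative only; the helper name `logistic` is an assumption and does not come from the article.

```python
import math


def logistic(score: float) -> float:
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-score))


# A score of 0 sits exactly at the midpoint of the curve.
p_zero = logistic(0.0)

# Inverting with the logit, log(p / (1 - p)), recovers the score.
p = logistic(1.5)
recovered_score = math.log(p / (1.0 - p))

print(p_zero, round(recovered_score, 6))
```

The logit is simply the inverse of the logistic function, which is why a log-odds of 0 and a probability of 0.5 describe the same point.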
Also notice that as the ‘score’ takes large positive or large negative values, the function asymptotically approaches 1 (bad) or 0 (good), respectively. Thus, the more positive the ‘score’, the closer the predicted probability is to 1, and the more negative the ‘score’, the closer it is to 0.

But what are we scoring? We are scoring each data input present in our ‘actual vector’. If we want to assume that our underlying logistic regression model is skilled, i.e., predictive, the model should assign comparatively higher scores to bads than to goods. Thus, bads should have more positive scores (so that the predicted probability is close to 1) and goods should have more negative scores (so that the predicted probability is close to 0). This is known as rank ordering by the model. In other words, there should be discrimination or separation between the scores, and hence the predicted probabilities, of bads vs goods.

We have seen that a score of 0 implies probability of good = probability of bad = 0.5, which means the model is unable to differentiate between good and bad. Since we know that the data point is actually either good or bad, a score of 0 (a predicted probability of 0.5) is the worst possible score from the model. This gives us some intuition for the next step.

The scores can be randomly generated using the Standard Normal distribution, with a mean of 0 and a standard deviation of 1. However, we want different predicted scores for bads and goods, and we want the bad scores to be higher than the good scores. Thus, we use the standard normal distribution and shift its mean to create a separation between the goods and the bads.

    # scores for bads
    bads = np.random.normal(0, 1, actual.sum()) + 1
    # scores for goods
    goods = np.random.normal(0, 1, len(actual) - actual.sum()) - 1
    plt.hist(bads)
    plt.hist(goods)

In the aforementioned code, we sampled the bads scores and the goods scores from two separate standard normal draws and then shifted them to create a separation between the two.
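As a quick numerical check of this separation, the sketch below pairs each bad with a randomly chosen good and counts how often the bad scores higher. The seed, the sample counts, and the names `rng`, `bad_scores`, `good_scores`, and `frac_bad_higher` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, assumed here for reproducibility

n_bads, n_goods = 5000, 5000    # illustrative counts, not taken from the article
shift = 1.0                     # same shift as in the text

bad_scores = rng.normal(0, 1, n_bads) + shift
good_scores = rng.normal(0, 1, n_goods) - shift

# Pair each bad with one randomly chosen good and count how often the bad scores higher.
paired_goods = rng.choice(good_scores, size=n_bads)
frac_bad_higher = float(np.mean(bad_scores > paired_goods))

print(f"mean bad score:  {bad_scores.mean():.2f}")
print(f"mean good score: {good_scores.mean():.2f}")
print(f"share of pairs with bad > good: {frac_bad_higher:.3f}")
```

With a shift of 1 in each direction, the difference between a bad score and a good score is normal with mean 2 and variance 2, so the share should land at roughly 0.92. A larger shift pushes it towards 1.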
We shift the bads scores (represented by the blue color in the image) by 1 towards the right and the goods scores by 1 towards the left. This ensures the following:

- The bads scores are higher than the goods scores in a substantial share of cases (as the histogram shows).
- The bads scores have a proportionately higher number of positive scores, and the goods scores have a proportionately higher number of negative scores.

We can of course widen this separation by increasing the ‘shift’ parameter and assigning it values higher than 1. However, in this story, we won’t do that. We will explore that in the subsequent related stories.

Now, let’s look at the probabilities generated by these scores.

    # prob for bads
    bads_prob = list(map(lambda x: 1 / (1 + math.exp(-x)), bads))
    # prob for goods
    goods_prob = list(map(lambda x: 1 / (1 + math.exp(-x)), goods))
    plt.hist(bads_prob)
    plt.hist(goods_prob)

As discussed earlier, when the ‘scores’ are pushed through the logistic function, we get the probabilities. It is evident that the bad probabilities (blue) are higher and concentrated towards 1, while the good probabilities (orange) are concentrated towards 0.

The next step is to combine the actual and predicted vectors into a single data frame for analysis. We assign bad probabilities where the data instance is actually bad and good probabilities where it is actually good.

    # create predicted array
    # note: bads and goods are reused here as running counters,
    # which overwrites the score arrays created earlier
    bads = 0
    goods = 0
    predicted = np.zeros(len(actual))
    for idx in range(0, len(actual)):
        if actual[idx] == 1:
            predicted[idx] = bads_prob[bads]
            bads += 1
        else:
            predicted[idx] = goods_prob[goods]
            goods += 1

    actual_df = pd.DataFrame(actual, columns=['actual'])
    predicted_df = pd.DataFrame(predicted, columns=['predicted'])
    predicted_df = pd.concat([actual_df, predicted_df], axis = 1)
    # sort from highest to lowest predicted probability
    predicted_df = predicted_df.sort_values(['predicted'], ascending = False).reset_index()
    # after sorting, only the ordered actuals are kept
    predicted_df = predicted_df.drop(columns = 'predicted')

The next step is to create bins. This is because the curves that we want to generate are discrete in nature.
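As an aside on the assembly step above, before moving on to the bins: the loop that fills `predicted` can also be expressed with boolean masks. The sketch below is an alternative under assumptions, not the author's code. It regenerates small inputs with a seeded generator, and the names `rng`, `is_bad`, `bad_scores`, `good_scores`, and `ranked` are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)  # assumed seed, for reproducibility only

actual = rng.integers(0, 2, 1000)   # 1 = bad, 0 = good
is_bad = actual == 1

# Shifted standard-normal scores, pushed through the logistic function.
bad_scores = rng.normal(0, 1, is_bad.sum()) + 1
good_scores = rng.normal(0, 1, (~is_bad).sum()) - 1
bads_prob = 1 / (1 + np.exp(-bad_scores))
goods_prob = 1 / (1 + np.exp(-good_scores))

# Mask assignment fills positions in order, matching a counter-based loop.
predicted = np.zeros(len(actual))
predicted[is_bad] = bads_prob
predicted[~is_bad] = goods_prob

# Rank observations from highest to lowest predicted probability.
# drop=True discards the old index instead of keeping it as a column.
ranked = (
    pd.DataFrame({"actual": actual, "predicted": predicted})
    .sort_values("predicted", ascending=False)
    .reset_index(drop=True)
)
print(ranked.head())
```

If the model is skilled, the top of `ranked` should be dominated by bads and the bottom by goods, which is the rank ordering the curves rely on.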
For each bin, we calculate our desired metrics cumulatively. In other words, we generate cumulative distribution functions for the discrete random variables goods and bads. The number of bins is arbitrary (we assign n_bins = 50). Note the use of the floor function. This is because the length of the data frame may not divide equally into 50 bins. Thus, we take the floor of it and modify our code such […]
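The binning idea described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the article's own code, which is cut off here: the seed, the stand-in data, the rule that any remainder rows go into the last bin, and the names `sorted_actual`, `bin_size`, `tpr`, `fpr`, `precision`, and `auc` are all assumptions.

```python
import math
import numpy as np

rng = np.random.default_rng(2)  # assumed seed

# Stand-in for the sorted 'actual' column. Sorting by score gives the same
# order as sorting by probability, because the logistic function is monotonic.
n = 10_000
actual = rng.integers(0, 2, n)
scores = rng.normal(0, 1, n) + np.where(actual == 1, 1.0, -1.0)
sorted_actual = actual[np.argsort(-scores)]

n_bins = 50
bin_size = math.floor(n / n_bins)   # rows per bin; any remainder goes into the last bin

total_bads = sorted_actual.sum()
total_goods = n - total_bads

tpr, fpr = [0.0], [0.0]             # the ROC points start at the origin
precision = []
for b in range(1, n_bins + 1):
    # Cumulative cut: everything up to the end of bin b is predicted 'bad'.
    end = n if b == n_bins else b * bin_size
    cum_bads = sorted_actual[:end].sum()
    cum_goods = end - cum_bads
    tpr.append(cum_bads / total_bads)     # share of all bads captured so far (recall)
    fpr.append(cum_goods / total_goods)   # share of all goods captured so far
    precision.append(cum_bads / end)      # share of flagged rows that are bad

# Trapezoidal area under the (fpr, tpr) points.
auc = sum((fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2 for i in range(1, len(fpr)))
print(f"approximate AUC from {n_bins} bins: {auc:.3f}")
```

Plotting `tpr` against `fpr` gives the ROC curve, and plotting `precision` against `tpr[1:]` gives the precision-recall curve. More bins give a smoother curve at the cost of noisier per-bin counts.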
