Last Updated on May 14, 2024 by Editorial Team
Author(s): Ayo Akinkugbe
Originally published on Towards AI.

Overview

This project solves a classification problem with a multilayer perceptron (MLP) designed from the ground up. The model is used to predict whether a customer is likely to exit a bank service subscription. Below are the highlights covered in each section:

- Introduction
- Model Architecture
- Dataset
- Code Implementation
- Model Evaluation using a Confusion Matrix, Accuracy, Precision, Recall and F1-Score
- Model Comparison Using Scikit-learn

Introduction

For a perceptron, inputs are combined with weights and biases to derive a weighted sum. The weighted sum is then passed through a linear activation function or a step function to generate an output. However, a single-perceptron architecture does not scale to many real problems. In fact, Marvin Minsky and Seymour Papert showed in their 1969 book, Perceptrons: An Introduction to Computational Geometry, that this type of architecture can only solve linearly separable problems, and most real-world problems are not linearly separable. A multilayer perceptron provides the nuance required to solve more complex problems and find patterns in data that are not linearly separable. The default design of a neural network includes:

- an input layer — the layer containing preprocessed feature data
- hidden layer(s) — layers of neurons that ingest weighted inputs and produce an output using an activation function
- an output layer — the layer containing the desired prediction. For classification problems, predictions are often probabilities or numbers that depict the likelihood of occurrence. They are further encoded to the desired output based on a threshold or maximum. For example, applying np.argmax to the output matrix of a multi-class neural network returns the index of the maximum value, which serves as the predicted class label.

Unlike the perceptron, the hidden layers in a multilayer perceptron use a non-linear activation function. Examples of non-linear activation functions include the Sigmoid function (used in this case), the Rectified Linear Unit (ReLU), Leaky ReLU, and Softmax.

Architecture

This project uses a fully connected MLP architecture. In a fully connected MLP, also known as a dense MLP, each neuron in one layer is connected to every neuron in the next layer. This type of architecture can learn complex nonlinear mappings, but its large number of parameters carries a risk of overfitting. The MLP network in this case has an input layer, 2 hidden layers, and an output layer. In the forward pass, the activation function used for every neuron in every layer is the Sigmoid function; in each case, the Sigmoid function takes in x, the weighted sum of that neuron's inputs. Backpropagation is not used in this case. Instead, the weights and biases are fitted by minimizing a Cross-Entropy Loss objective function. The network uses a total of 21 weights (3 × 3 from the inputs to the first hidden layer, 3 × 3 between the hidden layers, and 3 from the second hidden layer to the output) and 3 biases (one shared bias per layer). The output y is a number between 0 and 1, and a threshold function is used in the implementation to convert y to the desired output, as sketched below.
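As an illustration of that final thresholding step, here is a minimal sketch; the 0.5 cutoff and the helper name `threshold` are assumptions for illustration, not details taken from the article:

```python
import numpy as np

# Hypothetical helper: map predicted probabilities in [0, 1] to hard 0/1 labels.
# The 0.5 cutoff is an assumed default, not a value specified in the article.
def threshold(y, cutoff=0.5):
    return (np.asarray(y) >= cutoff).astype(int)

threshold([0.2, 0.7, 0.5])  # -> array([0, 1, 1])
```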
Dataset

This project uses a customer churn dataset for a bank referred to as ABC multi-state bank. Each row of the dataset represents the details of one customer at the bank. Originally the dataset has 11 features and 1 label. The features are reduced to 3 (Tenure, NumOfProducts, HasCrCard). The label Exited is 1 if the customer stops using the bank subscription and 0 if the customer is still a customer of the bank. The task with this dataset is to predict whether a customer would stay or leave the bank given the 3 selected features. Learn more about the dataset here.

Implementation

This section delineates the Python code implementation of the MLP build process. The neural network is built from scratch using only the NumPy library and compared with results from the Scikit-learn library. A copy of the code and data files for this project can be found here.

To implement the MLP design for classification with Python:

Step 1 — Import and Process Data

The first step involves importing and preprocessing the data. In this phase, the features used for prediction are selected. The data is also transformed into a NumPy array to allow for easier selection and computation in the network.

```python
# Import required Python libraries
import numpy as np
import pandas as pd
from scipy.optimize import minimize

# Read data from csv file
sample_data = pd.read_csv('CustomerChurn.csv')
sample_data

# Make a copy of the data
data = sample_data.copy()

# Choose data features for prediction
data = data[['tenure', 'products_number', 'credit_card', 'churn']]

# Convert data to numpy array
data = data.values
data

# Split data columns into features (X) and label (Y)
X = data[:, 0:3]
Y = data[:, -1]
```

Step 2 — Create Forward Pass

This step computes each layer of the network. The weighted sum of the inputs from one layer is passed to the next. Each neuron in the hidden layers is activated using the Sigmoid function. The output y is a value in the range of 0 to 1.

```python
# Define sigmoid activation function
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def output(inputs, weights):
    # Extracting the 21 weights and 3 biases for the layers
    (w11, w12, w13, w21, w22, w23, w31, w32, w33,
     w41, w42, w43, w51, w52, w53, w61, w62, w63,
     w4, w5, w6, b1, b2, b3) = weights

    x1, x2, x3 = inputs.T

    # First hidden layer
    h1 = sigmoid(w11 * x1 + w12 * x2 + w13 * x3 + b1)
    h2 = sigmoid(w21 * x1 + w22 * x2 + w23 * x3 + b1)
    h3 = sigmoid(w31 * x1 + w32 * x2 + w33 * x3 + b1)

    # Second hidden layer
    h4 = sigmoid(w41 * h1 + w42 * h2 + w43 * h3 + b2)
    h5 = sigmoid(w51 * h1 + w52 * h2 + w53 * h3 + b2)
    h6 = sigmoid(w61 * h1 + w62 * h2 + w63 * h3 + b2)

    # Output layer
    y = sigmoid(w4 * h4 + w5 * h5 + w6 * h6 + b3)
    return y
```

Step 3 — Create Objective Function (Cross Entropy)

This architecture does not optimize using backpropagation. Instead, a cross-entropy loss function is leveraged to find the optimal weights […]
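The article's code for this step is cut off above. As a rough illustration only, the following is a minimal sketch of what a binary cross-entropy objective over the `output` function from Step 2 might look like; the function name `cross_entropy` and the clipping safeguard are assumptions, not the author's exact code:

```python
# Sketch of a binary cross-entropy objective (assumed form, not the author's exact code)
def cross_entropy(weights):
    y_hat = output(X, weights)
    # Clip predictions away from 0 and 1 to avoid log(0) (assumed safeguard)
    y_hat = np.clip(y_hat, 1e-9, 1 - 1e-9)
    return -np.mean(Y * np.log(y_hat) + (1 - Y) * np.log(1 - y_hat))
```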
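Given the `minimize` import in Step 1, such an objective would presumably be fitted along these lines; the random initialization, the Nelder-Mead method, and the iteration budget below are assumptions for illustration, not choices confirmed by the article:

```python
# Fit all 24 parameters (21 weights + 3 biases) by minimizing the loss (illustrative)
rng = np.random.default_rng(42)          # assumed seed for reproducibility
initial_guess = rng.normal(size=24)      # random starting weights and biases
result = minimize(cross_entropy, initial_guess, method='Nelder-Mead',
                  options={'maxiter': 20000})

# Predicted probabilities with the fitted parameters
y_prob = output(X, result.x)
```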