Last Updated on June 4, 2024 by Editorial Team

Author(s): Greg Postalian-Yrausquin

Originally published on Towards AI.

This exercise is part of a project implemented on a hardware system. The system has automatic doors that the user can recover when they fail to operate (to cover, for example, the scenario where the mechanism gets stuck). In some cases this recovery procedure failed, indicating that something deeper might be going on; at that point the user has to resort to a technician for support.

The original dataset was queried from AWS. To retrieve it, I devised the following query script (which is reusable):

import pandas as pd
import boto3 as aws
import os
import awswrangler as wr
import pyspark.pandas as ps
from itertools import chain, islice, repeat, tee
import numpy as np

class QueryAthena:
    def __init__(self, query):
        self.database = 'database'
        self.folder = 'path_queries/'
        self.bucket = 'bucket_name'
        self.s3_output = 's3://' + self.bucket + '/' + self.folder
        self.aws_access_key_id = os.environ.get('AWS_ACCESS_KEY_ID')
        self.aws_secret_access_key = os.environ.get('AWS_SECRET_ACCESS_KEY')
        self.region_name = os.environ.get('AWS_DEFAULT_REGION')
        self.aws_session_token = os.environ.get('AWS_SESSION_TOKEN')
        self.query = query

    def run_query(self):
        boto3_session = aws.Session(aws_access_key_id=self.aws_access_key_id,
                                    aws_secret_access_key=self.aws_secret_access_key,
                                    aws_session_token=self.aws_session_token,
                                    region_name=self.region_name)
        # pass the session explicitly so the credentials configured above are actually used
        df = wr.athena.read_sql_query(sql=self.query,
                                      database=self.database,
                                      ctas_approach=False,
                                      s3_output=self.s3_output,
                                      boto3_session=boto3_session)
        return df

With this class it is very easy to run a SQL-like query (Athena uses Presto) to retrieve data from the data lake. I won't go into the details of this class since it is not the objective of the article.

df = QueryAthena("""select * from table""").run_query()
df.describe()

As seen here, we have 94 columns in the original dataset. Not all of them can be used as predictors, since some are metadata about the device, customer, timestamp, etc. In the next step I exclude the unusable columns and give the target variable the standard name "Y".

# name of the target variable
Y = "target_"
# names of the metadata columns
dropped = ["meta_1", "meta_2", "meta_3", "meta_4", "meta_5"]
clean_df = df.drop(dropped, axis=1)
clean_df = clean_df.dropna()
clean_df = clean_df.sample(frac=1)
clean_df["Y"] = clean_df[Y].values

In the next steps I split the dataset into train, validation and test sets and convert the data into tensors that PyTorch can consume. The three datasets are used as follows:

train: the data the model runs on and learns from.
validation: at every step of training, accuracy metrics are computed on this set and used to decide how to proceed.
test: left untouched and used only at the end to assess the performance of the final model.

Tensors, a concept borrowed from physics and mathematics, are a fairly generic way of arranging data, which is easiest to illustrate with examples: a tensor of dimension 0 is a number, a tensor of dimension 1 is a vector (a collection of numbers), a tensor of dimension 2 is a matrix, a tensor of dimension 3 is a cube of data, and so on.
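As a quick illustrative aside (not part of the original pipeline), PyTorch exposes these dimensions through the ndim and shape attributes of a tensor:

import torch

scalar = torch.tensor(3.0)                        # dimension 0: a single number
vector = torch.tensor([1.0, 2.0, 3.0])            # dimension 1: a vector
matrix = torch.tensor([[1.0, 2.0], [3.0, 4.0]])   # dimension 2: a matrix

print(scalar.ndim, vector.ndim, matrix.ndim)      # 0 1 2
print(matrix.shape)                               # torch.Size([2, 2])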
# due to the size of the dataset, it might be necessary to keep only a fraction of it, here 50%
clean_df_short = clean_df.sample(frac=0.5)
# predictors
ins = clean_df_short.drop([Y, "Y"], axis=1)
# target: collection of 1s and 0s
outs = clean_df_short[[Y, "Y"]]
X = ins.copy()
Y = outs["Y"]

# split train and test
from sklearn.utils import resample
from sklearn.model_selection import train_test_split
import math
import torch

X_2, X_test, y_2, y_test = train_test_split(X, Y, test_size=0.25, stratify=Y)
# split train and validation
X_train, X_val, y_train, y_val = train_test_split(X_2, y_2, test_size=0.25, stratify=y_2)

# upsample X train
# this is done because the number of hits (failures to recover) is very low,
# so it is necessary to rebalance the classes
df_t = pd.concat([pd.DataFrame(X_train), pd.DataFrame(y_train)], axis=1)
df_majority = df_t[df_t[df_t.columns[-1]] < 0.5]
df_minority = df_t[df_t[df_t.columns[-1]] > 0.5]
df_minority_upsampled = resample(df_minority, replace=True, n_samples=math.floor(len(df_majority) * 0.25))
df_upsampled = pd.concat([df_majority, df_minority_upsampled])
df_upsampled = df_upsampled.sample(frac=1).reset_index(drop=True)
X_train = df_upsampled.drop(df_upsampled.columns[-1], axis=1)
y_train = df_upsampled[df_upsampled.columns[-1]]
input_size = X_train.shape[1]

# convert to tensors
X_train = X_train.astype(float).to_numpy()
X_test = X_test.astype(float).to_numpy()
X_val = X_val.astype(float).to_numpy()
y_train = y_train.astype(float).to_numpy()
y_test = y_test.astype(float).to_numpy()
y_val = y_val.astype(float).to_numpy()
X_train = torch.tensor(X_train, dtype=torch.float32)
y_train = torch.tensor(y_train, dtype=torch.long)
X_test = torch.tensor(X_test, dtype=torch.float32)
y_test = torch.tensor(y_test, dtype=torch.long)
X_val = torch.tensor(X_val, dtype=torch.float32)
y_val = torch.tensor(y_val, dtype=torch.long)

train_dataset = torch.utils.data.TensorDataset(X_train, y_train)
test_dataset = torch.utils.data.TensorDataset(X_test, y_test)
val_dataset = torch.utils.data.TensorDataset(X_val, y_val)

# batch size for training, one of the parameters we can use for tuning
batch_size = 700
# this packages the datasets into data loaders
dataloaders = {'train': torch.utils.data.DataLoader(train_dataset, batch_size=batch_size),
               'val': torch.utils.data.DataLoader(val_dataset, batch_size=batch_size),
               'test': torch.utils.data.DataLoader(test_dataset, batch_size=batch_size)}
dataset_sizes = {'train': len(train_dataset), 'val': len(val_dataset), 'test': len(test_dataset)}
print(f'dataset_sizes = {dataset_sizes}')

The output of this block is the size of each of the datasets: train, validation and test.

The next step is to define the neural network. This might take some time and effort, and usually requires retraining and testing parameters and configurations until the desired result is achieved. The approach I recommend is to start with a simple model, check whether it shows any predictive power, and then make it more complex: wider (more neurons) and deeper (more layers). The objective at this stage is to end up with a model that overfits the data. Once that is achieved, the next step is to reduce the overfitting so that the metrics improve on the validation set. We will see more on this in the next steps. The class shown after the short baseline sketch below defines the multilayer perceptron I ended up with.
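To make the "start simple" advice concrete, a first iteration could be as small as the following (an illustrative sketch, not the model used in this project; the hidden width of 64 is an arbitrary assumption):

import torch.nn as nn

# a deliberately small baseline: if even this shows some predictive power,
# it is worth widening and deepening it until the model overfits
class BaselineClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(input_size, 64),   # input_size was computed above from X_train
            nn.ReLU(),
            nn.Linear(64, 2),            # two outputs for the binary target
        )

    def forward(self, x):
        return self.layers(x)

From a baseline like this, neurons and layers are added iteratively until overfitting appears, which leads to the final class below.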
import torch.nn as nn

# this class is the final one, after adding layers, training and iterating to find the best result
class SimpleClassifier(nn.Module):
    def __init__(self):
        super(SimpleClassifier, self).__init__()
        # the dropout layer is introduced to reduce overfitting (as explained, it is set to 0 or very low at first)
        # dropout tells the neural network to randomly drop data between layers to introduce variability
        self.dropout = nn.Dropout(0.1)
        # for the layers I recommend starting at a little over twice the number of columns and increasing from one layer to the next,
        # then decreasing again down to 2, since in this case the response is binary
        self.layers = nn.Sequential(
            nn.Linear(input_size, 250),
            nn.Linear(250, 500),
            nn.Linear(500, 1000),
            nn.Linear(1000, 1500),
            nn.ReLU(),
            self.dropout,
            nn.Linear(1500, 1500),
            nn.Sigmoid(),
            self.dropout,
            nn.Linear(1500, 1500),
            nn.ReLU(),
            self.dropout,
            nn.Linear(1500, 1500),
            nn.Sigmoid(),
            self.dropout,
            nn.Linear(1500, 1500),
            nn.ReLU(),
            self.dropout,
            nn.Linear(1500, 1500),
            nn.Sigmoid(),
            self.dropout,
            nn.Linear(1500, 1500),
            nn.ReLU(),
            self.dropout,
            nn.Linear(1500, 500),
            nn.Sigmoid(),
            self.dropout,
            nn.Linear(500, 500),
            nn.ReLU(),
            self.dropout,
            nn.Linear(500, 500),
            nn.Sigmoid(),
            self.dropout,
            # the last layer outputs 2 since the response variable is binary (0, 1)
            # the output of a multiclass classification should be of the size of the number of classes
            nn.Linear(500, 2),
        )

    def forward(self, x):
        return self.layers(x)

# define the model
model = SimpleClassifier()

The next block deals with the training of the model. These are the training parameters:

epochs: the number of full passes over the training data. Set it low at first, […]
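As a rough sketch of the shape such a training block can take, a generic PyTorch loop over the dataloaders defined earlier might look like the following; the loss function, optimizer, learning rate and epoch count here are illustrative assumptions, not necessarily the settings used in the project:

import torch.optim as optim

criterion = nn.CrossEntropyLoss()                   # matches the 2-unit output with integer class labels
optimizer = optim.Adam(model.parameters(), lr=1e-4)
epochs = 10                                         # keep this low at first, increase once the setup works

for epoch in range(epochs):
    # training pass
    model.train()
    for inputs, labels in dataloaders['train']:
        optimizer.zero_grad()
        loss = criterion(model(inputs), labels)
        loss.backward()
        optimizer.step()

    # validation pass: track accuracy on the validation set to decide how to adjust the model
    model.eval()
    correct = 0
    with torch.no_grad():
        for inputs, labels in dataloaders['val']:
            preds = model(inputs).argmax(dim=1)
            correct += (preds == labels).sum().item()
    print(f"epoch {epoch}: validation accuracy = {correct / dataset_sizes['val']:.3f}")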