Use of Pretrained BERT to Predict the Rating of Reviews

 

Last Updated on June 3, 2024 by Editorial Team

Author(s): Greg Postalian-Yrausquin

Originally published on Towards AI.

BERT is a state-of-the-art algorithm designed by Google to process text data and convert it into vectors (https://en.wikipedia.org/wiki/BERT_(language_model)). These can then be analyzed by other models (classification, clustering, etc.) to produce different analyses. What makes BERT special, apart from its good results, is that it is trained on billions of records and that Hugging Face already provides a good battery of pretrained models we can use for different ML tasks. That said, pretrained BERT works best when the text is free of typos and written in standard, day-to-day language.

```python
import numpy as np
import pandas as pd
pd.options.mode.chained_assignment = None
from io import StringIO
from html.parser import HTMLParser
import re
import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt')
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import seaborn as sns
import warnings
import tensorflow as tf
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, classification_report
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.utils import resample
from sentence_transformers import SentenceTransformer
```

Let's take a quick look at the dataset:

```python
maindataset = pd.read_csv("Restaurant_reviews.csv")
maindataset
```

It is clear that a quick cleanup is needed. Since I am going to use a pretrained BERT model, I will not remove stopwords or common-use words, and I will not stem. I use a set of functions that I keep ready for NLP work to strip out "junk":

```python
class MLStripper(HTMLParser):
    def __init__(self):
        super().__init__()
        self.reset()
        self.strict = False
        self.convert_charrefs = True
        self.text = StringIO()
    def handle_data(self, d):
        self.text.write(d)
    def get_data(self):
        return self.text.getvalue()

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()

def preprepare(eingang):
    # strip HTML tags and lowercase
    ausgang = strip_tags(eingang)
    ausgang = ausgang.lower()
    ausgang = ausgang.replace(u'\xa0', u' ')
    ausgang = re.sub(r'^\s*$', ' ', str(ausgang))
    # replace punctuation and special characters with spaces
    for ch in ['|', 'ï', '»', '¿', '"', "'", '?', '!', ',', ';', '.', '(', ')',
               '{', '}', '[', ']', '~', '@', '#', '$', '%', '^', '&', '*', '<',
               '>', '/', '\\', '`', '+', '=', '_', '-', ':']:
        ausgang = ausgang.replace(ch, ' ')
    ausgang = ausgang.replace('\n', ' ').replace('\r', ' ')
    # keep letters only and collapse repeated whitespace
    ausgang = re.sub('[^a-zA-Z]', ' ', ausgang)
    ausgang = re.sub(' +', ' ', ausgang)
    return ausgang

maindataset["NLPtext"] = maindataset["Review"]
maindataset["NLPtext"] = maindataset["NLPtext"].str.lower()
maindataset["NLPtext"] = maindataset["NLPtext"].apply(lambda x: preprepare(str(x)))
```
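To see what the cleaning step actually does, here is a quick illustrative check on a couple of raw reviews. This is not part of the original walkthrough, and the exact output depends on the contents of the file.

```python
# Illustrative only: compare a few raw reviews with their cleaned versions.
for raw in maindataset["Review"].head(2):
    print(repr(str(raw)), "->", repr(preprepare(str(raw))))
```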
There is an extensive list of pretrained models for BERT. In our case we want to do classification/regression, and we are working with uncased data (Analysis = analysis), so I went for a powerful general-use model trained on over 1 billion sentence pairs. You can see the list of available pretrained models on the main page of the Sentence Transformers package: https://www.sbert.net/docs/sentence_transformer/pretrained_models.html

```python
bertmodel = SentenceTransformer('all-mpnet-base-v2')
```

And that's it, as simple as that. I produce the embeddings for the reviews with this downloaded model:

```python
reviews_embedding = bertmodel.encode(maindataset["NLPtext"])
```

Final prep of the training set: normalize the embeddings and show the distribution of the ratings.

```python
emb = pd.DataFrame(reviews_embedding)
emb.index = maindataset.index

def properscaler(simio):
    scaler = StandardScaler()
    resultsWordstrans = scaler.fit_transform(simio)
    resultsWordstrans = pd.DataFrame(resultsWordstrans)
    resultsWordstrans.index = simio.index
    resultsWordstrans.columns = simio.columns
    return resultsWordstrans

emb = properscaler(emb)
emb['rating'] = pd.to_numeric(maindataset['Rating'], errors='coerce')
emb = emb.dropna()
sns.displot(emb['rating'])
```

I go ahead and split the set into train and test:

```python
outp = train_test_split(emb, train_size=0.7)
finaleval = outp[1]
subset = outp[0]
x_subset = subset.drop(columns=["rating"]).to_numpy()
y_subset = subset['rating'].to_numpy()
x_finaleval = finaleval.drop(columns=["rating"]).to_numpy()
y_finaleval = finaleval[['rating']].to_numpy()
```

Using Keras, I prepared a simple neural network for regression, which means no activation function on the output layer. After several runs, this is the best configuration I found in terms of activation functions and numbers of units.

```python
# initialize
neur = tf.keras.models.Sequential()
# hidden layers
neur.add(tf.keras.layers.Dense(units=150, activation='relu'))
neur.add(tf.keras.layers.Dense(units=250, activation='sigmoid'))
neur.add(tf.keras.layers.Dense(units=700, activation='tanh'))
# output layer / no activation for the output of a regression
neur.add(tf.keras.layers.Dense(units=1, activation=None))
# using MSE for regression; simple and clear
neur.compile(loss='mse', optimizer='adam', metrics=['mse'])
# train
neur.fit(x_subset, y_subset, batch_size=5000, epochs=1000)
```

Predict on the test data:

```python
test_out = neur.predict(x_finaleval)
```

This step might not be necessary, but I do it to be sure that the predictions fall between 1 and 5, as in the original set:

```python
output = outp[1][[0]]
scal = MinMaxScaler(feature_range=(1, 5))
output['predicted'] = scal.fit_transform(test_out)
output['actual'] = y_finaleval
output = output.drop(columns=[0])
output = pd.merge(output, maindataset[['Review']], left_index=True, right_index=True)
output = output.sort_values(['predicted'], ascending=False)
pd.options.display.max_colwidth = 150
output
```

The results look OK at a high level. Now I will examine the statistics of the regression:

```python
print("R2: ", r2_score(output['actual'], output['predicted']))
print("RootMeanSqError: ", np.sqrt(mean_squared_error(output['actual'], output['predicted'])))
print("MeanAbsError: ", mean_absolute_error(output['actual'], output['predicted']))
```

These are actually not very good. But we can still save face here; this is where knowing the use case is important. Since the main goal is extracting the bad reviews, I will mark those under 2.5 on the scale (1 and 2 in the original) as bad reviews and set the rest apart.

```python
output["RangePredicted"] = np.where(output['predicted'] <= 2.5, "1.Bad", "2.Other")
output["RangeActual"] = np.where(output['actual'] <= 2.5, "1.Bad", "2.Other")
ConfusionMatrixDisplay.from_predictions(y_true=output['RangeActual'], y_pred=output['RangePredicted'], cmap='PuBu')
```

And the model performs very well at splitting good reviews from bad ones.
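Since classification_report is already imported above, a per-class precision/recall summary can complement the confusion matrix. This is a minimal sketch, not part of the original walkthrough; it assumes the output dataframe with the RangeActual and RangePredicted columns created in the previous step.

```python
# Optional check: precision, recall and F1 for the "1.Bad" vs "2.Other" buckets,
# using the columns created above on the `output` dataframe.
print(classification_report(output['RangeActual'], output['RangePredicted']))
```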
This type of issue sometimes appears in multiclass classification. The solution in many cases is to split the dataset differently to reduce the number of classes and, if required, run a second training and inference pass on the previously classified subsets to "drill down" into the original classes. In this example, that is not necessary. At this point, the bad reviews can be: 1) further classified using a clustering algorithm, or 2) handed to the customer service department so that they can run their own analysis and explain what can be improved.
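As an illustration of the first option, the embeddings of the flagged bad reviews could be grouped with a clustering algorithm such as k-means. This is only a hedged sketch, not part of the original walkthrough; the variable names (bad_reviews, bad_embeddings) and the choice of 4 clusters are assumptions for illustration.

```python
from sklearn.cluster import KMeans

# Hypothetical follow-up: cluster the reviews predicted as "1.Bad" to surface common complaint themes.
bad_reviews = output[output["RangePredicted"] == "1.Bad"]

# Re-encode the flagged reviews (alternatively, reuse the matching rows of the earlier embedding matrix).
bad_embeddings = bertmodel.encode(maindataset.loc[bad_reviews.index, "NLPtext"].tolist())

kmeans = KMeans(n_clusters=4, random_state=0, n_init=10)
bad_reviews = bad_reviews.assign(cluster=kmeans.fit_predict(bad_embeddings))

# Inspect a few reviews per cluster to label the recurring themes manually.
for c in sorted(bad_reviews["cluster"].unique()):
    print(f"--- cluster {c} ---")
    print(bad_reviews.loc[bad_reviews["cluster"] == c, "Review"].head(3).to_string(index=False))
```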
