NLP has become a huge deal recently, and it is quite easy to do at a basic level in Python. In this exercise, we’ll classify Yelp reviews both with and without text pre-processing. Interestingly, the pre-processing actually didn’t help us here.

Natural Language Processing Project – Simple Example

In this NLP project we will be attempting to classify Yelp Reviews into 1-star or 5-star categories based on the text content of the reviews.

We will use the Yelp Review Data Set from Kaggle.

Each observation in this dataset is a review of a particular business by a particular user.

The “stars” column is the number of stars (1 through 5) assigned by the reviewer to the business. (More stars is better.) In other words, it is the rating of the business by the person who wrote the review.

The “cool” column is the number of “cool” votes this review received from other Yelp users.

All reviews start with 0 “cool” votes, and there is no limit to how many “cool” votes a review can receive. In other words, it is a rating of the review itself, not a rating of the business.

The “useful” and “funny” columns are similar to the “cool” column.

Let’s get started! Just follow the directions below!

Imports

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

The Data

Read the yelp.csv file and set it as a dataframe called yelp.

In [2]:
yelp = pd.read_csv('yelp.csv')
In [3]:
yelp.head()
Out[3]:
business_id date review_id stars text type user_id cool useful funny
0 9yKzy9PApeiPPOUJEtnvkg 2011-01-26 fWKvX83p0-ka4JS3dc6E5A 5 My wife took me here on my birthday for breakf… review rLtl8ZkDX5vH5nAx9C3q5Q 2 5 0
1 ZRJwVLyzEJq1VAihDhYiow 2011-07-27 IjZ33sJrzXqU-0X6U8NwyA 5 I have no idea why some people give bad review… review 0a2KyEL0d3Yb1V6aivbIuQ 0 0 0
2 6oRAC4uyJCsJl1X0WZpVSA 2012-06-14 IESLBzqUCLdSzSqm0eCSxQ 4 love the gyro plate. Rice is so good and I als… review 0hT2KtfLiobPvh6cDC8JQg 0 1 0
3 _1QQZuf4zZOyFCvXc0o6Vg 2010-05-27 G-WvGaISbqqaMHlNnByodA 5 Rosie, Dakota, and I LOVE Chaparral Dog Park!!… review uZetl9T0NcROGOyFfughhg 1 2 0
4 6ozycU1RpktNG2-1BroVtw 2012-01-05 1uJFq2r5QfJG_6ExMRCaGw 5 General Manager Scott Petello is a good egg!!!… review vYmM4KTsC8ZfQBg-j5MWkw 0 0 0
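
Before going further, it can help to confirm the columns described above. A quick sketch using the standard pandas API:

yelp.info()       # column names, dtypes, and non-null counts
yelp.describe()   # summary statistics for the numeric columns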

Create a new column called “text length” which is the number of characters in the text column.

In [7]:
yelp['text length'] = yelp['text'].apply(len)
In [8]:
yelp.head()
Out[8]:
business_id date review_id stars text type user_id cool useful funny text length
0 9yKzy9PApeiPPOUJEtnvkg 2011-01-26 fWKvX83p0-ka4JS3dc6E5A 5 My wife took me here on my birthday for breakf… review rLtl8ZkDX5vH5nAx9C3q5Q 2 5 0 889
1 ZRJwVLyzEJq1VAihDhYiow 2011-07-27 IjZ33sJrzXqU-0X6U8NwyA 5 I have no idea why some people give bad review… review 0a2KyEL0d3Yb1V6aivbIuQ 0 0 0 1345
2 6oRAC4uyJCsJl1X0WZpVSA 2012-06-14 IESLBzqUCLdSzSqm0eCSxQ 4 love the gyro plate. Rice is so good and I als… review 0hT2KtfLiobPvh6cDC8JQg 0 1 0 76
3 _1QQZuf4zZOyFCvXc0o6Vg 2010-05-27 G-WvGaISbqqaMHlNnByodA 5 Rosie, Dakota, and I LOVE Chaparral Dog Park!!… review uZetl9T0NcROGOyFfughhg 1 2 0 419
4 6ozycU1RpktNG2-1BroVtw 2012-01-05 1uJFq2r5QfJG_6ExMRCaGw 5 General Manager Scott Petello is a good egg!!!… review vYmM4KTsC8ZfQBg-j5MWkw 0 0 0 469
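
Note that applying len to a string counts characters, not words. If you wanted actual word counts instead, here is a minimal sketch using a naive whitespace split (the “word count” column name is just an example):

yelp['word count'] = yelp['text'].apply(lambda t: len(t.split()))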

Data Exploration

Let’s explore the data.

Use FacetGrid from the seaborn library to create a grid of 5 histograms of text length based on the star ratings.

In [12]:
sns.set_style('whitegrid')
g = sns.FacetGrid(yelp, col="stars")
g = g.map(plt.hist, "text length")

Create a boxplot of text length for each star category.

In [13]:
ax = sns.boxplot(x="stars", y="text length", data=yelp)

Create a countplot of the number of occurrences for each type of star rating.

In [15]:
sns.countplot(x='stars',data=yelp)

Group the dataframe by “stars”, take the mean of the numeric columns, and then use the corr() method on the result:

In [27]:
# numeric_only=True keeps newer pandas versions from choking on the text columns
corr_matrix = yelp.groupby('stars').mean(numeric_only=True).corr()
In [28]:
corr_matrix
Out[28]:
cool useful funny text length
cool 1.000000 -0.743329 -0.944939 -0.857664
useful -0.743329 1.000000 0.894506 0.699881
funny -0.944939 0.894506 1.000000 0.843461
text length -0.857664 0.699881 0.843461 1.000000

Then use seaborn to create a heatmap based on that .corr() dataframe:

In [89]:
sns.heatmap(corr_matrix,annot=True, cmap="coolwarm")

NLP Classification Task

Let’s move on to the actual task. To make things a little easier, go ahead and only grab reviews that were either 1 star or 5 stars.

Create a dataframe called yelp_class that contains the columns of the yelp dataframe, but only for the 1- or 5-star reviews.

In [51]:
yelp_class = yelp[(yelp['stars'] == 1) | (yelp['stars'] == 5)]
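
Since we are about to train on these two classes, it is worth checking how balanced they are. A quick sketch:

yelp_class['stars'].value_counts()  # 5-star reviews outnumber 1-star roughly 4 to 1

That imbalance will matter when we interpret the results later.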

Create two objects X and y. X will be the ‘text’ column of yelp_class and y will be the ‘stars’ column of yelp_class. (Your features and target/labels)

In [52]:
X = yelp_class['text']
y = yelp_class['stars']

Import CountVectorizer and create a CountVectorizer object.

In [53]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()

Use the fit_transform method on the CountVectorizer object and pass in X (the ‘text’ column). Save this result by overwriting X.

In [54]:
X = cv.fit_transform(X)
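
X is now a sparse document-term matrix rather than a Series of strings. A quick sanity check (a sketch; nothing beyond the fitted cv object above is assumed):

print(X.shape)              # (number of reviews, number of unique tokens)
print(len(cv.vocabulary_))  # vocabulary size, matching the column count above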

Train Test Split

Let’s split our data into training and testing data.

Use train_test_split to split up the data into X_train, X_test, y_train, y_test. Use test_size=0.3 and random_state=101

In [62]:
#Import splitting library
from sklearn.model_selection import train_test_split

#Choose the test size
#Test size = % of dataset allocated for testing (.3 = 30%)
#Random state = seed for the shuffle, so the split is reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)

Training a Model

Time to train a model!

Import MultinomialNB, create an instance of the estimator, and call it nb.

In [69]:
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()

Now fit nb using the training data.

In [70]:
nb.fit(X_train, y_train)

Predictions and Evaluations

Time to see how our model did!

Use the predict method off of nb to predict labels from X_test.

In [72]:
predictions = nb.predict(X_test)

Create a confusion matrix and classification report using these predictions and y_test.

In [73]:
from sklearn.metrics import classification_report, confusion_matrix
In [74]:
print(confusion_matrix(y_test,predictions))
print(classification_report(y_test, predictions))
[[159  69]
 [ 22 976]]
              precision    recall  f1-score   support

           1       0.88      0.70      0.78       228
           5       0.93      0.98      0.96       998

    accuracy                           0.93      1226
   macro avg       0.91      0.84      0.87      1226
weighted avg       0.92      0.93      0.92      1226
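
As an optional aside (not part of the original directions), MultinomialNB exposes per-class log probabilities that can be paired with the vectorizer’s vocabulary to peek at the most 5-star-leaning words. A sketch, assuming scikit-learn 1.0+ for get_feature_names_out:

import numpy as np

feature_names = cv.get_feature_names_out()
# nb.classes_ is [1, 5], so row 1 of feature_log_prob_ is the 5-star class
log_ratio = nb.feature_log_prob_[1] - nb.feature_log_prob_[0]
print(feature_names[np.argsort(log_ratio)[-10:]])  # the ten most 5-star-leaning words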

Great! Let’s see what happens if we add TF-IDF to this process using a pipeline.

Using Text Processing

Let’s see if we can improve our model by performing some text processing.

Import TfidfTransformer from sklearn.

In [75]:
from sklearn.feature_extraction.text import TfidfTransformer
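
Before wiring TfidfTransformer into a pipeline, here is what it does to a toy count matrix (a sketch with made-up documents, purely for illustration):

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

docs = ["good food good service", "bad food", "good"]  # hypothetical toy corpus
counts = CountVectorizer().fit_transform(docs)
print(TfidfTransformer().fit_transform(counts).toarray())  # rows become L2-normalized tf-idf weights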

Import Pipeline from sklearn.

In [76]:
from sklearn.pipeline import Pipeline

Now create a pipeline with the following steps: CountVectorizer(), TfidfTransformer(), MultinomialNB().

In [80]:
pipeline = Pipeline([
    ('process',CountVectorizer()),
    ('tfidf',TfidfTransformer()),
    ('classifier',MultinomialNB())
])

Using the Pipeline

Time to use the pipeline! Remember, this pipeline already has all your pre-processing steps in it, meaning we’ll need to re-split the original data. (Remember that we overwrote X with the CountVectorized version; what we need is just the text.)

Train Test Split

Redo the train test split on the yelp_class object.

In [85]:
X = yelp_class['text']
y = yelp_class['stars']

#Choose the test size
#Test size = % of dataset allocated for testing (.3 = 30%)
#Random state = seed for the shuffle, so the split is reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)

Now fit the pipeline to the training data. Remember, you can’t use the same training data as last time because that data has already been vectorized. We need to pass in just the text and labels.

In [86]:
pipeline.fit(X_train,y_train)
Out[86]:
Pipeline(steps=[('process', CountVectorizer()), ('tfidf', TfidfTransformer()),
                ('classifier', MultinomialNB())])
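
A side note on the design: the step names (“process”, “tfidf”, “classifier”) are arbitrary labels that just need to be unique. Now that the pipeline is fitted, they let you reach into individual steps; a quick sketch:

# The step names are keys into named_steps; e.g. inspect the fitted CountVectorizer
print(len(pipeline.named_steps['process'].vocabulary_))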

Predictions and Evaluation

Now use the pipeline to predict from X_test and create a classification report and confusion matrix. You should notice strange results.

In [87]:
predictions = pipeline.predict(X_test)
In [88]:
print(confusion_matrix(y_test,predictions))
print(classification_report(y_test, predictions))
[[  0 228]
 [  0 998]]
              precision    recall  f1-score   support

           1       0.00      0.00      0.00       228
           5       0.81      1.00      0.90       998

    accuracy                           0.81      1226
   macro avg       0.41      0.50      0.45      1226
weighted avg       0.66      0.81      0.73      1226

C:\Users\kaled\Anaconda3\lib\site-packages\sklearn\metrics\_classification.py:1221: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))
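
Before drawing a conclusion, one way to probe what happened (a sketch using the fitted pipeline and test split from above) is to check whether the 1-star class ever comes close in the predicted probabilities:

import numpy as np

proba = pipeline.predict_proba(X_test)
print(pipeline.classes_)       # column order of proba, here [1 5]
print(np.round(proba[:5], 3))  # the 5-star column wins in every row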

Looks like TF-IDF actually made things worse: the pipeline predicted 5 stars for every single review. A likely culprit is that the TF-IDF weighting shrinks the word-count features into small, L2-normalized values, weakening the likelihood term in Naive Bayes relative to the class priors; with 5-star reviews outnumbering 1-star reviews roughly 4 to 1, the majority class wins every time. That’s it for this project.
