Another small piece of NLP showing how to do a quick-and-dirty sentiment analysis using VADER within NLTK. It doesn't give the best accuracy on every dataset, but it removes a lot of complexity. Details below in the Jupyter Notebook.
Natural Language Processing in Python¶
Sentiment Utilizing VADER¶
This is a fairly simple example of how to perform sentiment analysis using the NLTK library and the pre-trained VADER (Valence Aware Dictionary and sEntiment Reasoner) model. Because VADER is rule-based and pre-trained, it spares you from performing train/test splits and fitting a model yourself. However, VADER isn't the best choice in every situation: it may perform poorly on some kinds of text, in which case a custom model may be the better option.
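One practical note: VADER's lexicon ships separately from the NLTK package itself, so a one-time download is needed before the analyzer can be instantiated. A minimal setup sketch:
import nltk
# One-time download of the VADER lexicon (cached locally after the first run)
nltk.download('vader_lexicon')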
The dataset we'll be using is NLTK's built-in movie_reviews corpus of IMDB reviews (2,000 records, split evenly between positive and negative). In this case, it's been exported to a tab-delimited file here.
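If you want to reproduce the file yourself, here is a rough sketch of one way to export it from the NLTK corpus. This assumes the movie_reviews corpus has been downloaded via nltk.download('movie_reviews'); the actual file linked above may have been produced differently:
from nltk.corpus import movie_reviews
import pandas as pd
# Build (label, review) pairs from the raw corpus files ('neg' and 'pos')
rows = [(cat, movie_reviews.raw(fid))
        for cat in movie_reviews.categories()
        for fid in movie_reviews.fileids(cat)]
pd.DataFrame(rows, columns=['label', 'review']).to_csv('Datasets/moviereviews.tsv', sep='\t', index=False)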
# Imports
import numpy as np
import pandas as pd
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
df = pd.read_csv('Datasets/moviereviews.tsv', sep='\t')
df.head()
# Instantiate the Sentiment Intensity Analyzer object
sid = SentimentIntensityAnalyzer()
# Manual example
review1 = 'This is the BEST movie I have ever seen!!!'
review2 = 'This is the WORST movie in the entire world, and I hate it SO much.'
# Manual example scores
# The compound score shows this review is mostly positive (close to +1)
sid.polarity_scores(review1)
# The compound score shows this review is mostly negative (close to -1)
sid.polarity_scores(review2)
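polarity_scores() returns a dict with four fields: neg, neu, and pos (the proportions of the text falling into each class) and compound, a normalized aggregate score running from -1 (most negative) to +1 (most positive). A quick sketch of pulling out just the aggregate:
# The compound field is the single number most analyses key off of;
# per the comments above, compound1 should land near +1 and compound2 near -1
compound1 = sid.polarity_scores(review1)['compound']
compound2 = sid.polarity_scores(review2)['compound']
print(compound1, compound2)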
Data Cleaning / Mini-EDA¶
# Check for NULL
df.isnull().sum()
# Drop NULL
df.dropna(inplace=True)
# Get rid of blanks / check for blanks
blanks = []
for i, lb, rv in df.itertuples():
    # (index, label, review)
    if type(rv) == str:
        if rv.isspace():
            blanks.append(i)
len(blanks)
df.drop(blanks,inplace=True)
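For reference, the same whitespace check could have been written in one vectorized pass with pandas (equivalent to the loop above once the NULLs have been dropped, since every remaining review is a string):
# Vectorized alternative: flag whitespace-only reviews directly
blanks_alt = df[df['review'].str.isspace()].index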
# Example text from a single review
df.iloc[0]['review']
Calculate the Polarity of Each Review¶
# Score each review and store the full scores dict
df['scores'] = df['review'].apply(lambda review: sid.polarity_scores(review))
# Pull out the compound score (the value we really care about)
df['compound'] = df['scores'].apply(lambda d: d['compound'])
# Assign pos/neg text to each compound score
df['comp_score'] = df['compound'].apply(lambda score: 'pos' if score >= 0 else 'neg')
df.head()
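As an aside, the cutoff at 0 above is a simplification driven by this dataset only having pos/neg labels; VADER's own documentation suggests a neutral band around zero. A sketch of that three-way mapping, for illustration only (the comp_score3 column name is made up and isn't used below):
# Three-way mapping using the commonly cited +/-0.05 thresholds
def three_way(score):
    if score >= 0.05:
        return 'pos'
    elif score <= -0.05:
        return 'neg'
    return 'neu'
df['comp_score3'] = df['compound'].apply(three_way)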
# Calculate the accuracy of the VADER analysis
# Not great, only 'OK'; this shows why a custom model may be better
accuracy_score(df['label'], df['comp_score'])
# Classification Report
print(classification_report(df['label'], df['comp_score']))
# Confusion Matrix
print(confusion_matrix(df['label'], df['comp_score']))
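To close the loop on the accuracy comment above, here is a minimal sketch of the kind of custom model it alludes to: a TF-IDF + linear SVM pipeline trained on the same data. The split ratio and default hyperparameters are illustrative, not tuned:
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# The train/test split and model fit that VADER let us skip earlier
X_train, X_test, y_train, y_test = train_test_split(df['review'], df['label'], test_size=0.3, random_state=42)
clf = Pipeline([('tfidf', TfidfVectorizer()), ('svc', LinearSVC())])
clf.fit(X_train, y_train)

# Evaluate the custom model on the held-out reviews
print(accuracy_score(y_test, clf.predict(X_test)))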