Another little bit of NLP, showing how to do a quick-and-dirty sentiment analysis using VADER within NLTK. It doesn't give the best accuracy on every dataset, but it keeps the workflow simple. Details below in the Jupyter Notebook.

Natural Language Processing in Python

Sentiment Utilizing VADER

This is a fairly simple example of how to perform sentiment analysis using the NLTK library and the pre-trained VADER model. The advantage of this approach is its simplicity: it saves you from performing train/test splits and fitting a model yourself. However, VADER isn't the best choice in every situation, as it may not perform well on all kinds of text (in which case a custom model may be better).
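
If VADER hasn't been used on your machine before, its lexicon needs to be downloaded once. A minimal setup sketch, assuming a standard NLTK install:

# One-time setup: download the lexicon used by SentimentIntensityAnalyzer
import nltk
nltk.download('vader_lexicon')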

The dataset we'll be using is NLTK's built-in movie_reviews corpus of IMDB reviews (2,000 records). In this case, it has been exported to a tab-delimited file here.
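
The notebook below simply reads that file, but for reference, the same data could be pulled straight from NLTK. A rough sketch, assuming the movie_reviews corpus has already been downloaded via nltk.download('movie_reviews') (the exact file used here may have been prepared differently):

# Sketch: export NLTK's movie_reviews corpus to a tab-delimited file
import pandas as pd
from nltk.corpus import movie_reviews

rows = [(cat, movie_reviews.raw(fid))
        for cat in movie_reviews.categories()      # 'neg', 'pos'
        for fid in movie_reviews.fileids(cat)]
pd.DataFrame(rows, columns=['label', 'review']).to_csv(
    'Datasets/moviereviews.tsv', sep='\t', index=False)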

In [3]:
# Imports
import numpy as np
import pandas as pd
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
In [4]:
df = pd.read_csv('Datasets/moviereviews.tsv', sep='\t')
df.head()
Out[4]:
label review
0 neg how do films like mouse hunt get into theatres...
1 neg some talented actresses are blessed with a dem...
2 pos this has been an extraordinary year for austra...
3 pos according to hollywood movies made in last few...
4 neg my first press screening of 1998 and already i...
In [5]:
# Instantiate the Sentiment Intensity Analyzer object
sid = SentimentIntensityAnalyzer()
In [6]:
# Manual example
review1 = 'This is the BEST movie I have ever seen!!!'
review2 = 'This is the WORST movie in the entire world, and I hate it SO much.'
In [7]:
# Manual example scores
# The compound score shows this review is mostly positive (close to +1)
sid.polarity_scores(review1)
Out[7]:
{'neg': 0.0, 'neu': 0.546, 'pos': 0.454, 'compound': 0.7788}
In [8]:
# The compound score shows this review is mostly negative (close to -1)
sid.polarity_scores(review2)
Out[8]:
{'neg': 0.416, 'neu': 0.584, 'pos': 0.0, 'compound': -0.8602}

Data Clean / Mini-EDA

In [9]:
# Check for NULL
df.isnull().sum()
Out[9]:
label      0
review    35
dtype: int64
In [10]:
# Drop NULL
df.dropna(inplace=True)
In [11]:
# Find whitespace-only reviews so they can be dropped in the next cell
blanks = []

for i, lb, rv in df.itertuples():
    # index, label, review
    if type(rv) == str:
        if rv.isspace():
            blanks.append(i)
In [12]:
len(blanks)
Out[12]:
27
In [13]:
df.drop(blanks,inplace=True)
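
As an aside, the loop-and-drop above could also be done in one vectorized step; a rough equivalent (same effect, assuming the review column is all strings after dropna):

# Alternative: filter out whitespace-only reviews without an explicit loop
df = df[~df['review'].str.isspace()]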
In [14]:
# Full text of a single example review
df.iloc[0]['review']
Out[14]:
'how do films like mouse hunt get into theatres ? \r\nisn\'t there a law or something ? \r\nthis diabolical load of claptrap from steven speilberg\'s dreamworks studio is hollywood family fare at its deadly worst . \r\nmouse hunt takes the bare threads of a plot and tries to prop it up with overacting and flat-out stupid slapstick that makes comedies like jingle all the way look decent by comparison . \r\nwriter adam rifkin and director gore verbinski are the names chiefly responsible for this swill . \r\nthe plot , for what its worth , concerns two brothers ( nathan lane and an appalling lee evens ) who inherit a poorly run string factory and a seemingly worthless house from their eccentric father . \r\ndeciding to check out the long-abandoned house , they soon learn that it\'s worth a fortune and set about selling it in auction to the highest bidder . \r\nbut battling them at every turn is a very smart mouse , happy with his run-down little abode and wanting it to stay that way . \r\nthe story alternates between unfunny scenes of the brothers bickering over what to do with their inheritance and endless action sequences as the two take on their increasingly determined furry foe . \r\nwhatever promise the film starts with soon deteriorates into boring dialogue , terrible overacting , and increasingly uninspired slapstick that becomes all sound and fury , signifying nothing . \r\nthe script becomes so unspeakably bad that the best line poor lee evens can utter after another run in with the rodent is : " i hate that mouse " . \r\noh cringe ! \r\nthis is home alone all over again , and ten times worse . \r\none touching scene early on is worth mentioning . \r\nwe follow the mouse through a maze of walls and pipes until he arrives at his makeshift abode somewhere in a wall . \r\nhe jumps into a tiny bed , pulls up a makeshift sheet and snuggles up to sleep , seemingly happy and just wanting to be left alone . \r\nit\'s a magical little moment in an otherwise soulless film . \r\na message to speilberg : if you want dreamworks to be associated with some kind of artistic credibility , then either give all concerned in mouse hunt a swift kick up the arse or hire yourself some decent writers and directors . \r\nthis kind of rubbish will just not do at all . \r\n'

Calculate Polarity of each review

In [15]:
# Compute VADER scores for each review
df['scores'] = df['review'].apply(lambda review: sid.polarity_scores(review))
In [16]:
# Pull out compound score (what we really look at)
df['compound'] = df['scores'].apply(lambda d:d['compound'])
In [17]:
# Map each compound score to a pos/neg label
df['comp_score'] = df['compound'].apply(lambda score: 'pos' if score >= 0 else 'neg')
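
Note that the score >= 0 cutoff forces every review into pos or neg, which matches the binary labels in this dataset. The VADER documentation commonly suggests thresholds of roughly +/- 0.05 with a neutral bucket in between; a hypothetical three-way variant (not used here, since the metrics below expect exactly two classes):

# Hypothetical three-way mapping using the commonly cited +/- 0.05 thresholds
def vader_label(score, threshold=0.05):
    if score >= threshold:
        return 'pos'
    if score <= -threshold:
        return 'neg'
    return 'neu'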
In [18]:
df.head()
Out[18]:
label review scores compound comp_score
0 neg how do films like mouse hunt get into theatres... {'neg': 0.121, 'neu': 0.778, 'pos': 0.101, 'co... -0.9125 neg
1 neg some talented actresses are blessed with a dem... {'neg': 0.12, 'neu': 0.775, 'pos': 0.105, 'com... -0.8618 neg
2 pos this has been an extraordinary year for austra... {'neg': 0.067, 'neu': 0.783, 'pos': 0.15, 'com... 0.9953 pos
3 pos according to hollywood movies made in last few... {'neg': 0.069, 'neu': 0.786, 'pos': 0.145, 'co... 0.9972 pos
4 neg my first press screening of 1998 and already i... {'neg': 0.09, 'neu': 0.822, 'pos': 0.088, 'com... -0.7264 neg
In [19]:
# Calculate accuracy of the VADER analysis
# Not great, only OK; this shows why a custom model may be better
accuracy_score(df['label'],df['comp_score'])
Out[19]:
0.6367389060887513
In [20]:
# Classification Report
print(classification_report(df['label'],df['comp_score']))
              precision    recall  f1-score   support

         neg       0.72      0.44      0.55       969
         pos       0.60      0.83      0.70       969

   micro avg       0.64      0.64      0.64      1938
   macro avg       0.66      0.64      0.62      1938
weighted avg       0.66      0.64      0.62      1938

In [21]:
# Confusion Matrix
print(confusion_matrix(df['label'],df['comp_score']))
[[427 542]
 [162 807]]
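
By default scikit-learn orders the confusion matrix alphabetically by label, so here the rows are the true labels [neg, pos] and the columns are the predictions [neg, pos]. Passing the labels explicitly makes that ordering self-documenting; a small sketch:

# Make the row/column order explicit: rows are true labels, columns are predictions
print(confusion_matrix(df['label'], df['comp_score'], labels=['neg', 'pos']))
# With this ordering: 427 neg reviews correctly labeled neg, 542 neg reviews mislabeled pos,
# 162 pos reviews mislabeled neg, and 807 pos reviews correctly labeled pos.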