A recent project I completed related to Natural Language Processing, embedded below as a Jupyter Notebook.
Sentiment Analysis of User Comments on Political Subreddits¶
Author: Mike Kale
For this project I'll be analyzing the sentiment of comments on the social news website Reddit.com. The goal is to examine how the sentiment of user comments is affected by political ideology. My source data will be comments pulled from Reddit.com.
Test #1
- Pull all comments from 100 stories in each of three political subreddits (Liberal, Conservative, NeutralPolitics)
- Run sentiment analysis on all three sets
- Compare sentiment percentage from each
Test #2
- Pull only top-level comments from 100 stories in each of the same three subreddits
- Run sentiment analysis on all three sets
- Compare sentiment percentage from each
Test #3
- Pull the same story from two ideologically different subreddits
- Run sentiment analysis on both sets
- Compare sentiment percentage from each
Some algorithms are adapted from a great tutorial found here: www.stackovercloud.com/2019/09/27/how-to-perform-sentiment-analysis-in-python-3-using-the-natural-language-toolkit-nltk/
Reddit Connection¶
# Connection to Reddit's API
import pandas as pd
import praw
# I've had to clear this section out since it contains personal information.
# To run this code, you would need to go to Reddit.com and register
# a new application, which provides the API credentials to input below.
reddit = praw.Reddit(
client_id="<fill me out>",
client_secret="<fill me out>",
user_agent="SentimentTest/0.0.1",
)
print(reddit.read_only)
# Output: True - the instance runs in read-only mode since no user credentials are supplied
NLP / Sentiment Analysis Model¶
The following algorithm is used to cleanse the comments into a format that can be classified by an NLP model.
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import twitter_samples, stopwords
from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize, TweetTokenizer
from nltk import FreqDist, classify, NaiveBayesClassifier
import re, string, random
tweet_tknzr = TweetTokenizer() #Use this over word_tokenize because of contractions
def remove_noise(tokens):
"""
1. Removes 'noise' such as HTML/URL/Emojis to perform analysis
2. Lemmatize/Normalize like words
3. Remove English stop words
"""
cleaned_tokens = []
stop_words = stopwords.words('english')
for token, tag in pos_tag(tokens):
token = re.sub('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+#]|[!*\(\),]|'\
'(?:%[0-9a-fA-F][0-9a-fA-F]))+','', token)
token = re.sub("(@[A-Za-z0-9_]+)","", token)
if tag.startswith("NN"):
pos = 'n'
elif tag.startswith('VB'):
pos = 'v'
else:
pos = 'a'
lemmatizer = WordNetLemmatizer()
token = lemmatizer.lemmatize(token, pos)
if len(token) > 0 and token not in string.punctuation and token.lower() not in stop_words:
cleaned_tokens.append(token.lower())
return cleaned_tokens
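As a quick illustration of what this produces, here's a minimal sketch on a made-up comment (assuming the required NLTK data packages - punkt, stopwords, wordnet, and averaged_perceptron_tagger - have already been downloaded):
# Hypothetical sample text, not taken from the Reddit data
sample = "Check out https://example.com for the BEST takes I have seen today!"
print(remove_noise(tweet_tknzr.tokenize(sample)))
# Roughly: ['check', 'best', 'take', 'see', 'today'] (exact tokens depend on the POS tagger)
The URL is stripped, stop words and punctuation are dropped, and the remaining words are lemmatized and lowercased.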
def get_all_words(cleaned_tokens_list):
# Return all words from cleaned token list
for tokens in cleaned_tokens_list:
for token in tokens:
yield token
def get_tokens_for_model(cleaned_tokens_list):
# Return all tokens for model classification
for tokens in cleaned_tokens_list:
yield dict([token, True] for token in tokens)
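Each dictionary yielded above is simply the bag-of-words feature set that NLTK's NaiveBayesClassifier expects, with every token mapped to True. A tiny sketch with invented tokens:
# Hypothetical tokens, purely to show the feature format
tokens = ['great', 'job', 'steve']
print(dict([token, True] for token in tokens))
# {'great': True, 'job': True, 'steve': True}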
Train Model¶
"""
Utilize Positive / Negative tweets that come packaged with
the NLTK to help train the model for classification. This consists
of a total of 14,000 tweets - 7000 positive and 7000 negative.
"""
positive_tweet_tokens = twitter_samples.tokenized('positive_tweets.json')
negative_tweet_tokens = twitter_samples.tokenized('negative_tweets.json')
positive_cleaned_tokens_list = []
negative_cleaned_tokens_list = []
#Run through cleaning algorithm
for tokens in positive_tweet_tokens:
positive_cleaned_tokens_list.append(remove_noise(tokens))
#Run through cleaning algorithm
for tokens in negative_tweet_tokens:
negative_cleaned_tokens_list.append(remove_noise(tokens))
#Test to show words leftover
#all_pos_words = get_all_words(positive_cleaned_tokens_list)
# Return tokens for model
positive_tokens_for_model = get_tokens_for_model(positive_cleaned_tokens_list)
negative_tokens_for_model = get_tokens_for_model(negative_cleaned_tokens_list)
# Form dataset
positive_dataset = [(tweet_dict, "Positive")
for tweet_dict in positive_tokens_for_model]
negative_dataset = [(tweet_dict, "Negative")
for tweet_dict in negative_tokens_for_model]
dataset = positive_dataset + negative_dataset
# Shuffle to randomize
random.shuffle(dataset)
# Break into training / testing datasets
train_data = dataset[:7000]
test_data = dataset[7000:]
# Form classifier model
classifier = NaiveBayesClassifier.train(train_data)
# Print accuracy based on test data
print("Accuracy is:", classify.accuracy(classifier, test_data))
classifier.show_most_informative_features(10)  # Prints its table directly (returns None)
With an accuracy of over 99.5% on the held-out tweet test data, I'm very confident in the model.
Quick Manual Example of Model¶
results = []
sample_1 = 'I really like this idea, good job Steve!'
sample_2 = 'This is the worst idea I have ever heard of, stop talking to me!'
custom_tokens_1 = remove_noise(tweet_tknzr.tokenize(sample_1))
custom_tokens_2 = remove_noise(tweet_tknzr.tokenize(sample_2))
print(classifier.classify(dict([token, True] for token in custom_tokens_1)))  # Expected: Positive
print(classifier.classify(dict([token, True] for token in custom_tokens_2)))  # Expected: Negative
Test 1 - All Comment Analysis¶
- Pull all comments from 100 stories in each of three political subreddits (Liberal, Conservative, NeutralPolitics)
- Run sentiment analysis on all three sets
- Compare sentiment percentage from each
Test 1 - Data Pull¶
# Assign subreddits and print out some basic info to test
subreddits = [reddit.subreddit("Liberal"),reddit.subreddit("Conservative"),reddit.subreddit("NeutralPolitics")]
for subreddit in subreddits:
print(subreddit.display_name)
print(subreddit.description)
Sample output (truncated due to length):
Liberal **Welcome to /r/Liberal!** **Submission Guidelines** * Do not submit pictures * Do not submit videos * Do not submit memes...
# Grab IDs for 100 stories in each of the three subreddits (the "hot" 100)
from time import sleep

story_ids = {}
for subreddit in subreddits:
    for submission in subreddit.hot(limit=100):
        # Loop through comment hierarchy
        while True:
            try:
                submission.comments.replace_more(limit=None)  # Remove 'show more comments'
                break
            except Exception:  # Broad retry; narrow this to the specific PRAW exception you expect
                print("Handling replace_more exception")
                sleep(1)
        story_ids[submission.id] = [subreddit.display_name, submission.title, submission.comments.list()]  # Put comments in dictionary
# These ID's allow us to pull the sentiment for each individual story per subreddit
for key, value in story_ids.items():
print(key, '->', value)
Sample output (truncated due to length):
mrmrmn -> ['Liberal', 'Nancy Pelosi Reveals Mitch McConnell Blocked Ruth Bader Ginsburg From Getting Capitol Rotunda Memorial', [Comment(id='guni694'), Comment(id='gun9kzd'), Comment(id='gun6g67'), Comment(id='gunrb5k'), Comment(id='guo43lv'), Comment(id='gun9pts'), Comment(id='gunzaij'), Comment(id='gunx3hm'), Comment(id='guo42tq'), Comment(id='guntl7b'), Comment(id='guo4i4j'), Comment(id='guo8isc'), Comment(id='guo51mi'), Comment(id='gunifpa'), Comment(id='gunp802'), Comment(id='gunrykp'), Comment(id='guoebi5'), Comment(id='guniiqg')]] mre2ul -> ['Liberal', 'US jobless claims plunge to 576,000, lowest since pandemic', [Comment(id='gult7hk'), Comment(id='guls41x'), Comment(id='gulldj6'), Comment(id='gunkas3'), Comment(id='gulu5i6'), Comment(id='gum344z'), Comment(id='gunbvb6'), Comment(id='gunkifd'), Comment(id='gunlj54'), Comment(id='gulv1lr'), Comment(id='gum4zjx'), Comment(id='gum21x4'), Comment(id='gumk6bn'), Comment(id='gumbkei'), Comment(id='guluu0u'), Comment(id='gulunt8'), Comment(id='gum8l81'), Comment(id='gumtfmq'), Comment(id='gum2ol0'), Comment(id='guluw2o')]]
# Dump comments into a pandas dataframe
df_rows = []
for key, value in story_ids.items():
# Loop through comment lists in value[2]
for comment in value[2]:
df_rows.append([key, value[0], value[1], comment.id, comment.score, comment.created, comment.body])
df = pd.DataFrame(df_rows, columns=['Post ID', 'Subreddit', 'Post Title', 'Comment ID', 'Score', 'Created', 'Body'])
df = df[~df['Body'].isin(['[deleted]', '[removed]'])].reset_index(drop=True)  # Exclude any deleted/removed comments
df.head()
# Here we can see we have a total of ~18k comments
df.info()
Test 1 - EDA¶
- Analyze all comments from 100 stories for sentiment, grouped by subreddit.
- Then, analyze all comments from 100 stories for sentiment, group by subreddit AND story ID (then aggregate).
# Gather sentiment for every comment and dump in a list
custom_tokens = []
results = []
for index, row in df.iterrows():
    custom_tokens.append(remove_noise(tweet_tknzr.tokenize(row['Body'])))
    results.append([row['Post ID'], row['Subreddit'], row['Body'],
                    classifier.classify(dict([token, True] for token in custom_tokens[index]))])
# Analyze sentiment gathered for all comments on all stories by subreddit
# Display percentage
df_r = pd.DataFrame([result[1:] for result in results], columns=['Subreddit','Body','Sentiment'])
df_r_p = df_r[['Subreddit','Sentiment']].pivot_table(index='Subreddit', columns='Sentiment', aggfunc=len, fill_value=0)
df_r_p['% Negative'] = round((df_r_p['Negative'] / (df_r_p['Positive'] + df_r_p['Negative'])) * 100,2)
df_r_p['% Positive'] = round((df_r_p['Positive'] / (df_r_p['Positive'] + df_r_p['Negative'])) * 100,2)
df_r_p
# Sampling of Pos/Neg Comments
df_r[df_r['Sentiment'] == 'Negative']['Body'].tolist() #Negative
Sample output (truncated due to length):
['What?! Holy shit!', 'Does it really need to be revealed that Mitch McConnell is a cunt ?', 'when fuck knuckle McConnell dies, please make sure that his coffin is proudly displayed in a confederate flag draped dumpster', 'And yet Democrats still pretend they can work with these political arsonists.', 'Dick', 'Nancy is a gross too.', "Just horrible. I wasn't aware of this.", 'No, but it needs to be said. Repeatedly.', 'Nope.\n\n\nThe cumulative analysis of everything he has done and said to hurt everyone in the USA?\n\n\nAnd the support McConnell has shown disgraced insurrectionist and delusional one term former President Trump __(the only President to be impeached twice)__ is unfathomable.\n\n\n\n.',
df_r[df_r['Sentiment'] == 'Positive']['Body'].tolist() #Positive
Sample output (truncated due to length):
['Just horrible.\n\n\n\n>“Mitch McConnell is not a force for good in our country,” Pelosi told me. “He is an enabler of some of the worst stuff, and an instigator of some of it on his own.” The two congressional leaders had never had a particularly good relationship. Now there was bitterness from a new dispute between them, one not reported at the time. When Supreme Court justice Ruth Bader Ginsburg died in September, Pelosi proposed that the groundbreaking feminist lie in state in the Capitol Rotunda. She would have been the first woman in history to be so honored.\n\n>McConnell rejected the idea on the grounds that there was no precedent for such treatment of a justice. When William Howard Taft had lain in state in 1930, he had been not only the chief justice but also president, McConnell noted.\n\n>He wasn’t swayed by the argument that Ginsburg had achieved an iconic status in American culture, especially for women and girls. McConnell’s refusal meant that Ginsburg’s flag-draped coffin was placed not in the Rotunda, which connects the House and Senate, but in Statuary Hall, on the House side.\n\n>McConnell and House Republican leader Kevin McCarthy didn’t accept invitations to attend the service for her.', 'Because of Course he did.',
Although this gives a good breakdown across all comments, it may not account for the fact that a single story could contain all positive or all negative comments. Perhaps a better method of analysis would be to group the stories by ID first and then take the percentages.
# Analyze sentiment gathered for all comments on all stories by subreddit AND story ID
# Display percentage
df_r = pd.DataFrame(results, columns=['Post ID','Subreddit','Body','Sentiment'])
df_r_p = df_r[['Post ID','Subreddit','Sentiment']].pivot_table(index=['Subreddit','Post ID'], columns='Sentiment', aggfunc=len, fill_value=0)
df_r_p['% Negative'] = round((df_r_p['Negative'] / (df_r_p['Positive'] + df_r_p['Negative'])) * 100,2)
df_r_p['% Positive'] = round((df_r_p['Positive'] / (df_r_p['Positive'] + df_r_p['Negative'])) * 100,2)
df_r_p
# Group by Subreddit and take average of averages
df_r_p.groupby('Subreddit').agg({'Negative':'sum',
'Positive':'sum',
'% Negative':'mean',
'% Positive':'mean'}).round(decimals=2)
Test 1 - Conclusion¶
In testing through several iterations, I've found that comments on the top 100 stories tend to lean fairly clearly toward negative (sometimes more than others).
The problem with taking a percentage grouped only by subreddit is that it ignores the story IDs, and may therefore produce skewed numbers when a single story is all positive or all negative.
We can try to account for this by including the story ID in the groupby and taking an average of averages. Doing so highlights that the Conservative subreddit tends to skew toward more negative sentiment, while the Liberal and NeutralPolitics subreddits sit closer to the middle (although we're dealing with very small percentage differences here).
Another way to put it: for every negative comment posted on a given story in Liberal/NeutralPolitics, there appears to be at least one positive comment as well when normalizing by grouping by Post ID first.
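To make that concrete, here's a hypothetical toy example (made-up data, not from the Reddit pull) showing how the pooled percentage and the average-of-averages can disagree when a single story dominates:
# Hypothetical toy data to illustrate pooled % vs. average of per-story %
toy = pd.DataFrame({'Post ID': ['a'] * 4 + ['b'] * 2,
                    'Sentiment': ['Negative'] * 4 + ['Positive'] * 2})
pooled = (toy['Sentiment'] == 'Negative').mean() * 100  # 4 of 6 comments -> 66.7% negative
per_story = toy.groupby('Post ID')['Sentiment'].apply(
    lambda s: (s == 'Negative').mean() * 100).mean()    # (100 + 0) / 2 -> 50% negative
print(round(pooled, 1), round(per_story, 1))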
However, when reviewing what the model claims is positive or negative, we can easily pick out quite a few comments that most readers would have placed in the negative column instead of positive. This may cut both ways, but I have a feeling the negative skew should be even higher. With time to train and validate on other data, we might be able to produce more accurate results.
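One way to quantify that concern would be to hand-label a small sample of Reddit comments and score the classifier against them. A minimal sketch of the mechanics (the two labeled comments are invented for illustration; a real check would need at least a few hundred labeled examples):
# Hypothetical hand-labeled comments, purely to show the evaluation mechanics
labeled_reddit = [('This is a great step forward.', 'Positive'),
                  ('What a complete disaster.', 'Negative')]
gold = [(dict([t, True] for t in remove_noise(tweet_tknzr.tokenize(text))), label)
        for text, label in labeled_reddit]
print(classify.accuracy(classifier, gold))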
Test 2 - Top-level Analysis Only¶
With all of these data pulls, we've been assuming that we should pull every comment on a story, not only the top-level comments. The problem is that this may pull in long discussion threads in which people argue back and forth on the subject, skewing the results toward negativity. If we instead pull only the top-level comments, perhaps we can get a better sense of each user's initial reaction. (A minimal sketch of the difference in the pull follows the list below.)
- Analyze top-level comments from 100 stories for sentiment, grouped by subreddit.
- Then, analyze top-level comments from 100 stories for sentiment, group by subreddit AND story ID (then aggregate).
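The only difference from Test 1 is where the comments come from in PRAW: submission.comments.list() flattens the entire comment tree, while iterating submission.comments directly yields only the top-level replies. A minimal sketch (the submission ID is a placeholder):
# Hypothetical submission ID, purely for illustration
submission = reddit.submission(id="abc123")
submission.comments.replace_more(limit=None)
all_comments = submission.comments.list()                 # every comment in the thread, flattened
top_level = [comment for comment in submission.comments]  # direct replies to the post only
print(len(all_comments), len(top_level))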
Test 2 - Data Pull¶
# Assign subreddits again for this test
subreddits = [reddit.subreddit("Liberal"),reddit.subreddit("Conservative"),reddit.subreddit("NeutralPolitics")]
# Grab IDs for 100 stories in each subreddit (the "hot" 100)
story_ids = {}
for subreddit in subreddits:
    for submission in subreddit.hot(limit=100):
        # Loop through comment hierarchy and resolve all levels
        while True:
            try:
                submission.comments.replace_more(limit=None)  # Remove 'show more comments'
                break
            except Exception:  # Broad retry; narrow this to the specific PRAW exception you expect
                print("Handling replace_more exception")
                sleep(1)
        # Pull only top-level comments
        story_ids[submission.id] = [subreddit.display_name, submission.title, [top_level_comment for top_level_comment in submission.comments]]  # Put in dictionary
# These ID's allow us to pull the sentiment for each individual story per subreddit
for key, value in story_ids.items():
print(key, '->', value)
Sample output (truncated due to length):
mrmrmn -> ['Liberal', 'Nancy Pelosi Reveals Mitch McConnell Blocked Ruth Bader Ginsburg From Getting Capitol Rotunda Memorial', [Comment(id='guni694'), Comment(id='gun9kzd'), Comment(id='gun6g67'), Comment(id='gunrb5k'), Comment(id='guo43lv'), Comment(id='gun9pts'), Comment(id='gunzaij'), Comment(id='guo42tq'), Comment(id='gunx3hm'), Comment(id='guntl7b'), Comment(id='guo4i4j'), Comment(id='guo8isc'), Comment(id='guo51mi')]] mre2ul -> ['Liberal', 'US jobless claims plunge to 576,000, lowest since pandemic', [Comment(id='gult7hk'), Comment(id='guls41x'), Comment(id='gulldj6'), Comment(id='gunkas3'), Comment(id='gulu5i6')]] mrbjf5 -> ['Liberal', 'Democrats to introduce legislation to expand Supreme Court', [Comment(id='gum7bf4'), Comment(id='gum2scd'), Comment(id='gummtcu'), Comment(id='gum40vp'), Comment(id='gultr3g'), Comment(id='gulykpg'), Comment(id='guls3mm'), Comment(id='gune3t6'), Comment(id='gunk109'), Comment(id='gunts3x'), Comment(id='guo4vo6'), Comment(id='gumx8eh')]] mrqzy2 -> ['Liberal', 'Nice! Schumer lays groundwork for future filibuster reform', []] mrqbbo -> ['Liberal', 'Biden: ‘If Russia continues to interfere with our democracy, I’m prepared to take further actions’', []] mrf49t -> ['Liberal', 'Biden Administration to Impose Tough Sanctions on Russia', [Comment(id='gum3jcu')]] mrqjr4 -> ['Liberal', "WATCH: Maxine Waters erupts at Jim Jordan and tells him to 'respect the chair and shut your mouth' during COVID-19 hearing", [Comment(id='guo6h7f'), Comment(id='gunvd63')]]
# Dump comments into a pandas dataframe
df_rows = []
for key, value in story_ids.items():
# Loop through comment lists in value[2]
for comment in value[2]:
df_rows.append([key, value[0], value[1], comment.id, comment.score, comment.created, comment.body])
df = pd.DataFrame(df_rows, columns=['Post ID', 'Subreddit', 'Post Title', 'Comment ID', 'Score', 'Created', 'Body'])
df = df[~df['Body'].isin(['[deleted]', '[removed]'])].reset_index(drop=True)  # Exclude any deleted/removed comments
df.head()
# Very limited set now at 2890 comments, compared to 18k
df.info()
Test 2 - EDA¶
- Analyze all top-level comments from 100 stories for sentiment, grouped by subreddit.
- Then, analyze all top-level comments from 100 stories for sentiment, group by subreddit AND story ID (then aggregate).
# Gather sentiment for every comment and dump in a list
custom_tokens = []
results = []
for index, row in df.iterrows():
    custom_tokens.append(remove_noise(tweet_tknzr.tokenize(row['Body'])))
    results.append([row['Post ID'], row['Subreddit'], row['Body'],
                    classifier.classify(dict([token, True] for token in custom_tokens[index]))])
# Analyze sentiment gathered for top-level comments on all stories by subreddit
# Display percentage
df_r = pd.DataFrame([result[1:] for result in results], columns=['Subreddit','Body','Sentiment'])
df_r_p = df_r[['Subreddit','Sentiment']].pivot_table(index='Subreddit', columns='Sentiment', aggfunc=len, fill_value=0)
df_r_p['% Negative'] = round((df_r_p['Negative'] / (df_r_p['Positive'] + df_r_p['Negative'])) * 100,2)
df_r_p['% Positive'] = round((df_r_p['Positive'] / (df_r_p['Positive'] + df_r_p['Negative'])) * 100,2)
df_r_p
# Analyze sentiment gathered for top-level comments on all stories by subreddit AND story ID
# Display percentage
df_r = pd.DataFrame(results, columns=['Post ID','Subreddit','Body','Sentiment'])
df_r_p = df_r[['Post ID','Subreddit','Sentiment']].pivot_table(index=['Subreddit','Post ID'], columns='Sentiment', aggfunc=len, fill_value=0)
df_r_p['% Negative'] = round((df_r_p['Negative'] / (df_r_p['Positive'] + df_r_p['Negative'])) * 100,2)
df_r_p['% Positive'] = round((df_r_p['Positive'] / (df_r_p['Positive'] + df_r_p['Negative'])) * 100,2)
df_r_p
# Group by Subreddit and take average of averages
df_r_p.groupby('Subreddit').agg({'Negative':'sum',
'Positive':'sum',
'% Negative':'mean',
'% Positive':'mean'}).round(decimals=2)
Test 2 - Conclusion¶
Surprisingly (at least to me), taking into account only top-level comments shows a huge move towards positive for NeutralPolitics. This seems to indicate that discussion within the subreddit is quite balanced, but initial top-level responses are biased towards positivity.
Test 3¶
- Pull the same story from two ideologically different subreddits
- Run sentiment analysis on both sets
- Compare sentiment percentage from each
In this case, I'll take a look at two different posts covering the same topic: one within the Conservative subreddit and one within the Liberal subreddit. This should at least give us an idea of how right vs. left respond differently or similarly to the same story.
The story in question details the ban of the subreddit "The_Donald", a far-right subreddit that was the center of much controversy. You would assume that the conservative-leaning subreddit would produce more negative sentiment about the ban, while the liberal-leaning subreddit would produce more positive sentiment.
Test 3 - Algorithms¶
def returnAllComments(submission):
    '''
    Return all comments from a given submission as a DataFrame
    '''
    df_rows = []
    # Loop through comment hierarchy and pull all levels
    while True:
        try:
            submission.comments.replace_more(limit=None)  # Flatten comment tree
            break
        except Exception:  # Broad retry; narrow this to the specific PRAW exception you expect
            print("Handling replace_more exception")
            sleep(1)
    comments = submission.comments.list()  # Move to a flat list
    # Loop through comments
    for comment in comments:
        df_rows.append([comment.id, comment.score, comment.created, comment.body])
    df = pd.DataFrame(df_rows, columns=['Comment ID', 'Score', 'Created', 'Body'])
    return df[~df['Body'].isin(['[deleted]', '[removed]'])].reset_index(drop=True)  # Exclude any deleted/removed comments
def runSingleStorySentimentAnalysis(df):
    '''
    Run sentiment analysis on a given DataFrame and return a summary table
    '''
    # Gather sentiment for every comment and dump in a list
    custom_tokens = []
    results = []
    for index, row in df.iterrows():
        custom_tokens.append(remove_noise(tweet_tknzr.tokenize(row['Body'])))
        results.append([classifier.classify(dict([token, True] for token in custom_tokens[index]))])
    # Display counts and percentages
    df_r = pd.DataFrame(results, columns=['Sentiment'])
    df_r_p = df_r.groupby('Sentiment').size().to_frame('Count')
    df_r_p['% of Total'] = round((df_r_p['Count'] / df_r_p['Count'].sum()) * 100, 2)
    return df_r_p
def returnComments(df):
    '''
    Run sentiment analysis on a given DataFrame and return detailed results for review
    '''
    # Gather sentiment for every comment and dump in a list
    custom_tokens = []
    results = []
    for index, row in df.iterrows():
        custom_tokens.append(remove_noise(tweet_tknzr.tokenize(row['Body'])))
        results.append([row['Body'], classifier.classify(dict([token, True] for token in custom_tokens[index]))])
    return results
Test 3 - Conservative Subreddit¶
# https://www.reddit.com/r/Conservative/comments/hi3u79/reddit_bans_the_donald_forum_as_part_of_major/
submission = reddit.submission(id="hi3u79")
print(submission.title)
df = returnAllComments(submission)
df
runSingleStorySentimentAnalysis(df)
returnComments(df)
Sample output (truncated due to length):
[['R/consumeproduct was banned as well. RIP', 'Negative'], ['It’s been inactive for months lol. Good work reddit. You effectively did nothing.', 'Negative'], ['Lol the press will focus on this matter and mention r/the_donald only because they can up the antitrump ante. The sub was effectively a dead stump of its former self, so its like a beating the dead horse, but well you can always milk it for antitrump hate.', 'Negative'], ['Yet r/politics remains...', 'Negative'], ['I thought this was already taken down?', 'Negative'], ['Major “you don’t fit that narrative” purge. 1984', 'Negative'], ['Official reddit policy now *explicitly* allows hate subreddits/comments/posts against identities in the "majority."', 'Positive'], ['"hate speech"? Someone misspelled "political speech"', 'Negative'], ["Let them ban all the right leaning subs. And then still be shocked when Trump wins the election again.\n\nI cam't wait for the left crying compilation", 'Negative'],
Interestingly enough, the sentiment does lean negative, but only by about 10 percentage points.
Test 3 - Liberal Subreddit¶
#https://www.reddit.com/r/Liberal/comments/hi4eva/reddit_finally_bans_hate_speech_removes_2000/
submission = reddit.submission(id="hi4eva")
print(submission.title)
df = returnAllComments(submission)
df
runSingleStorySentimentAnalysis(df)
# Show details of how comments were classified
returnComments(df)
Sample output (truncated due to length):
[['Except the Canadian racist cesspool r/metacanada.\n\nStill waiting.......', 'Negative'], ['Republicans: "Businesses should be allowed to reject any customer under any circumstances!" \n \nAlso Republicans: "Reddit shouldn\'t be allowed to reject any customer under any circumstances!"', 'Positive'], ['Genuinely curious what the general consensus is about the part that says it does not protect the majority. Seems discriminatory.\n\n>While the rule on hate protects such groups, it does not protect all groups or all forms of identity. For example, the rule does not protect groups of people who are in the majority', 'Negative'], ['About fucking time.', 'Negative'], ['Words cannot express how fuckin happy I am right now!', 'Negative'], ['Did they ban all the subscribers too or just the reddit?', 'Negative'], ['This isn\'t to all of you, but it\'s to enough of you...\n\nWhat disabled person asked anyone to defend us from words? I don\'t even remember electing a representative for that... Trying to "protect" me from words is actually pretty rude.
Test 3 - Conclusion¶
In this instance, we see a single story producing mostly negative comments within both subreddits. This result shows that sentiment is highly subjective, and that you cannot simply infer that something will be seen as more positive or negative just because of a community's political leanings. It also shows, once again, that sentiment is very difficult to judge without context. Although many negative comments from the Liberal subreddit are based on the view that censorship is wrong, many others are classified as negative when the comment is in fact agreeing with the ban; they are marked negative because of their wording rather than their meaning. Either way, in this case, there still seems to be a lot of common ground among users of both subreddits.