A recent project of mine on Natural Language Processing, embedded below as a Jupyter Notebook.

Sentiment Analysis of User Comments on Political Subreddits

Author: Mike Kale

For this project I'll be analyzing the sentiment of user comments on the social news website Reddit.com. The goal is to analyze how the sentiment of user comments is affected by political ideology.

Test #1

  1. Pull all comments from 100 stories from three different political subreddits (Liberal, Conservative, NeutralPolitics)
  2. Run sentiment analysis on all three sets
  3. Compare sentiment percentage from each

Test #2

  1. Pull top-level comments from 100 stories from three different political subreddits (Liberal, Conservative, NeutralPolitics)
  2. Run sentiment analysis on all three sets
  3. Compare sentiment percentage from each

Test #3

  1. Pull the same story from two ideologically different subreddits
  2. Run sentiment analysis on both sets
  3. Compare sentiment percentage from each

Some algorithms are adapted from a helpful tutorial found here: www.stackovercloud.com/2019/09/27/how-to-perform-sentiment-analysis-in-python-3-using-the-natural-language-toolkit-nltk/

Reddit Connection

In [1]:
# Connection to Reddit's API
import pandas as pd
import praw

# I've had to clear this section out since it contains personal information.
# To run this code, you would need to register a new application at Reddit.com,
#  which provides the API credentials to input below.
reddit = praw.Reddit(
    client_id="<fill me out>",
    client_secret="<fill me out>",
    user_agent="SentimentTest/0.0.1",
)

print(reddit.read_only)
# Output: True if successful connection
True
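
As an aside, rather than pasting keys directly into the cell, a safer pattern is to read them from environment variables. A minimal sketch of that approach - the variable names here are my own, not anything Reddit or PRAW mandates:

import os
import praw

# Credentials come from the environment instead of being hardcoded
#  (REDDIT_CLIENT_ID / REDDIT_CLIENT_SECRET are hypothetical names)
reddit = praw.Reddit(
    client_id=os.environ["REDDIT_CLIENT_ID"],
    client_secret=os.environ["REDDIT_CLIENT_SECRET"],
    user_agent="SentimentTest/0.0.1",
)
print(reddit.read_only)  # True: with no username/password, the instance is read-only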

NLP / Sentiment Analysis Model

The following algorithm cleanses the comments into a format that can be classified by an NLP model.

In [2]:
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import twitter_samples, stopwords
from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize, TweetTokenizer
from nltk import FreqDist, classify, NaiveBayesClassifier

import re, string, random

tweet_tknzr = TweetTokenizer() #Use this over word_tokenize because of contractions

def remove_noise(tokens):
    """
    1. Strip 'noise' such as URLs and @mentions before analysis
    2. Lemmatize/normalize like words
    3. Remove English stop words and punctuation
    """
    cleaned_tokens = []
    stop_words = stopwords.words('english')
    lemmatizer = WordNetLemmatizer() #Create once, reuse for every token

    for token, tag in pos_tag(tokens):
        token = re.sub('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+#]|[!*\(\),]|'\
                       '(?:%[0-9a-fA-F][0-9a-fA-F]))+','', token)
        token = re.sub("(@[A-Za-z0-9_]+)","", token)

        #Map the Penn Treebank tag onto the WordNet tag the lemmatizer expects
        if tag.startswith("NN"):
            pos = 'n'
        elif tag.startswith('VB'):
            pos = 'v'
        else:
            pos = 'a'

        token = lemmatizer.lemmatize(token, pos)

        if len(token) > 0 and token not in string.punctuation and token.lower() not in stop_words:
            cleaned_tokens.append(token.lower())
    return cleaned_tokens

def get_all_words(cleaned_tokens_list):
    # Return all words from cleaned token list
    for tokens in cleaned_tokens_list:
        for token in tokens:
            yield token

def get_tokens_for_model(cleaned_tokens_list):
    # Return all tokens for model classification
    for tokens in cleaned_tokens_list:
        yield dict([token, True] for token in tokens)
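
Note that all of the above assumes the relevant NLTK data packages are already installed. If you're running this fresh, a one-time setup along these lines should work (these are NLTK's standard resource identifiers), followed by a quick sanity check of the cleaning function:

import nltk

# One-time downloads for the corpora and models used above
nltk.download('twitter_samples')             # labeled tweets used for training below
nltk.download('stopwords')                   # English stop word list
nltk.download('wordnet')                     # dictionary behind WordNetLemmatizer
nltk.download('averaged_perceptron_tagger')  # POS tagger behind pos_tag
nltk.download('punkt')                       # tokenizer models for word_tokenize

# Sanity check: the URL and @mention are stripped, stop words dropped,
#  and remaining tokens lemmatized and lowercased
print(remove_noise(tweet_tknzr.tokenize("I loved this article! https://example.com @someone")))
# Expect roughly: ['love', 'article']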

Train Model

In [3]:
"""
Utilize Positive / Negative tweets that come packaged with
the NLTK to help train the model for classification. This consists
of a total of 14,000 tweets - 7000 positive and 7000 negative.
"""

positive_tweet_tokens = twitter_samples.tokenized('positive_tweets.json')
negative_tweet_tokens = twitter_samples.tokenized('negative_tweets.json')

positive_cleaned_tokens_list = []
negative_cleaned_tokens_list = []

#Run through cleaning algorithm
for tokens in positive_tweet_tokens:
    positive_cleaned_tokens_list.append(remove_noise(tokens))

#Run through cleaning algorithm
for tokens in negative_tweet_tokens:
    negative_cleaned_tokens_list.append(remove_noise(tokens))

#Test to show words leftover
#all_pos_words = get_all_words(positive_cleaned_tokens_list)

# Return tokens for model
positive_tokens_for_model = get_tokens_for_model(positive_cleaned_tokens_list)
negative_tokens_for_model = get_tokens_for_model(negative_cleaned_tokens_list)

# Form dataset
positive_dataset = [(tweet_dict, "Positive")
                     for tweet_dict in positive_tokens_for_model]

negative_dataset = [(tweet_dict, "Negative")
                     for tweet_dict in negative_tokens_for_model]

dataset = positive_dataset + negative_dataset

# Shuffle to randomize
random.shuffle(dataset)

# Break into training / testing datasets (7,000 train / 3,000 test)
train_data = dataset[:7000]
test_data = dataset[7000:]

# Form classifier model
classifier = NaiveBayesClassifier.train(train_data)

# Print accuracy based on test data
print("Accuracy is:", classify.accuracy(classifier, test_data))
print(classifier.show_most_informative_features(10))
Accuracy is: 0.9956666666666667
Most Informative Features
                      :( = True           Negati : Positi =   2088.6 : 1.0
                      :) = True           Positi : Negati =   1630.9 : 1.0
                follower = True           Positi : Negati =     35.2 : 1.0
                followed = True           Negati : Positi =     28.7 : 1.0
                     bam = True           Positi : Negati =     20.0 : 1.0
                     sad = True           Negati : Positi =     19.7 : 1.0
                     x15 = True           Negati : Positi =     16.6 : 1.0
                    miss = True           Negati : Positi =     16.0 : 1.0
                      aw = True           Negati : Positi =     13.2 : 1.0
                  arrive = True           Positi : Negati =     13.1 : 1.0
None

With an accuracy of over 99.5% on the held-out tweets, I'm quite confident in the model - though tweets differ from Reddit comments, so accuracy on the actual comment data may be lower.

Quick Manual Example of Model

In [4]:
results = []

sample_1 = 'I really like this idea, good job Steve!'
sample_2 = 'This is the worst idea I have ever heard of, stop talking to me!'

custom_tokens_1 = remove_noise(tweet_tknzr.tokenize(sample_1))
custom_tokens_2 = remove_noise(tweet_tknzr.tokenize(sample_2))

classifier.classify(dict([token, True] for token in custom_tokens_1))
Out[4]:
'Positive'
In [5]:
classifier.classify(dict([token, True] for token in custom_tokens_2))
Out[5]:
'Negative'

Test 1 - All Comment Analysis

  1. Pull all comments from 100 stories from three different political subreddits (Liberal, Conservative, NeutralPolitics)
  2. Run sentiment analysis on all three sets
  3. Compare sentiment percentage from each

Test 1 - Data Pull

In [ ]:
# Assign subreddits and print out some basic info to test
subreddits = [reddit.subreddit("Liberal"),reddit.subreddit("Conservative"),reddit.subreddit("NeutralPolitics")]

for subreddit in subreddits:
    print(subreddit.display_name)
    print(subreddit.description)

Sample output (truncated due to length):

Liberal
**Welcome to /r/Liberal!**

**Submission Guidelines**

* Do not submit pictures
* Do not submit videos
* Do not submit memes...
In [ ]:
# Grab ID's for 100 stories in each of the three subreddits (the "hot" 100)
from time import sleep #Needed for the retry backoff below

story_ids = {}

for subreddit in subreddits:
    for submission in subreddit.hot(limit=100):

        #Loop through comment hierarchy; retry if the API call fails
        while True:
            try:
                submission.comments.replace_more(limit=None) #Remove 'show more comments'
                break
            except Exception: #The PRAW docs use a PossibleExceptions placeholder here
                print("Handling replace_more exception")
                sleep(1)

        story_ids[submission.id] = [subreddit.display_name,submission.title,submission.comments.list()] #Put comments in dictionary
        
# These ID's allow us to pull the sentiment for each individual story per subreddit
for key, value in story_ids.items():
     print(key, '->', value)

Sample output (truncated due to length):

mrmrmn -> ['Liberal', 'Nancy Pelosi Reveals Mitch McConnell Blocked Ruth Bader Ginsburg From Getting Capitol Rotunda Memorial', [Comment(id='guni694'), Comment(id='gun9kzd'), Comment(id='gun6g67'), Comment(id='gunrb5k'), Comment(id='guo43lv'), Comment(id='gun9pts'), Comment(id='gunzaij'), Comment(id='gunx3hm'), Comment(id='guo42tq'), Comment(id='guntl7b'), Comment(id='guo4i4j'), Comment(id='guo8isc'), Comment(id='guo51mi'), Comment(id='gunifpa'), Comment(id='gunp802'), Comment(id='gunrykp'), Comment(id='guoebi5'), Comment(id='guniiqg')]]
mre2ul -> ['Liberal', 'US jobless claims plunge to 576,000, lowest since pandemic', [Comment(id='gult7hk'), Comment(id='guls41x'), Comment(id='gulldj6'), Comment(id='gunkas3'), Comment(id='gulu5i6'), Comment(id='gum344z'), Comment(id='gunbvb6'), Comment(id='gunkifd'), Comment(id='gunlj54'), Comment(id='gulv1lr'), Comment(id='gum4zjx'), Comment(id='gum21x4'), Comment(id='gumk6bn'), Comment(id='gumbkei'), Comment(id='guluu0u'), Comment(id='gulunt8'), Comment(id='gum8l81'), Comment(id='gumtfmq'), Comment(id='gum2ol0'), Comment(id='guluw2o')]]
In [8]:
# Dump comments into a pandas dataframe
df_rows = []

for key, value in story_ids.items():
    # Loop through comment lists in value[2]
    for comment in value[2]:
        df_rows.append([key, value[0], value[1], comment.id, comment.score, comment.created, comment.body])

df = pd.DataFrame(df_rows, columns=['Post ID', 'Subreddit', 'Post Title', 'Comment ID', 'Score', 'Created', 'Body'])
df = df[~df['Body'].isin(['[deleted]', '[removed]'])].reset_index(drop=True) #Exclude any deleted/removed comments
df.head()
Out[8]:
Post ID Subreddit Post Title Comment ID Score Created Body
0 mrmrmn Liberal Nancy Pelosi Reveals Mitch McConnell Blocked R... guni694 16 1.618551e+09 What?! Holy shit!
1 mrmrmn Liberal Nancy Pelosi Reveals Mitch McConnell Blocked R... gun9kzd 49 1.618547e+09 Does it really need to be revealed that Mitch ...
2 mrmrmn Liberal Nancy Pelosi Reveals Mitch McConnell Blocked R... gun6g67 26 1.618546e+09 Just horrible.\n\n\n\n>“Mitch McConnell is not...
3 mrmrmn Liberal Nancy Pelosi Reveals Mitch McConnell Blocked R... gunrb5k 9 1.618555e+09 Because of Course he did.
4 mrmrmn Liberal Nancy Pelosi Reveals Mitch McConnell Blocked R... guo43lv 7 1.618562e+09 when fuck knuckle McConnell dies, please make ...
In [9]:
# Here we can see we have a total of ~18k comments
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17921 entries, 0 to 17920
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Post ID     17921 non-null  object 
 1   Subreddit   17921 non-null  object 
 2   Post Title  17921 non-null  object 
 3   Comment ID  17921 non-null  object 
 4   Score       17921 non-null  int64  
 5   Created     17921 non-null  float64
 6   Body        17921 non-null  object 
dtypes: float64(1), int64(1), object(5)
memory usage: 980.2+ KB

Test 1 - EDA

  1. Analyze all comments from 100 stories for sentiment, grouped by subreddit.
  2. Then, analyze all comments from 100 stories for sentiment, group by subreddit AND story ID (then aggregate).
In [10]:
# Gather sentiment for every comment and dump in list

custom_tokens = []
results = []

for index, row in df.iterrows():
    custom_tokens.append(remove_noise(tweet_tknzr.tokenize(row['Body'])))
    results.append([row['Post ID'], row['Subreddit'], row['Body'], classifier.classify(dict([token, True] for token in custom_tokens[index]))])
In [11]:
# Analyze sentiment gathered for all comments on all stories by subreddit
# Display percentage

df_r = pd.DataFrame([result[1:] for result in results], columns=['Subreddit','Body','Sentiment'])
df_r_p = df_r[['Subreddit','Sentiment']].pivot_table(index='Subreddit', columns='Sentiment', aggfunc=len, fill_value=0)
df_r_p['% Negative'] = round((df_r_p['Negative'] / (df_r_p['Positive'] + df_r_p['Negative'])) * 100,2)
df_r_p['% Positive'] = round((df_r_p['Positive'] / (df_r_p['Positive'] + df_r_p['Negative'])) * 100,2)
df_r_p
Out[11]:
Sentiment Negative Positive % Negative % Positive
Subreddit
Conservative 3092 2862 51.93 48.07
Liberal 662 599 52.50 47.50
NeutralPolitics 5433 5273 50.75 49.25
In [ ]:
# Sampling of Pos/Neg Comments
df_r[df_r['Sentiment'] == 'Negative']['Body'].tolist() #Negative

Sample output (truncated due to length):

['What?! Holy shit!',
 'Does it really need to be revealed that Mitch McConnell is a cunt ?',
 'when fuck knuckle McConnell dies, please make sure that his coffin is proudly displayed in a confederate flag draped dumpster',
 'And yet Democrats still pretend they can work with these political arsonists.',
 'Dick',
 'Nancy is a gross too.',
 "Just horrible.  I wasn't aware of this.",
 'No, but it needs to be said.  Repeatedly.',
 'Nope.\n\n\nThe cumulative analysis of everything he has done and said to hurt everyone in the USA?\n\n\nAnd the support McConnell has shown disgraced insurrectionist and delusional one term former President Trump __(the only President to be impeached twice)__ is unfathomable.\n\n\n\n.',
In [ ]:
df_r[df_r['Sentiment'] == 'Positive']['Body'].tolist() #Positive

Sample output (truncated due to length):

['Just horrible.\n\n\n\n>“Mitch McConnell is not a force for good in our country,” Pelosi told me. “He is an enabler of some of the worst stuff, and an instigator of some of it on his own.” The two congressional leaders had never had a particularly good relationship. Now there was bitterness from a new dispute between them, one not reported at the time. When Supreme Court justice Ruth Bader Ginsburg died in September, Pelosi proposed that the groundbreaking feminist lie in state in the Capitol Rotunda. She would have been the first woman in history to be so honored.\n\n>McConnell rejected the idea on the grounds that there was no precedent for such treatment of a justice. When William Howard Taft had lain in state in 1930, he had been not only the chief justice but also president, McConnell noted.\n\n>He wasn’t swayed by the argument that Ginsburg had achieved an iconic status in American culture, especially for women and girls. McConnell’s refusal meant that Ginsburg’s flag-draped coffin was placed not in the Rotunda, which connects the House and Senate, but in Statuary Hall, on the House side.\n\n>McConnell and House Republican leader Kevin McCarthy didn’t accept invitations to attend the service for her.',
 'Because of Course he did.',

Although we see a good breakdown across all comments, this doesn't account for the fact that a single story could contain all positive or all negative comments - perhaps a better method of analysis would be to group the stories by ID first and then obtain the percentages.
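
To see why the two aggregations can disagree, here's a toy illustration with entirely made-up counts: one huge thread dominates the pooled percentage, while the per-story average weights every story equally.

import pandas as pd

# Hypothetical counts: one huge, mostly negative thread plus two small, mostly positive ones
toy = pd.DataFrame({
    'Post ID':  ['a', 'b', 'c'],
    'Negative': [900, 1, 2],
    'Positive': [100, 9, 8],
})

# Pooled: every comment weighted equally, so story 'a' dominates
pooled = toy['Negative'].sum() / (toy['Negative'].sum() + toy['Positive'].sum())

# Per-story average: every story weighted equally
per_story = (toy['Negative'] / (toy['Negative'] + toy['Positive'])).mean()

print(f"Pooled % negative:    {pooled:.1%}")     # ~88.5%
print(f"Per-story % negative: {per_story:.1%}")  # 40.0%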

In [14]:
# Analyze sentiment gathered for all comments on all stories by subreddit AND story ID
# Display percentage

df_r = pd.DataFrame(results, columns=['Post ID','Subreddit','Body','Sentiment'])
df_r_p = df_r[['Post ID','Subreddit','Sentiment']].pivot_table(index=['Subreddit','Post ID'], columns='Sentiment', aggfunc=len, fill_value=0)
df_r_p['% Negative'] = round((df_r_p['Negative'] / (df_r_p['Positive'] + df_r_p['Negative'])) * 100,2)
df_r_p['% Positive'] = round((df_r_p['Positive'] / (df_r_p['Positive'] + df_r_p['Negative'])) * 100,2)
df_r_p
Out[14]:
Sentiment Negative Positive % Negative % Positive
Subreddit Post ID
Conservative mr364d 536 611 46.73 53.27
mr4lz8 39 35 52.70 47.30
mr5564 201 228 46.85 53.15
mrbef8 54 55 49.54 50.46
mrbpev 57 36 61.29 38.71
... ... ... ... ... ...
NeutralPolitics mgyhyz 2 6 25.00 75.00
mm3oyu 197 124 61.37 38.63
mm4eaw 2 2 50.00 50.00
mmuprh 9 7 56.25 43.75
mpaobl 11 6 64.71 35.29

285 rows × 4 columns

In [15]:
# Group by Subreddit and take average of averages
df_r_p.groupby('Subreddit').agg({'Negative':'sum', 
                         'Positive':'sum', 
                         '% Negative':'mean', 
                         '% Positive':'mean'}).round(decimals=2)
Out[15]:
Negative Positive % Negative % Positive
Subreddit
Conservative 3092 2862 52.67 47.33
Liberal 662 599 49.19 50.81
NeutralPolitics 5433 5273 47.95 52.05

Test 1 - Conclusion

In testing through several iterations, I've found that the top 100 stories tend to lean fairly consistently negative (sometimes more than others).

The problem with taking a percentage grouped only by subreddit is that it ignores the story IDs, and may therefore produce skewed numbers when a single story is all positive or all negative.

We can try to account for this by including the story ID in the groupby and taking an average of averages. Doing so highlights that the Conservative subreddit tends to skew more negative, while the Liberal and NeutralPolitics subreddits sit closer to the middle (although we're dealing in very small percentages here).

Put another way: for every negative comment posted on a given story in Liberal/NeutralPolitics, there appears to be at least one positive comment as well, once we normalize by grouping by Post ID first.

However, when reviewing what the model claims is positive or negative, we can easily pick out quite a few comments that most people would have put in the negative column instead of the positive one. This may cut both ways, but I suspect the negative skew should be even higher. With time to train on additional data, we might produce more accurate results.

Test 2 - Top-level Analysis Only

With all of these data pulls we've been assuming that we should pull the entirety of all comments on a story, NOT only the top-level comments. One problem with that approach is that we may be pulling long threads of discussion in which folks argue back and forth, skewing things negative. If we pull only the top-level comments, perhaps we can get a better sense of each user's initial reaction (the sketch after the list below shows the distinction in PRAW).

  1. Analyze top-level comments from 100 stories for sentiment, grouped by subreddit.
  2. Then, analyze top-level comments from 100 stories for sentiment, group by subreddit AND story ID (then aggregate).
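
The difference between the two pulls comes down to how PRAW exposes the comment forest: iterating submission.comments directly yields only the top-level comments, while submission.comments.list() flattens the entire tree. A minimal sketch of the distinction, reusing a post ID from the Test 1 sample output:

# Assumes the authenticated `reddit` instance from earlier
submission = reddit.submission(id="mrmrmn")   # example post from the Test 1 pull
submission.comments.replace_more(limit=None)  # resolve 'load more comments' stubs first

top_level = list(submission.comments)      # direct replies to the post only
all_comments = submission.comments.list()  # every comment in the tree, flattened

print(len(top_level), "top-level of", len(all_comments), "total comments")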

Test 2 - Data Pull

In [ ]:
# Assign subreddits and print out some basic info to test
subreddits = [reddit.subreddit("Liberal"),reddit.subreddit("Conservative"),reddit.subreddit("NeutralPolitics")]

# Grab ID's for 100 stories in each subreddit (the "hot" 100)
story_ids = {}

for subreddit in subreddits:
    for submission in subreddit.hot(limit=100):

        #Loop through comment hierarchy and pull all levels; retry if the API call fails
        while True:
            try:
                submission.comments.replace_more(limit=None) #Remove 'show more comments'
                break
            except Exception: #The PRAW docs use a PossibleExceptions placeholder here
                print("Handling replace_more exception")
                sleep(1)

        #Pull only top-level comments
        story_ids[submission.id] = [subreddit.display_name,submission.title,[top_level_comment for top_level_comment in submission.comments]] #Put in dictionary
        
# These ID's allow us to pull the sentiment for each individual story per subreddit
for key, value in story_ids.items():
     print(key, '->', value)

Sample output (truncated due to length):

mrmrmn -> ['Liberal', 'Nancy Pelosi Reveals Mitch McConnell Blocked Ruth Bader Ginsburg From Getting Capitol Rotunda Memorial', [Comment(id='guni694'), Comment(id='gun9kzd'), Comment(id='gun6g67'), Comment(id='gunrb5k'), Comment(id='guo43lv'), Comment(id='gun9pts'), Comment(id='gunzaij'), Comment(id='guo42tq'), Comment(id='gunx3hm'), Comment(id='guntl7b'), Comment(id='guo4i4j'), Comment(id='guo8isc'), Comment(id='guo51mi')]]
mre2ul -> ['Liberal', 'US jobless claims plunge to 576,000, lowest since pandemic', [Comment(id='gult7hk'), Comment(id='guls41x'), Comment(id='gulldj6'), Comment(id='gunkas3'), Comment(id='gulu5i6')]]
mrbjf5 -> ['Liberal', 'Democrats to introduce legislation to expand Supreme Court', [Comment(id='gum7bf4'), Comment(id='gum2scd'), Comment(id='gummtcu'), Comment(id='gum40vp'), Comment(id='gultr3g'), Comment(id='gulykpg'), Comment(id='guls3mm'), Comment(id='gune3t6'), Comment(id='gunk109'), Comment(id='gunts3x'), Comment(id='guo4vo6'), Comment(id='gumx8eh')]]
mrqzy2 -> ['Liberal', 'Nice! Schumer lays groundwork for future filibuster reform', []]
mrqbbo -> ['Liberal', 'Biden: ‘If Russia continues to interfere with our democracy, I’m prepared to take further actions’', []]
mrf49t -> ['Liberal', 'Biden Administration to Impose Tough Sanctions on Russia', [Comment(id='gum3jcu')]]
mrqjr4 -> ['Liberal', "WATCH: Maxine Waters erupts at Jim Jordan and tells him to 'respect the chair and shut your mouth' during COVID-19 hearing", [Comment(id='guo6h7f'), Comment(id='gunvd63')]]
In [17]:
# Dump comments into a pandas dataframe
df_rows = []

for key, value in story_ids.items():
    # Loop through comment lists in value[2]
    for comment in value[2]:
        df_rows.append([key, value[0], value[1], comment.id, comment.score, comment.created, comment.body])

df = pd.DataFrame(df_rows, columns=['Post ID', 'Subreddit', 'Post Title', 'Comment ID', 'Score', 'Created', 'Body'])
df = df[~df['Body'].isin(['[deleted]', '[removed]'])].reset_index(drop=True) #Exclude any deleted/removed comments
df.head()
Out[17]:
Post ID Subreddit Post Title Comment ID Score Created Body
0 mrmrmn Liberal Nancy Pelosi Reveals Mitch McConnell Blocked R... guni694 17 1.618551e+09 What?! Holy shit!
1 mrmrmn Liberal Nancy Pelosi Reveals Mitch McConnell Blocked R... gun9kzd 50 1.618547e+09 Does it really need to be revealed that Mitch ...
2 mrmrmn Liberal Nancy Pelosi Reveals Mitch McConnell Blocked R... gun6g67 27 1.618546e+09 Just horrible.\n\n\n\n>“Mitch McConnell is not...
3 mrmrmn Liberal Nancy Pelosi Reveals Mitch McConnell Blocked R... gunrb5k 7 1.618555e+09 Because of Course he did.
4 mrmrmn Liberal Nancy Pelosi Reveals Mitch McConnell Blocked R... guo43lv 6 1.618562e+09 when fuck knuckle McConnell dies, please make ...
In [18]:
# Very limited set now at 2890 comments, compared to 18k
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2890 entries, 0 to 2889
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Post ID     2890 non-null   object 
 1   Subreddit   2890 non-null   object 
 2   Post Title  2890 non-null   object 
 3   Comment ID  2890 non-null   object 
 4   Score       2890 non-null   int64  
 5   Created     2890 non-null   float64
 6   Body        2890 non-null   object 
dtypes: float64(1), int64(1), object(5)
memory usage: 158.2+ KB

Test 2 - EDA

  1. Analyze all top-level comments from 100 stories for sentiment, grouped by subreddit.
  2. Then, analyze all top-level comments from 100 stories for sentiment, group by subreddit AND story ID (then aggregate).
In [19]:
# Gather sentiment for every comment and dump in list

custom_tokens = []
results = []

for index, row in df.iterrows():
    custom_tokens.append(remove_noise(tweet_tknzr.tokenize(row['Body'])))
    results.append([row['Post ID'], row['Subreddit'], row['Body'], classifier.classify(dict([token, True] for token in custom_tokens[index]))])
In [20]:
# Analyze sentiment gathered for top-level comments on all stories by subreddit
# Display percentage

df_r = pd.DataFrame([result[1:] for result in results], columns=['Subreddit','Body','Sentiment'])
df_r_p = df_r[['Subreddit','Sentiment']].pivot_table(index='Subreddit', columns='Sentiment', aggfunc=len, fill_value=0)
df_r_p['% Negative'] = round((df_r_p['Negative'] / (df_r_p['Positive'] + df_r_p['Negative'])) * 100,2)
df_r_p['% Positive'] = round((df_r_p['Positive'] / (df_r_p['Positive'] + df_r_p['Negative'])) * 100,2)
df_r_p
Out[20]:
Sentiment Negative Positive % Negative % Positive
Subreddit
Conservative 866 789 52.33 47.67
Liberal 261 227 53.48 46.52
NeutralPolitics 238 509 31.86 68.14
In [21]:
# Analyze sentiment gathered for top-level comments on all stories by subreddit AND story ID
# Display percentage

df_r = pd.DataFrame(results, columns=['Post ID','Subreddit','Body','Sentiment'])
df_r_p = df_r[['Post ID','Subreddit','Sentiment']].pivot_table(index=['Subreddit','Post ID'], columns='Sentiment', aggfunc=len, fill_value=0)
df_r_p['% Negative'] = round((df_r_p['Negative'] / (df_r_p['Positive'] + df_r_p['Negative'])) * 100,2)
df_r_p['% Positive'] = round((df_r_p['Positive'] / (df_r_p['Positive'] + df_r_p['Negative'])) * 100,2)
df_r_p
Out[21]:
Sentiment Negative Positive % Negative % Positive
Subreddit Post ID
Conservative mr364d 97 112 46.41 53.59
mr4lz8 14 20 41.18 58.82
mr5564 32 36 47.06 52.94
mrbef8 16 9 64.00 36.00
mrbpev 6 3 66.67 33.33
... ... ... ... ... ...
NeutralPolitics mgyhyz 0 2 0.00 100.00
mm3oyu 6 2 75.00 25.00
mm4eaw 1 2 33.33 66.67
mmuprh 0 5 0.00 100.00
mpaobl 3 1 75.00 25.00

285 rows × 4 columns

In [22]:
# Group by Subreddit and take average of averages
df_r_p.groupby('Subreddit').agg({'Negative':'sum', 
                         'Positive':'sum', 
                         '% Negative':'mean', 
                         '% Positive':'mean'}).round(decimals=2)
Out[22]:
Negative Positive % Negative % Positive
Subreddit
Conservative 866 789 51.20 48.80
Liberal 261 227 50.12 49.88
NeutralPolitics 238 509 29.40 70.60

Test 2 - Conclusion

Surprisingly (at least to me), considering only the top-level comments shows a huge swing towards positive for NeutralPolitics. This suggests that discussion within that subreddit is fairly balanced overall, but that the initial top-level responses skew positive.

Test 3

  1. Pull the same story from two ideologically different subreddits
  2. Run sentiment analysis on both sets
  3. Compare sentiment percentage from each

In this case, I'll take a look at two posts covering the same story: one within the Conservative subreddit and one within the Liberal subreddit (the latter leans center-left, much like Reddit.com's user base in general). This should at least give us an idea of how right- and left-leaning communities respond, differently or similarly, to the same story.

The story in question details the ban of the subreddit "The_Donald", a far-right subreddit that was the center of much controversy. You would assume the conservative-leaning subreddit would produce more negative sentiment about the ban, while the center-left one would produce more positive sentiment.

Test 3 - Algorithms

In [23]:
def returnAllComments(submission):
    '''
    Return all comments from a given submission as a DataFrame
    '''
    df_rows = []

    #Loop through comment hierarchy and pull all levels; retry if the API call fails
    while True:
        try:
            submission.comments.replace_more(limit=None) #Flatten comment tree
            break
        except Exception: #The PRAW docs use a PossibleExceptions placeholder here
            print("Handling replace_more exception")
            sleep(1)

    comments = submission.comments.list() #Move to list

    # Loop through comment lists
    for comment in comments:
        df_rows.append([comment.id, comment.score, comment.created, comment.body])

    df = pd.DataFrame(df_rows, columns=['Comment ID', 'Score', 'Created', 'Body'])
    return df[~df['Body'].isin(['[deleted]', '[removed]'])].reset_index(drop=True) #Exclude any deleted/removed comments
In [24]:
def runSingleStorySentimentAnalysis(df):
    '''
    Run sentiment analysis on a given DataFrame and return the aggregated results
    '''
    # Gather sentiment for every comment and dump in list
    custom_tokens = []
    results = []

    for index, row in df.iterrows():
        custom_tokens.append(remove_noise(tweet_tknzr.tokenize(row['Body'])))
        results.append([classifier.classify(dict([token, True] for token in custom_tokens[index]))])

    # Display percentages
    df_r = pd.DataFrame(results, columns=['Sentiment'])
    df_r_p = pd.DataFrame(df_r.groupby('Sentiment').size(), columns=['Count'])
    df_r_p['% of Total'] = round((df_r_p / df_r_p.sum())*100,2)
    return df_r_p
In [25]:
def returnComments(df):
    '''
    Run sentiment analysis on a given DataFrame and return [comment, sentiment] pairs for review
    '''
    # Gather sentiment for every comment and dump in list
    custom_tokens = []
    results = []

    for index, row in df.iterrows():
        custom_tokens.append(remove_noise(tweet_tknzr.tokenize(row['Body'])))
        results.append([row['Body'], classifier.classify(dict([token, True] for token in custom_tokens[index]))])

    return results

Test 3 - Conservative Subreddit

In [26]:
# https://www.reddit.com/r/Conservative/comments/hi3u79/reddit_bans_the_donald_forum_as_part_of_major/
    
submission = reddit.submission(id="hi3u79")
print(submission.title)
Reddit bans The_Donald forum as part of major hate speech purge
In [27]:
df = returnAllComments(submission)
df
Out[27]:
Comment ID Score Created Body
0 fwdvdws 52 1.593480e+09 R/consumeproduct was banned as well. RIP
1 fwdv7gi 258 1.593480e+09 It’s been inactive for months lol. Good work r...
2 fwe2ike 40 1.593483e+09 Lol the press will focus on this matter and me...
3 fwdup9y 356 1.593480e+09 Yet r/politics remains...
4 fwdueft 65 1.593479e+09 I thought this was already taken down?
... ... ... ... ...
423 fwgu2vy 1 1.593548e+09 It doesn't need to be on point to make the arg...
424 fwff5le 1 1.593508e+09 > Yet I maintain that PoC who experience oppre...
425 fwinuo2 1 1.593581e+09 says the person who still somehow supports a r...
426 fwiv93o 1 1.593585e+09 Can’t argue with stupid so I shall stop lol
427 fwj2hzb 0 1.593588e+09 Yep! Trump indeed hates it when people call hi...

428 rows × 4 columns

In [28]:
runSingleStorySentimentAnalysis(df)
Out[28]:
Count % of Total
Sentiment
Negative 236 55.14
Positive 192 44.86
In [ ]:
returnComments(df)

Sample output (truncated due to length):

[['R/consumeproduct was banned as well. RIP', 'Negative'],
 ['It’s been inactive for months lol. Good work reddit. You effectively did nothing.',
  'Negative'],
 ['Lol the press will focus on this matter and mention r/the_donald only because they can up the antitrump ante. The sub was effectively a dead stump of its former self, so its like a beating the dead horse, but well you can always milk it for antitrump hate.',
  'Negative'],
 ['Yet r/politics remains...', 'Negative'],
 ['I thought this was already taken down?', 'Negative'],
 ['Major “you don’t fit that narrative” purge. 1984', 'Negative'],
 ['Official reddit policy now *explicitly* allows hate subreddits/comments/posts against identities in the "majority."',
  'Positive'],
 ['"hate speech"? Someone misspelled "political speech"', 'Negative'],
 ["Let them ban all the right leaning subs. And then still be shocked when Trump wins the election again.\n\nI cam't wait for the left crying compilation",
  'Negative'],

Interestingly enough, the sentiment does lean negative - but only by about 10 percentage points.

Test 3 - Liberal Subreddit

In [30]:
#https://www.reddit.com/r/Liberal/comments/hi4eva/reddit_finally_bans_hate_speech_removes_2000/

submission = reddit.submission(id="hi4eva")
print(submission.title)
Reddit Finally Bans Hate Speech, Removes 2,000 Racist and Violent Forums Including The_Donald
In [31]:
df = returnAllComments(submission)
df
Out[31]:
Comment ID Score Created Body
0 fwe135b 59 1.593482e+09 Except the Canadian racist cesspool r/metacana...
1 fweaghv 140 1.593487e+09 Republicans: "Businesses should be allowed to ...
2 fwfdwlw 14 1.593507e+09 Genuinely curious what the general consensus i...
3 fwdyo8v 48 1.593481e+09 About fucking time.
4 fwe2p19 31 1.593483e+09 Words cannot express how fuckin happy I am rig...
... ... ... ... ...
184 fwekx5v -4 1.593492e+09 I never once said we should be deciding what s...
185 fwfpf4o 1 1.593514e+09 To your first point:\n1. Looking back at my pr...
186 fwel0fe 3 1.593492e+09 How do you think that is different?
187 fwfsr0t 1 1.593516e+09 Ok so since I haven’t made this clear no matte...
188 fwfwurw 1 1.593518e+09 I agree with all your points. I just see a wor...

189 rows × 4 columns

In [32]:
runSingleStorySentimentAnalysis(df)
Out[32]:
Count % of Total
Sentiment
Negative 112 59.26
Positive 77 40.74
In [ ]:
# Show details of how comments were classified
returnComments(df)

Sample output (truncated due to length):

[['Except the Canadian racist cesspool r/metacanada.\n\nStill waiting.......',
  'Negative'],
 ['Republicans: "Businesses should be allowed to reject any customer under any circumstances!"  \n  \nAlso Republicans:  "Reddit shouldn\'t be allowed to reject any customer under any circumstances!"',
  'Positive'],
 ['Genuinely curious what the general consensus is about the part that says it does not protect the majority. Seems discriminatory.\n\n>While the rule on hate protects such groups, it does not protect all groups or all forms of identity. For example, the rule does not protect groups of people who are in the majority',
  'Negative'],
 ['About fucking time.', 'Negative'],
 ['Words cannot express how fuckin happy I am right now!', 'Negative'],
 ['Did they ban all the subscribers too or just the reddit?', 'Negative'],
 ['This isn\'t to all of you, but it\'s to enough of you...\n\nWhat disabled person asked anyone to defend us from words?  I don\'t even remember electing a representative for that... Trying to "protect" me from words is actually pretty rude.

Test 3 - Conclusion

In this instance - we see a single story producing "mostly" negative comments within both subreddits. This result shows that sentiment is highly subjective, and that you cannot simply infer something will be seen as more positive or negative because of a community's political leaning. It also shows, once again, how heavily sentiment depends on context: although many negative comments in the Liberal subreddit are premised on censorship being wrong, many comments classified as negative are in fact agreeing with the ban. They are marked negative purely because of their verbiage, not their context. Either way, in this case there still seems to be a lot of common ground among users.