Another little bit of NLP, this time comparing two approaches to unsupervised topic clustering. Details below in the Jupyter Notebook.

Natural Language Processing in Python

LDA vs NMF Topic clustering

A quick no-nonsense overview of utilizing LDA (Latent Dirichlet Allocation) vs. NMF (Non-Negative Matrix Factorization) for unsupervised topic clustering problems in NLP.

The dataset I'm using is located here, and contains a large set of scraped news stories from NPR (mostly political).

Through these two examples I will categorize each article into one of 8 topics that are defined through LDA/NMF.

LDA

Note that for LDA we need to use CountVectorizer instead of TF-IDF, since LDA operates on raw per-document word counts.
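As a quick illustration of the difference, here's a minimal sketch on a toy corpus (separate from the NPR data) showing what each vectorizer produces:

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

toy = ["the cat sat", "the cat sat on the mat", "the dog barked"]

print(CountVectorizer().fit_transform(toy).toarray())   # non-negative integer counts - what LDA expects
print(TfidfVectorizer().fit_transform(toy).toarray())   # fractional re-weighted values - what NMF uses later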

In [1]:
# Imports
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
In [2]:
# Data
npr = pd.read_csv('Datasets/npr.csv')
In [3]:
npr.head()
Out[3]:
Article
0 In the Washington of 2016, even when the polic...
1 Donald Trump has used Twitter — his prefe...
2 Donald Trump is unabashedly praising Russian...
3 Updated at 2:50 p. m. ET, Russian President Vl...
4 From photography, illustration and video, to d...
In [4]:
npr.shape
Out[4]:
(11992, 1)

Form Vector Matrices

In [5]:
# Instantiate CountVectorizer object
# max_df=0.9: ignore words that show up in more than 90% of the documents
# min_df=2: words MUST show up in a minimum of 2 documents
# Eliminate English stop-words
cv = CountVectorizer(max_df=0.9, min_df=2, stop_words='english')
In [6]:
# Create Document Term Matrix - fit transform to Article text in DataFrame
# Number of articles by words
dtm = cv.fit_transform(npr['Article'])
dtm.shape
Out[6]:
(11992, 54777)
In [7]:
# Instantiate LDA, form 8 Topics
LDA = LatentDirichletAllocation(n_components=8,random_state=42)
In [8]:
# Fit to dtm (may take a while)
LDA.fit(dtm)
Out[8]:
LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
             evaluate_every=-1, learning_decay=0.7,
             learning_method='batch', learning_offset=10.0,
             max_doc_update_iter=100, max_iter=10, mean_change_tol=0.001,
             n_components=8, n_jobs=None, n_topics=None, perp_tol=0.1,
             random_state=42, topic_word_prior=None,
             total_samples=1000000.0, verbose=0)

Categorize Articles into Topics

cv --> contains the vocabulary list of all words across all the documents
LDA.components_ --> 8 topics, each with an array of per-word weights ('probabilities', up to normalization) that a word belongs to that specific topic
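Strictly speaking, the rows of components_ are unnormalized pseudo-counts rather than true probabilities; a quick sketch of turning one row into an actual distribution (assuming the fitted LDA above):

topic0 = LDA.components_[0]
topic0_dist = topic0 / topic0.sum()   # normalize the pseudo-counts
print(topic0_dist.sum())              # 1.0 - a proper distribution over the vocabulary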

In [9]:
# Look up a word in cv's vocabulary (picked index 1500 at random)
cv.get_feature_names()[1500]
Out[9]:
'accreditor'
In [10]:
# 8 Topics, with index positions of every word
LDA.components_.shape
Out[10]:
(8, 54777)
In [11]:
# argsort gives an array of index positions, ranked least to greatest by weight
# The greatest values are the words most likely to be related to the specified topic
# Code below gives the 10 words most likely to be related to Topic '0' (defined as components_[0])
top10 = LDA.components_[0].argsort()[-10:]
top10
Out[11]:
array([21228, 36310, 18349, 31464, 10425,  8149, 36283, 22673, 42561,
       42993])
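If the argsort trick is unfamiliar, here's a tiny standalone example of the pattern used above:

import numpy as np

vals = np.array([0.3, 0.9, 0.1, 0.7])
print(vals.argsort())        # [2 0 3 1] - indices that would sort the array ascending
print(vals.argsort()[-2:])   # [3 1] - indices of the two largest values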
In [12]:
# Map the top 10 above for Topic 0 to English words
# These words are clustered into Topic 0
for i in top10:
    print(cv.get_feature_names()[i])
government
percent
federal
million
company
care
people
health
said
says
In [13]:
# Print out the top 10 words for each of the 8 topics
for i,topic in enumerate(LDA.components_):
    print(f"THE TOP 10 WORDS FOR TOPIC #{i}")
    print([cv.get_feature_names()[index] for index in topic.argsort()[-10:]])
    print("\n")
    print("\n")
THE TOP 10 WORDS FOR TOPIC #0
['government', 'percent', 'federal', 'million', 'company', 'care', 'people', 'health', 'said', 'says']




THE TOP 10 WORDS FOR TOPIC #1
['new', 'security', 'obama', 'news', 'white', 'russia', 'house', 'president', 'said', 'trump']




THE TOP 10 WORDS FOR TOPIC #2
['world', 'time', 'water', 'years', 'new', 'food', 'just', 'people', 'like', 'says']




THE TOP 10 WORDS FOR TOPIC #3
['just', 'disease', 'patients', 'children', 'like', 'study', 'women', 'health', 'people', 'says']




THE TOP 10 WORDS FOR TOPIC #4
['vote', 'party', 'republican', 'campaign', 'president', 'people', 'state', 'clinton', 'said', 'trump']




THE TOP 10 WORDS FOR TOPIC #5
['new', 'way', 'music', 'really', 'time', 'know', 'think', 'people', 'just', 'like']




THE TOP 10 WORDS FOR TOPIC #6
['time', 'schools', 'people', 'education', 'just', 'new', 'like', 'students', 'school', 'says']




THE TOP 10 WORDS FOR TOPIC #7
['reported', 'government', 'according', 'city', 'told', 'reports', 'says', 'people', 'police', 'said']




In [14]:
# transform returns, per document, an array of probabilities that it belongs to each topic
# The first document is most likely associated with Topic 1 at ~90% (topics are zero-indexed)
topic_results = LDA.transform(dtm)
topic_results[0].round(2)
Out[14]:
array([0.01, 0.9 , 0.  , 0.  , 0.08, 0.  , 0.  , 0.  ])
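Since LDA's transform returns a true distribution, every row sums to 1; a quick sanity check using topic_results from above:

print(topic_results.sum(axis=1)[:5].round(4))   # each document's topic probabilities sum to ~1.0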
In [15]:
# Add Topic number to NPR DataFrame to associate story with a Topic
npr['Topic'] = topic_results.argmax(axis=1)
npr.head()
Out[15]:
Article Topic
0 In the Washington of 2016, even when the polic... 1
1 Donald Trump has used Twitter — his prefe... 1
2 Donald Trump is unabashedly praising Russian... 1
3 Updated at 2:50 p. m. ET, Russian President Vl... 1
4 From photography, illustration and video, to d... 7
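To make the assignments more readable, the numeric topics can be mapped to human labels. The labels below are just my own guesses from the top-10 word lists printed earlier, not anything produced by the model:

lda_labels = {0: 'economy/health policy', 1: 'white house/russia', 2: 'food/environment',
              3: 'medicine', 4: 'elections', 5: 'music/arts',
              6: 'education', 7: 'crime/world news'}
npr['Topic Label'] = npr['Topic'].map(lda_labels)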

NMF

Utilizing TF-IDF (TfidfVectorizer) instead of CountVectorizer
Otherwise follows nearly the exact same pattern as LDA above
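For context, NMF approximately factors the document-term matrix V (documents x words) into W (documents x topics) times H (topics x words), with all entries non-negative. A toy sketch of that factorization, independent of the NPR data:

import numpy as np
from sklearn.decomposition import NMF

V = np.random.RandomState(0).rand(6, 5)   # toy 'document-term' matrix: 6 docs x 5 words
model = NMF(n_components=2, random_state=0)
W = model.fit_transform(V)                # 6 x 2: per-document topic weights
H = model.components_                     # 2 x 5: per-topic word weights
print(np.linalg.norm(V - W @ H))          # small reconstruction error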

In [16]:
# Imports
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
In [17]:
# Instantiate TfidfVectorizer object
# max_df=0.95: ignore words that show up in more than 95% of the documents
# min_df=2: words MUST show up in a minimum of 2 documents
# Eliminate English stop-words
tfidf = TfidfVectorizer(max_df=0.95, min_df=2, stop_words='english')
In [18]:
# Create Document Term Matrix - fit transform to Article text in DataFrame
# Number of articles by words
dtm = tfidf.fit_transform(npr['Article'])
dtm.shape
Out[18]:
(11992, 54777)
In [19]:
# Instantiate NMF, form 8 Topics
nmf_model = NMF(n_components=8,random_state=42)
In [20]:
# Fit model
nmf_model.fit(dtm)
Out[20]:
NMF(alpha=0.0, beta_loss='frobenius', init=None, l1_ratio=0.0, max_iter=200,
  n_components=8, random_state=42, shuffle=False, solver='cd', tol=0.0001,
  verbose=0)

Categorize Articles into Topics

tfidf --> contains the vocabulary list of all words across all the documents
nmf_model.components_ --> 8 topics, each with an array of non-negative per-word weights (not true probabilities) tying a word to a specific topic
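If comparable per-topic proportions are wanted, each row of components_ can be normalized by hand; a small sketch assuming the fitted nmf_model above:

H = nmf_model.components_
H_norm = H / H.sum(axis=1, keepdims=True)   # make each topic's weights sum to 1
print(H_norm.sum(axis=1))                   # all 1.0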

In [21]:
# Look up a word in tfidf's vocabulary (picked index 1500 at random)
tfidf.get_feature_names()[1500]
Out[21]:
'accreditor'
In [22]:
# 8 Topics, with index positions of every word
nmf_model.components_.shape
Out[22]:
(8, 54777)
In [23]:
# argsort gives an array of index positions, ranked least to greatest by weight
# The greatest values are the words most strongly related to the specified topic
# Code below gives the 10 words most strongly related to Topic '0' (defined as components_[0])
top10 = nmf_model.components_[0].argsort()[-10:]
top10
Out[23]:
array([47218, 26752, 54412, 33390, 36310, 28659, 53152, 19307, 36283,
       42993])
In [24]:
# Map the top 10 above for Topic 0 to English words
# These words are clustered into Topic 0
for i in top10:
    print(tfidf.get_feature_names()[i])
study
just
years
new
percent
like
water
food
people
says
In [25]:
# Print out the top 10 words for each of the 8 topics
for i,topic in enumerate(nmf_model.components_):
    print(f"THE TOP 10 WORDS FOR TOPIC #{i}")
    print([tfidf.get_feature_names()[index] for index in topic.argsort()[-10:]])
    print("\n")
    print("\n")
THE TOP 10 WORDS FOR TOPIC #0
['study', 'just', 'years', 'new', 'percent', 'like', 'water', 'food', 'people', 'says']




THE TOP 10 WORDS FOR TOPIC #1
['election', 'republican', 'obama', 'white', 'house', 'donald', 'campaign', 'said', 'president', 'trump']




THE TOP 10 WORDS FOR TOPIC #2
['law', 'plan', 'republicans', 'affordable', 'obamacare', 'coverage', 'medicaid', 'insurance', 'care', 'health']




THE TOP 10 WORDS FOR TOPIC #3
['state', 'law', 'isis', 'russia', 'president', 'attack', 'reports', 'court', 'said', 'police']




THE TOP 10 WORDS FOR TOPIC #4
['party', 'delegates', 'vote', 'state', 'democratic', 'hillary', 'campaign', 'voters', 'sanders', 'clinton']




THE TOP 10 WORDS FOR TOPIC #5
['album', 'life', 'song', 'really', 'people', 'know', 'think', 'just', 'music', 'like']




THE TOP 10 WORDS FOR TOPIC #6
['devos', 'children', 'college', 'kids', 'teachers', 'student', 'education', 'schools', 'school', 'students']




THE TOP 10 WORDS FOR TOPIC #7
['pregnant', 'microcephaly', 'cases', 'mosquitoes', 'health', 'disease', 'mosquito', 'women', 'virus', 'zika']




In [26]:
# transform returns each document's raw (non-negative) topic coefficients
# Unlike LDA, these are not probabilities; the first document's largest coefficient (0.12) is for Topic 1 (topics are zero-indexed)
topic_results = nmf_model.transform(dtm)
topic_results[0].round(2)
Out[26]:
array([0.  , 0.12, 0.  , 0.06, 0.02, 0.  , 0.  , 0.  ])
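Because these are raw coefficients, the row doesn't sum to 1; to read it as rough proportions you can rescale it yourself (using topic_results from the cell above):

row = topic_results[0]
print((row / row.sum()).round(2))   # rescaled to sum to 1 - Topic 1 still dominates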
In [27]:
# Add Topic number to NPR DataFrame to associate story with a Topic
npr['Topic'] = topic_results.argmax(axis=1)
npr.head()
Out[27]:
Article Topic
0 In the Washington of 2016, even when the polic... 1
1 Donald Trump has used Twitter — his prefe... 1
2 Donald Trump is unabashedly praising Russian... 1
3 Updated at 2:50 p. m. ET, Russian President Vl... 3
4 From photography, illustration and video, to d... 6
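As a closing comparison, the two models' assignments could be cross-tabulated to see where they agree. This sketch assumes the LDA assignments were saved to a hypothetical 'LDA_Topic' column before 'Topic' was overwritten with the NMF results above:

import pandas as pd

# Rows: LDA topics, columns: NMF topics; note the topic indices are arbitrary,
# so agreement shows up as large cells anywhere, not just on the diagonal
pd.crosstab(npr['LDA_Topic'], npr['Topic'])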