Another quick bit of NLP, this time comparing two approaches to unsupervised topic clustering. Details below in the Jupyter Notebook.
Natural Language Processing in Python¶
LDA vs NMF Topic clustering¶
A quick, no-nonsense overview of LDA (Latent Dirichlet Allocation) vs. NMF (Non-Negative Matrix Factorization) for unsupervised topic clustering in NLP.
The dataset I'm using is located here, and contains a large set of news stories scraped from NPR (mostly political).
Through these two examples I will categorize each article into one of 8 topics discovered by LDA/NMF.
LDA¶
Note that for LDA we need CountVectorizer rather than TF-IDF, since LDA's probabilistic model is defined over raw word counts.
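To make the distinction concrete, here is a minimal sketch on a toy corpus (not the NPR data) contrasting the integer counts from CountVectorizer with the fractional weights from TfidfVectorizer:
# Toy corpus only -- illustrates why LDA wants CountVectorizer output
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
docs = ["the cat sat", "the cat sat on the mat"]
print(CountVectorizer().fit_transform(docs).toarray())           # integer word counts -- what LDA expects
print(TfidfVectorizer().fit_transform(docs).toarray().round(2))  # fractional TF-IDF weights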
# Imports
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
# Data
npr = pd.read_csv('Datasets/npr.csv')
npr.head()
npr.shape
Form Vector Matrices¶
# Instantiate CountVectorizer
# Ignore words that appear in more than 90% of the NPR documents (max_df=0.9)
# Keep only words that appear in at least 2 documents (min_df=2)
# Eliminate English stop-words
cv = CountVectorizer(max_df=0.9,min_df=2,stop_words='english')
# Create the Document-Term Matrix by fit-transforming the Article text in the DataFrame
# Shape: number of articles by vocabulary size
dtm = cv.fit_transform(npr['Article'])
dtm.shape
# Instantiate LDA, form 8 Topics
LDA = LatentDirichletAllocation(n_components=8,random_state=42)
# Fit to dtm (may take a while)
LDA.fit(dtm)
Categorize Articles into Topics¶
cv --> holds the vocabulary of every word that survived filtering, across all documents
LDA.components_ --> 8 topics, each an array of unnormalized word weights; the higher a word's weight, the more strongly it belongs to that topic
# Look up a single word in the vocabulary (index 1500 picked at random)
cv.get_feature_names_out()[1500]
# 8 topics by the full vocabulary size -- one weight per word per topic
LDA.components_.shape
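The rows of LDA.components_ are unnormalized weights, not true probabilities; a minimal sketch (using the fitted LDA from above) of normalizing each row so it sums to 1, which recovers the actual per-topic word distribution:
# Normalize each topic's word weights into a proper probability distribution
word_probs = LDA.components_ / LDA.components_.sum(axis=1, keepdims=True)
word_probs[0].sum()   # each row now sums to 1.0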
# argsort returns index positions sorted least to greatest,
# so the last entries point to the words with the highest weights for a topic
# Grab the indices of the 10 words most strongly associated with Topic 0 (components_[0])
top10 = LDA.components_[0].argsort()[-10:]
top10
# Map the top-10 indices above back to English words
# These are the words that characterize Topic 0
for i in top10:
    print(cv.get_feature_names_out()[i])
# Print out the top 10 words for each of the 8 topics
for i, topic in enumerate(LDA.components_):
    print(f"THE TOP 10 WORDS FOR TOPIC #{i}")
    print([cv.get_feature_names_out()[index] for index in topic.argsort()[-10:]])
    print("\n")
# LDA.transform returns, for each document, a probability distribution over the 8 topics
# The first document is most strongly associated with Topic 1 at ~90% (topics indexed from zero)
topic_results = LDA.transform(dtm)
topic_results[0].round(2)
# Add Topic number to NPR DataFrame to associate story with a Topic
npr['Topic'] = topic_results.argmax(axis=1)
npr.head()
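Topic numbers alone aren't very readable, so a natural follow-up is mapping them to labels. The labels below are hypothetical placeholders; you'd pick real ones after reading each topic's top-10 word list:
# Hypothetical placeholder labels -- substitute your own after inspecting
# the top-10 words printed for each topic
topic_labels = {i: f"topic_{i}" for i in range(8)}
npr['Topic_Label'] = npr['Topic'].map(topic_labels)
npr.head()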
NMF¶
Utilizes TF-IDF instead of CountVectorizer
Otherwise follows nearly the exact same pattern as the LDA workflow above
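For reference, TfidfVectorizer's default weighting uses a smoothed IDF, idf(t) = ln((1 + n_docs) / (1 + df(t))) + 1, then L2-normalizes each document row. A minimal sketch on a toy corpus (not the NPR data) checking that formula against the fitted vectorizer:
# Verify sklearn's smoothed-IDF formula on a toy corpus
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
docs = ["cats and dogs", "dogs and dogs"]
vec = TfidfVectorizer()                        # smooth_idf=True, norm='l2' by default
vec.fit(docs)
manual_idf = np.log((1 + 2) / (1 + 2)) + 1     # 'dogs': df = 2 out of 2 docs
print(manual_idf, vec.idf_)                    # 1.0 matches the fitted idf_ for 'dogs'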
# Imports
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
# Instantiate TfidfVectorizer
# Ignore words that appear in more than 95% of the NPR documents (max_df=0.95)
# Keep only words that appear in at least 2 documents (min_df=2)
# Eliminate English stop-words
tfidf = TfidfVectorizer(max_df=0.95,min_df=2,stop_words='english')
# Create the Document-Term Matrix by fit-transforming the Article text in the DataFrame
# Shape: number of articles by vocabulary size
dtm = tfidf.fit_transform(npr['Article'])
dtm.shape
# Instantiate NMF, form 8 Topics
nmf_model = NMF(n_components=8,random_state=42)
# Fit model
nmf_model.fit(dtm)
Categorize Articles into Topics¶
tfidf --> holds the vocabulary of every word that survived filtering, across all documents
nmf_model.components_ --> 8 topics, each an array of non-negative word weights; the higher a word's weight, the more strongly it belongs to that topic
# Look up a single word in the vocabulary (index 1500 picked at random)
tfidf.get_feature_names_out()[1500]
# 8 topics by the full vocabulary size -- one weight per word per topic
nmf_model.components_.shape
# argsort returns index positions sorted least to greatest,
# so the last entries point to the words with the highest weights for a topic
# Grab the indices of the 10 words most strongly associated with Topic 0 (components_[0])
top10 = nmf_model.components_[0].argsort()[-10:]
top10
# Map the top-10 indices above back to English words
# These are the words that characterize Topic 0
for i in top10:
    print(tfidf.get_feature_names_out()[i])
# Print out the top 10 words for each of the 8 topics
for i, topic in enumerate(nmf_model.components_):
    print(f"THE TOP 10 WORDS FOR TOPIC #{i}")
    print([tfidf.get_feature_names_out()[index] for index in topic.argsort()[-10:]])
    print("\n")
# nmf_model.transform returns, for each document, a non-negative weight per topic
# (unlike LDA's output these are not probabilities and don't sum to 1)
# The first document's largest weight (~0.12) falls on Topic 1 (topics indexed from zero)
topic_results = nmf_model.transform(dtm)
topic_results[0].round(2)
# Add Topic number to NPR DataFrame to associate story with a Topic
npr['Topic'] = topic_results.argmax(axis=1)
npr.head()
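As a final sanity check, it can be interesting to see how often the two models agree on an article's topic. A sketch, assuming the LDA assignments were first copied into a hypothetical npr['LDA_Topic'] column before 'Topic' was overwritten with the NMF results:
# Hypothetical comparison -- assumes npr['LDA_Topic'] was saved earlier.
# Topic indices are arbitrary per model, so look for one dominant cell per
# row rather than agreement on the diagonal.
pd.crosstab(npr['LDA_Topic'], npr['Topic'])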