A short example of how to use a K Means Clustering model in Python: that is, how to take data and assign each observation to one of a given number of clusters. In this example we use a fairly generic dataset of universities and try to recover two clusters corresponding to Public and Private universities. Here we know whether each university is Public or Private, so we can actually evaluate the accuracy of the model, which is not usually possible in real-world applications. The dataset can be found on my github here.
K Means Clustering - Simple Example
For this project we will attempt to use KMeans Clustering to cluster universities into two groups, Private and Public.
The Data
We will use a data frame with 777 observations on the following 18 variables.
- Private A factor with levels No and Yes indicating private or public university
- Apps Number of applications received
- Accept Number of applications accepted
- Enroll Number of new students enrolled
- Top10perc Pct. new students from top 10% of H.S. class
- Top25perc Pct. new students from top 25% of H.S. class
- F.Undergrad Number of fulltime undergraduates
- P.Undergrad Number of parttime undergraduates
- Outstate Out-of-state tuition
- Room.Board Room and board costs
- Books Estimated book costs
- Personal Estimated personal spending
- PhD Pct. of faculty with Ph.D.’s
- Terminal Pct. of faculty with terminal degree
- S.F.Ratio Student/faculty ratio
- perc.alumni Pct. alumni who donate
- Expend Instructional expenditure per student
- Grad.Rate Graduation rate
#Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set_style("whitegrid")
#Data Imports
df = pd.read_csv('College_Data', index_col=0)
df.head()
Minor Data Cleansing
Notice that there seems to be a private school with a graduation rate higher than 100%. We need to find it and correct the value.
#Find the error
df[df['Grad.Rate'] > 100]
#Set that school's graduation rate to 100 (.loc avoids chained assignment, which may fail to modify df)
df.loc['Cazenovia College', 'Grad.Rate'] = 100
K Means Cluster Creation
Now it is time to create the Cluster labels!
#Import Library
from sklearn.cluster import KMeans
#Create object, set to 2 clusters
kmeans = KMeans(n_clusters=2)
#Fit Model
kmeans.fit(df.drop('Private',axis=1))
From here we can return the cluster centers found by the fitted model
kmeans.cluster_centers_
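The raw array of centers is hard to read because its columns are unnamed. As a minimal sketch, the centers can be wrapped in a DataFrame so each value carries its feature name. The `toy` frame below is a hypothetical stand-in with two made-up groups, not the real college data, and `n_init` and `random_state` are set only to make the sketch repeatable.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical stand-in for the college data: two well-separated groups
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Apps": np.concatenate([rng.normal(1000, 50, 20), rng.normal(5000, 50, 20)]),
    "Outstate": np.concatenate([rng.normal(15000, 200, 20), rng.normal(7000, 200, 20)]),
})

# Fit two clusters; random_state is fixed only for repeatability
toy_kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(toy)

# Wrap the raw array so each column is labeled with its feature name
centers = pd.DataFrame(toy_kmeans.cluster_centers_, columns=toy.columns)
print(centers.round(0))
```

With the real data, the same idea would use the columns of `df.drop('Private', axis=1)` as the column names.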
Now we can plot those (only the first two features, Apps and Accept)
plt.scatter(kmeans.cluster_centers_[:,0], kmeans.cluster_centers_[:,1])
Evaluation of model
There is no perfect way to evaluate clustering if you don't have the labels. However, since this is just an exercise and we do have the labels, we can take advantage of them to evaluate our clusters.
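For the case where labels are not available, one commonly used (though imperfect) check is the silhouette score, which measures how well each point sits inside its own cluster relative to the nearest other cluster. This is not part of the original walkthrough; the sketch below uses made-up two-dimensional data purely for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Hypothetical data: two blobs centered at (0, 0) and (10, 10)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(10, 1, (50, 2))])

# Cluster, then score the result using only the features and assigned labels
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
score = silhouette_score(X, labels)
print(f"silhouette: {score:.2f}")
```

Scores range from -1 to 1, with values near 1 suggesting compact, well-separated clusters.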
Since we can't directly compare the Yes/No strings in the Private column with the numeric cluster labels, we're simply going to create a new column called "Cluster" that converts them to a 0/1 flag (1 if Private, 0 if Public).
def checker(private):
    if private=='Yes':
        return 1
    else:
        return 0
df['Cluster'] = df['Private'].apply(checker)
df.head()
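As an alternative sketch, the same Yes/No to 1/0 conversion can be written with `Series.map` and a dictionary instead of a helper function. The school names below are hypothetical placeholders.

```python
import pandas as pd

# Hypothetical three-row stand-in for the Private column
private = pd.Series(["Yes", "No", "Yes"],
                    index=["A College", "B University", "C College"])

# Map each string to its flag: 1 for private, 0 for public
flags = private.map({"Yes": 1, "No": 0})
print(flags.tolist())  # [1, 0, 1]
```

Any value missing from the dictionary would become NaN, which can help surface unexpected entries in the column.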
#Create confusion matrix and classification report, comparing the new columns to the KMeans Labels (predicted Private/Public)
from sklearn.metrics import confusion_matrix,classification_report
print(confusion_matrix(df['Cluster'],kmeans.labels_))
print(classification_report(df['Cluster'],kmeans.labels_))
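One caveat when reading these numbers: KMeans cluster IDs are arbitrary, so cluster 0 may correspond to the schools flagged 1 and vice versa, and the IDs can change between runs. A minimal sketch of one way to handle this, assuming exactly two clusters, is to score both possible mappings and keep the better one. The arrays here are made up for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true flags and cluster IDs (not results from the college data)
y_true = np.array([1, 1, 1, 0, 0, 1])
cluster_ids = np.array([0, 0, 0, 1, 1, 1])

# Score the IDs as given and with 0 and 1 swapped
acc_as_is = accuracy_score(y_true, cluster_ids)
acc_flipped = accuracy_score(y_true, 1 - cluster_ids)

# Keep whichever mapping agrees better with the true flags
aligned = cluster_ids if acc_as_is >= acc_flipped else 1 - cluster_ids
print(confusion_matrix(y_true, aligned))
```

With the real data, `df['Cluster']` and `kmeans.labels_` would take the place of the two arrays.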
Not so bad considering the algorithm is purely using the features to cluster the universities into 2 distinct groups! Hopefully you can begin to see how K Means is useful for clustering un-labeled data!
Visualize some of these predicted values against actual
At this level, you can see the difficulty of attempting to classify these points. I'm comparing the actual Private/Public value against the KMeans predicted labels for several pairs of attributes.
fig, ax = plt.subplots(1, 2, sharex=True, sharey=True, figsize=(10,5))
ax[0].set_title("Predicted")
ax[1].set_title("Actual Dataset")
sns.scatterplot(y='Grad.Rate', x='Room.Board', data=df, hue=kmeans.labels_, ax=ax[0])
sns.scatterplot(y='Grad.Rate', x='Room.Board', data=df, hue='Private', ax=ax[1])
ax[0].legend(loc='lower right')
ax[1].legend(loc='lower right')
fig, ax = plt.subplots(1, 2, sharex=True, sharey=True, figsize=(10,5))
ax[0].set_title("Predicted")
ax[1].set_title("Actual Dataset")
sns.scatterplot(y="F.Undergrad", x="Outstate", data=df, hue=kmeans.labels_, ax=ax[0])
sns.scatterplot(y="F.Undergrad", x="Outstate", data=df, hue='Private', ax=ax[1])