A quick example of how to use the K-Nearest Neighbors model in Python. We're working with an anonymized dataset with no known context: all we know is that each row has a TARGET CLASS of 0/1, and we want to predict which class a row belongs to. K-Nearest Neighbors classifies a point by looking at the classes of a chosen number of points nearest to it, and it's fairly simple to implement. The dataset can be found on my GitHub here.
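Under the hood the idea is simple: measure the distance from a new point to every training point, take the k closest, and let them vote. A toy from-scratch sketch of that (not the scikit-learn implementation used below):

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    """Toy KNN: majority vote among the k nearest training points."""
    dists = np.linalg.norm(X_train - x_new, axis=1)  # Euclidean distance to every training row
    nearest = np.argsort(dists)[:k]                  # indices of the k closest points
    votes = y_train[nearest]                         # their class labels
    return np.bincount(votes).argmax()               # most common label wins

# Tiny example: two clusters, classes 0 and 1
X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y = np.array([0, 0, 1, 1])
print(knn_predict(X, y, np.array([0.95, 1.0]), k=3))  # prints 1
```

The rest of the notebook does the same thing with scikit-learn, which handles the bookkeeping (and much faster neighbor searches) for us.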

## K-Nearest Neighbors - Simple Example

### I.e., predict a point's class from the classes of the points surrounding it

Uses anonymized data with unknown columns and no context. We want to find out whether a row belongs to the TARGET CLASS or not.

In :
```#Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
```
In :
```#Read in the data
df = pd.read_csv('data.csv') #Replace with the actual dataset filename from the repo
```
In :
```df.head()
```
Out:
WTT PTI EQW SBI LQE QWG FDJ PJF HQE NXJ TARGET CLASS
0 0.913917 1.162073 0.567946 0.755464 0.780862 0.352608 0.759697 0.643798 0.879422 1.231409 1
1 0.635632 1.003722 0.535342 0.825645 0.924109 0.648450 0.675334 1.013546 0.621552 1.492702 0
2 0.721360 1.201493 0.921990 0.855595 1.526629 0.720781 1.626351 1.154483 0.957877 1.285597 0
3 1.234204 1.386726 0.653046 0.825624 1.142504 0.875128 1.409708 1.380003 1.522692 1.153093 1
4 1.279491 0.949750 0.627280 0.668976 1.232537 0.703727 1.115596 0.646691 1.463812 1.419167 1

### Standardize the dataset

Transform the data so that each column's distribution has a mean of 0 and a standard deviation of 1. Essentially this "standardizes" the dataset so that every feature sits on a common scale, which is useful when comparing data measured in different units.
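As a tiny sanity check, standardizing a column by hand reproduces what StandardScaler does per column: subtract the mean, divide by the standard deviation.

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0])
z = (x - x.mean()) / x.std()  # same formula StandardScaler applies to each column

print(z.mean())  # 0.0 -- standardized data is centered on zero
print(z.std())   # 1.0 -- and has unit standard deviation
```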

In :
```#Import Sci-kit learn
from sklearn.preprocessing import StandardScaler
```
In :
```#Scaler Object
scaler = StandardScaler()
```
In :
```#Fit data to everything except Target Class column first (before transforming)
scaler.fit(df.drop('TARGET CLASS',axis=1))
```
Out:
`StandardScaler()`
In :
```#Scale the dataset, minus the TARGET CLASS column
scaled_features = scaler.transform(df.drop('TARGET CLASS',axis=1))
```
In :
```#Put the new scaled dataset into a dataframe
#Get all column names except TARGET CLASS (the last one)
df_feat = pd.DataFrame(scaled_features,columns=df.columns[:-1])
```
In :
```#Review the re-scaled dataset
df_feat.head()
```
Out:
WTT PTI EQW SBI LQE QWG FDJ PJF HQE NXJ
0 -0.123542 0.185907 -0.913431 0.319629 -1.033637 -2.308375 -0.798951 -1.482368 -0.949719 -0.643314
1 -1.084836 -0.430348 -1.025313 0.625388 -0.444847 -1.152706 -1.129797 -0.202240 -1.828051 0.636759
2 -0.788702 0.339318 0.301511 0.755873 2.031693 -0.870156 2.599818 0.285707 -0.682494 -0.377850
3 0.982841 1.060193 -0.621399 0.625299 0.452820 -0.267220 1.750208 1.066491 1.241325 -1.026987
4 1.139275 -0.640392 -0.709819 -0.057175 0.822886 -0.936773 0.596782 -1.472352 1.040772 0.276510

### Split data into train/test and predict

In :
```#Import splitting library
from sklearn.model_selection import train_test_split
```
In :
```#Set X,Y
X = df_feat
y = df['TARGET CLASS']
```
In :
```#Choose the test size
#Test size = % of dataset allocated for testing (.3 = 30%)
#Random state = seed so the split is reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)
```
In :
```#Import KNN classifier library
from sklearn.neighbors import KNeighborsClassifier
```
In :
```#Build model object
knn = KNeighborsClassifier(n_neighbors=1)
```
In :
```#Fit model
knn.fit(X_train,y_train)
```
Out:
`KNeighborsClassifier(n_neighbors=1)`
In :
```#Get predictions
predictions = knn.predict(X_test)
```
In :
```#See if the model worked, print reports (overall it worked very well)
from sklearn.metrics import classification_report, confusion_matrix
```
In :
```print(confusion_matrix(y_test,predictions))
```
```[[151   8]
[ 15 126]]
```
In :
```print(classification_report(y_test,predictions))
```
```              precision    recall  f1-score   support

           0       0.91      0.95      0.93       159
           1       0.94      0.89      0.92       141

    accuracy                           0.92       300
   macro avg       0.92      0.92      0.92       300
weighted avg       0.92      0.92      0.92       300

```
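The confusion matrix above can be turned into an overall accuracy and error rate directly, since correct predictions sit on the diagonal:

```python
import numpy as np

#Confusion matrix from the output above
cm = np.array([[151,   8],
               [ 15, 126]])

accuracy = np.trace(cm) / cm.sum()  # correct predictions / all predictions
error_rate = 1 - accuracy

print(round(accuracy, 4), round(error_rate, 4))  # prints 0.9233 0.0767
```

That ~7.7% error rate with k=1 is the baseline the next section tries to beat.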

### Determining a better K-value

How exactly can we determine what K-value produces the least error rate?

In :
```#First let's see what error_rate we get by looping k-values

error_rate = [] #Empty

#Choose the max k-value to test (40 this time), start at 1
for i in range(1,40):
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train,y_train)
    predictions_i = knn.predict(X_test)
    error_rate.append(np.mean(predictions_i != y_test)) #Fraction of predictions that are NOT correct
```
In :
```#Plot out the error rates
plt.figure(figsize=(10,5))
plt.plot(range(1,40), error_rate, color='blue', linestyle='--',
         marker='o', markerfacecolor='red', markersize=10)
plt.title('Error Rate depending on K-value')
plt.xlabel('K-value')
plt.ylabel('Error Rate')
```
Out:
`Text(0, 0.5, 'Error Rate')` 
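Once the plot shows where the error rate bottoms out, retrain with that k. As an alternative to the manual loop, scikit-learn's GridSearchCV can cross-validate every k and pick the best one automatically. A sketch, using synthetic data as a stand-in for the scaled features and labels above:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

#Synthetic stand-in for the scaled features/labels used in the notebook
X, y = make_classification(n_samples=300, n_features=10, random_state=101)

#Try every k from 1 to 39, scoring each with 5-fold cross-validation
grid = GridSearchCV(KNeighborsClassifier(),
                    param_grid={'n_neighbors': list(range(1, 40))},
                    cv=5, scoring='accuracy')
grid.fit(X, y)

print(grid.best_params_['n_neighbors'])  # the k with the best cross-validated accuracy
```

Cross-validation also avoids tuning k against the one test split, which the loop above quietly does.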