The first principal component is the axis along which the data shows the greatest variance. The second component is orthogonal (perpendicular) to the first and captures the second greatest variance, and so on.

This allows us to reduce the number of variables used in an analysis.

Taking this a step further, we can extend the same idea to higher dimensions, where each new axis is called a “component”.
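
To make the idea of ordered, perpendicular axes concrete, here is a minimal sketch using plain NumPy. The toy data, the seed and the variable names are my own assumptions for illustration; they are not part of the dataset used later in this post.

```python
import numpy as np

# Hypothetical toy data: 500 correlated 2-D points (illustration only)
rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = 0.5 * x + rng.normal(scale=0.3, size=500)
data = np.column_stack([x, y])

# Center the data, then take the eigenvectors of the covariance matrix
centered = data - data.mean(axis=0)
cov = np.cov(centered, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# eigh returns eigenvalues in ascending order; reverse so the largest comes first
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

first_pc, second_pc = eigenvectors[:, 0], eigenvectors[:, 1]

# Variance of the data projected onto each component
projected_variance = np.var(centered @ eigenvectors, axis=0, ddof=1)

print("Variance along each component:", projected_variance)
print("Dot product of the two components:", first_pc @ second_pc)
```

The first value printed should be the larger of the two variances, and the dot product should be zero up to floating point error, which is what "orthogonal" means here.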

If we start with a dataset that has a large number of variables, this lets us capture most of the variation in a small number of components, although these can be tough to interpret. A much more detailed walk-through on the theory can be found here. I’m going to show how this analysis can be done using scikit-learn in Python. The dataset we’re going to use can be loaded directly from sklearn, as shown below.
#Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
#Import Data
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()
#Some info about this dataset
print(cancer['DESCR'])
#Convert to dataframe
df = pd.DataFrame(cancer['data'],columns=cancer['feature_names'])
df.head()
#Import scaler and scale
from sklearn.preprocessing import StandardScaler
#Instantiate object
scaler = StandardScaler()
#Fit scaler to df
scaler.fit(df)
#Scale data
scaled_data = scaler.transform(df)
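
Scaling matters here because PCA is driven by variance, so a feature measured on a large scale would otherwise dominate the components. As a quick sanity check (a sketch of my own, with hypothetical variable names), the block below confirms that each scaled column ends up with mean 0 and standard deviation 1. It uses `fit_transform`, which combines the separate fit and transform calls into one.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler

features = load_breast_cancer()['data']

# fit_transform is shorthand for calling fit() and then transform()
scaled = StandardScaler().fit_transform(features)

# Before scaling, the column standard deviations differ by orders of magnitude
raw_std = features.std(axis=0)
print("Raw std range:", raw_std.min(), "to", raw_std.max())

# After scaling, every column has mean ~0 and std ~1
means_are_zero = np.allclose(scaled.mean(axis=0), 0, atol=1e-8)
stds_are_one = np.allclose(scaled.std(axis=0), 1)
print("Means ~0:", means_are_zero, "| Stds ~1:", stds_are_one)
```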
from sklearn.decomposition import PCA
#Create object with 2 components
pca = PCA(n_components=2)
#Fit scaled data
pca.fit(scaled_data)
#Transform
x_pca = pca.transform(scaled_data)
#Shows that we've reduced down to the 2 principal components
x_pca.shape
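
Reducing 30 columns to 2 raises the question of how much of the original variation is kept. One way to check is the `explained_variance_ratio_` attribute of a fitted `PCA` object. The sketch below refits everything so it runs on its own; the names `pca_check` and `ratios` are just placeholders I picked.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

scaled = StandardScaler().fit_transform(load_breast_cancer()['data'])

# Fit a 2-component PCA, as in the post
pca_check = PCA(n_components=2).fit(scaled)

# Fraction of total variance captured by each component, largest first
ratios = pca_check.explained_variance_ratio_
print("Per component:", ratios)
print("Total retained:", ratios.sum())
```

If the total looks too low for your purposes, raising `n_components` is the obvious knob to turn, at the cost of a plot that no longer fits in two dimensions.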
#Plot these
sns.set_style("darkgrid")
plt.figure(figsize=(8,6))
plt.scatter(x_pca[:,0],x_pca[:,1],c=cancer['target'],cmap='Accent')
plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component')
#We have the component values here
pca.components_
#Set these to a new DataFrame
df_comp = pd.DataFrame(pca.components_,columns=cancer['feature_names'])
df_comp.head()
plt.figure(figsize=(15,6))
sns.heatmap(df_comp,cmap='plasma')
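
The heatmap can be hard to read with 30 columns. As a rough aid to interpretation, here is a sketch that lists the features with the largest absolute weight in each component. The row labels `PC1` and `PC2` and the cutoff of five features are my own choices, not anything prescribed by sklearn.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

cancer_data = load_breast_cancer()
scaled = StandardScaler().fit_transform(cancer_data['data'])
pca_fit = PCA(n_components=2).fit(scaled)

# One row per component, one column per original feature
loadings = pd.DataFrame(pca_fit.components_,
                        columns=cancer_data['feature_names'],
                        index=['PC1', 'PC2'])

# Rank features by absolute weight within each component
top_features = {}
for name, row in loadings.iterrows():
    top_features[name] = list(row.abs().sort_values(ascending=False).head(5).index)
    print(name, top_features[name])
```

Features that share a large weight in the same component tend to vary together, which gives at least a starting point for putting a label on each axis.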
#Import splitting library
from sklearn.model_selection import train_test_split
#Set X,Y
X = x_pca
y = cancer['target']
#Choose the test size
#Test size = % of dataset allocated for testing (.3 = 30%)
#Random state = seed for the shuffle, so the same split is produced on every run
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)
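
To see what `random_state` actually does, here is a small sketch that splits the raw data three times. The second seed (7) is an arbitrary value I chose for comparison.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
features, labels = data['data'], data['target']

# Two splits with the same seed give identical partitions
_, X_test_a, _, _ = train_test_split(features, labels, test_size=0.3, random_state=101)
_, X_test_b, _, _ = train_test_split(features, labels, test_size=0.3, random_state=101)
same_seed_matches = np.array_equal(X_test_a, X_test_b)

# A different seed (arbitrary choice) shuffles the rows differently
_, X_test_c, _, _ = train_test_split(features, labels, test_size=0.3, random_state=7)
different_seed_matches = np.array_equal(X_test_a, X_test_c)

print("Rows in full set / test set:", len(features), len(X_test_a))
print("Same seed, same split:", same_seed_matches)
print("Different seed, same split:", different_seed_matches)
```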
#Import library
from sklearn.svm import SVC
#Create object
model = SVC()
#Fit
model.fit(X_train,y_train)
#Predict
predictions = model.predict(X_test)
#See if the model worked, print reports (worked very well)
from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(y_test,predictions))
print(classification_report(y_test,predictions))
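
One caveat with the steps above: the scaler and PCA were fit on the full dataset before splitting, so the test rows had some influence on the transformation. A possible alternative, sketched here rather than taken from the original notebook, is to split first and wrap the three steps in an sklearn `Pipeline` so that everything is fit on the training rows only. The variable names are placeholders.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVC

data = load_breast_cancer()

# Split the raw features first, using the same settings as the post
X_tr, X_te, y_tr, y_te = train_test_split(
    data['data'], data['target'], test_size=0.3, random_state=101)

# Scaling, PCA and the classifier are all fit on the training rows only
pipe = make_pipeline(StandardScaler(), PCA(n_components=2), SVC())
pipe.fit(X_tr, y_tr)

# Mean accuracy on the held-out rows
accuracy = pipe.score(X_te, y_te)
print("Test accuracy:", accuracy)
```

The score may differ slightly from the reports printed earlier, since the transformation is no longer informed by the test rows.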