Just a little example of how to use Decision Trees and Random Forests in Python. Basically, a decision tree is a type of flow chart, with nodes and edges that split the data step by step to determine a likely outcome. Since a single tree can easily overfit and be very sensitive to the training data, we can use a Random Forest: build many decision trees on random subsets of the features/rows and then average (or vote on) their results. This is a very brief and simplified explanation, as these posts are meant to be quick shots of how-to code and not to teach the theory behind the method. In this example we're using a small healthcare dataset that records whether spinal surgery for an individual was successful in treating a particular condition (Kyphosis). Both methods are shown to see differences in accuracy. The dataset can be found on my github here.
#Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
#Read in the data
df = pd.read_csv('kyphosis.csv')
df.head()
#Kyphosis = Was the condition absent or present after the surgery?
#Age = Age of person in months
#Number = Number of vertebrae involved
#Start = Number of the topmost vertebra operated on
df.info()
#You'll see this is a very small dataset
Decision Tree - Split data into test/train and predict
#Import splitting library
from sklearn.model_selection import train_test_split
#Set X,Y
X = df.drop('Kyphosis',axis=1)
y = df['Kyphosis']
#Choose the test size
#Test size = % of dataset allocated for testing (.3 = 30%)
#Random state = seed for the random shuffle, so the split is reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)
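If you want to see exactly what those arguments do, here's a quick standalone sanity check on dummy data (the names `X_demo`/`y_demo` are just for this sketch; the real split above uses the kyphosis DataFrame):

```python
#Dummy data: 10 rows, 2 features
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(20).reshape(10, 2)
y_demo = np.array([0, 1] * 5)

#test_size=0.3 holds out 30% of the rows; random_state fixes the shuffle
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.3, random_state=101)
print(len(Xtr), len(Xte))  #7 train rows, 3 test rows
```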
#Import library
from sklearn.tree import DecisionTreeClassifier
#Create object
dtree = DecisionTreeClassifier()
#Fit
dtree.fit(X_train,y_train)
#Predict
predictions = dtree.predict(X_test)
#See if the model worked, print reports (not particularly wonderful here...)
from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(y_test,predictions))
print(classification_report(y_test,predictions))
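One nice thing about a single tree is that you can actually read the rules it learned. Here's a hedged sketch using `sklearn.tree.export_text` on a tiny synthetic dataset so it runs on its own; for the real model you'd pass `dtree` and `feature_names=list(X.columns)` instead (the toy names below just mirror the kyphosis columns):

```python
#Print the fitted tree as text rules
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

#Synthetic stand-in data: 100 rows, 3 features
X_toy, y_toy = make_classification(n_samples=100, n_features=3,
                                   n_informative=2, n_redundant=0,
                                   random_state=101)
toy_tree = DecisionTreeClassifier(max_depth=2, random_state=101).fit(X_toy, y_toy)

#Each line is a split; leaves show the predicted class
rules = export_text(toy_tree, feature_names=['Age', 'Number', 'Start'])
print(rules)
```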
Random Forest - Split data into test/train and predict
You'll notice this performs much better (in this instance) than the decision tree above
#Import library
from sklearn.ensemble import RandomForestClassifier
#Create object - n_estimators = number of trees in the forest (200 here)
rfc = RandomForestClassifier(n_estimators=200)
#Fit
rfc.fit(X_train,y_train)
#Predict
rfc_predictions = rfc.predict(X_test)
#See if the model worked, print reports (MUCH better here)
print(confusion_matrix(y_test,rfc_predictions))
print(classification_report(y_test,rfc_predictions))
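If you'd rather have a single number to compare the two models, `accuracy_score` works too. A minimal sketch, again on synthetic data so it's self-contained; with the notebook variables you'd call `accuracy_score(y_test, predictions)` and `accuracy_score(y_test, rfc_predictions)` instead:

```python
#Side-by-side accuracy of one tree vs. a 200-tree forest
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_toy, y_toy = make_classification(n_samples=300, random_state=101)
Xtr, Xte, ytr, yte = train_test_split(X_toy, y_toy, test_size=0.3, random_state=101)

tree_acc = accuracy_score(yte, DecisionTreeClassifier(random_state=101).fit(Xtr, ytr).predict(Xte))
forest_acc = accuracy_score(yte, RandomForestClassifier(n_estimators=200, random_state=101).fit(Xtr, ytr).predict(Xte))
print(f"tree: {tree_acc:.2f}  forest: {forest_acc:.2f}")
```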