Just a little example of how to use Decision Trees and Random Forests in Python. Basically, a decision tree is a type of "flow-chart": nodes and edges that walk a record down to a likely outcome. Because a single tree tends to overfit and can change a lot with small changes in the data, we can utilize Random Forests, which build many decision trees on random subsets of the rows and features and then combine (vote on/average) their predictions. This is a very brief and vague explanation, as these posts are meant to be quick shots of how-to code and not to teach the theory behind the method. In this example we're utilizing a small healthcare dataset that predicts whether spinal surgery for an individual was successful in helping a particular condition (Kyphosis). Both methods are shown to compare accuracy. The dataset can be found on my github here.

Decision Trees and Random Forests - Simple Example

i.e. predict values based on a "flow-chart" of nodes/edges

In this example we're utilizing a healthcare dataset that predicts if spinal surgery for an individual was successful to help a particular condition (Kyphosis).

In [1]:
#Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
In [2]:
#Read in the data
df = pd.read_csv('kyphosis.csv')
In [4]:
df.head()
#Kyphosis = Was the condition absent or present after the surgery?
#Age = Age of person in months
#Number = Number of vertebrae involved
#Start = Number of the first (topmost) vertebra operated on
Out[4]:
Kyphosis Age Number Start
0 absent 71 3 5
1 absent 158 3 14
2 present 128 4 5
3 absent 2 5 1
4 absent 1 4 15
In [29]:
df.info()
#You'll see this is a very small dataset
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 81 entries, 0 to 80
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Kyphosis  81 non-null     object
 1   Age       81 non-null     int64 
 2   Number    81 non-null     int64 
 3   Start     81 non-null     int64 
dtypes: int64(3), object(1)
memory usage: 2.7+ KB
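(Not part of the original notebook, but a quick one-liner worth sketching: the target classes are quite unbalanced, which helps explain the weak "present" scores later on.)

#Quick check of the class balance in the target column
df['Kyphosis'].value_counts()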

Decision Tree - Split data into test/train and predict

In [8]:
#Import splitting library
from sklearn.model_selection import train_test_split
In [10]:
#Set X,Y
X = df.drop('Kyphosis',axis=1)
y = df['Kyphosis']
In [11]:
#Choose the test size
#Test size = % of dataset allocated for testing (.3 = 30%)
#Random state = seed for the random split (so the same split/results are reproducible)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)
In [12]:
#Import library
from sklearn.tree import DecisionTreeClassifier
In [13]:
#Create object
dtree = DecisionTreeClassifier()
In [14]:
#Fit
dtree.fit(X_train,y_train)
Out[14]:
DecisionTreeClassifier()
In [15]:
#Predict
predictions = dtree.predict(X_test)
In [19]:
#See if the model worked, print reports (not particularly wonderful here...)
from sklearn.metrics import classification_report, confusion_matrix
In [20]:
print(confusion_matrix(y_test,predictions))
[[12  5]
 [ 6  2]]
In [21]:
print(classification_report(y_test,predictions))
              precision    recall  f1-score   support

      absent       0.67      0.71      0.69        17
     present       0.29      0.25      0.27         8

    accuracy                           0.56        25
   macro avg       0.48      0.48      0.48        25
weighted avg       0.54      0.56      0.55        25
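(A quick optional sketch, not in the original write-up: sklearn can draw the fitted tree so you can actually see the "flow-chart" of nodes/edges. It assumes the dtree, X, and plt objects from the cells above.)

#Plot the fitted decision tree (requires scikit-learn 0.21+)
from sklearn.tree import plot_tree

plt.figure(figsize=(12,8))
plot_tree(dtree, feature_names=list(X.columns), class_names=list(dtree.classes_), filled=True)
plt.show()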

Random Forest - Split data into test/train and predict

You'll notice this performs much better (in this instance) than the decision tree above

In [22]:
#Import library
from sklearn.ensemble import RandomForestClassifier
In [23]:
#Create object - n_estimators = number of trees in the forest (200 here)
rfc = RandomForestClassifier(n_estimators=200)
In [24]:
#Fit
rfc.fit(X_train,y_train)
Out[24]:
RandomForestClassifier(n_estimators=200)
In [25]:
#Predict
rfc_predictions = rfc.predict(X_test)
In [27]:
#See if the model worked, print reports (MUCH better here)
print(confusion_matrix(y_test,rfc_predictions))
[[17  0]
 [ 6  2]]
In [28]:
print(classification_report(y_test,rfc_predictions))
              precision    recall  f1-score   support

      absent       0.74      1.00      0.85        17
     present       1.00      0.25      0.40         8

    accuracy                           0.76        25
   macro avg       0.87      0.62      0.62        25
weighted avg       0.82      0.76      0.71        25
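(One more optional sketch, not in the original post: the fitted forest exposes feature_importances_, which gives a rough sense of which columns drove the predictions. Exact values will vary run to run since the forest is random.)

#Rank the features by how much the forest relied on them
importances = pd.Series(rfc.feature_importances_, index=X.columns).sort_values(ascending=False)
print(importances)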