A simple example of how to approach logistic regression in Python using Matplotlib, Seaborn, and Scikit-learn. The data is the Titanic dataset, pulled from Kaggle.com and found here. Normally it comes as separate train and test sets, but I'm analyzing the training set only, splitting it myself so I can check the accuracy of the model.
The data here is a little dirty and needed some cleansing first; other methods of initial cleanup are probably worth a separate short tutorial / post of their own.
Logistic Regression - Simple Example¶
I.E. Binary choice comparison¶
Uses the Titanic dataset from Kaggle.com. Kaggle supplies both a training set and a testing set (CSV), but I'm working on the training set alone as if it were the full dataset, showing how to split it into training and testing portions at the end. This lets us check the accuracy of the model.
We'll walk through exploring the dataset, cleansing it, and applying logistic regression.
#Initial Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
#Import training data
titanic = pd.read_csv('titanic.csv')
Exploratory¶
titanic.head()
#We have some missing data here, so we're going to map out where NaN exists
#We might be able to take care of age, but cabin is probably too bad to save
sns.heatmap(titanic.isnull(),yticklabels=False,cbar=False,cmap='viridis')
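The heatmap gives a quick visual, but exact counts are sometimes handier. A small addition of mine (not in the original flow), using plain pandas:
#Count the missing values per column numerically
titanic.isnull().sum()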
Clean the missing data¶
You could jump to the end if you already have clean data; this dataset is just a bit messy.
#Determine the average ages of passengers by class
#In an attempt to fill in the missing (NaN) values for that column
sns.set_style('whitegrid')
plt.figure(figsize=(10,7))
sns.boxplot(x='Pclass',y='Age',data=titanic)
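The hard-coded ages in the next cell are read off this plot; if you'd rather not eyeball it, a one-line groupby (my addition) gives the precise per-class figures:
#Exact per-class ages; the boxplot's center lines are the medians
titanic.groupby('Pclass')['Age'].median()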
#Fill out the Age column with the average ages of passengers per class
def impute_age(cols):
    Age = cols['Age']
    Pclass = cols['Pclass']
    if pd.isnull(Age):
        if Pclass == 1:
            return 37 #Return the avg age of passengers in 1st class
        elif Pclass == 2:
            return 29 #2nd class
        else:
            return 24 #3rd class
    else:
        return Age
#Apply the function to the Age column
titanic['Age'] = titanic[['Age','Pclass']].apply(impute_age,axis=1)
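As an aside (my addition, not how the original proceeds), the same fill can be done in one line with groupby/transform, using exact per-class medians instead of the rounded values above:
#Alternative one-liner to the function above (a no-op here, since Age is already filled)
#Fills each missing Age with the median age of that passenger's class
titanic['Age'] = titanic['Age'].fillna(
    titanic.groupby('Pclass')['Age'].transform('median'))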
#Drop the Cabin column since it's unsalvageable / not a big deal, then drop any remaining NaN rows
titanic.drop('Cabin',axis=1,inplace=True)
titanic.dropna(inplace=True)
#Recheck the heatmap
#No more problems with Age
sns.heatmap(titanic.isnull(),yticklabels=False,cbar=False,cmap='viridis')
Other data cleansing: pare down the dataset¶
#Sex and Embarked are text columns, so we convert them to dummy (0/1) variables
#drop_first avoids dummy columns that perfectly predict one another (male of 0 always means 1 of female)
sex = pd.get_dummies(titanic['Sex'],drop_first=True)
embark = pd.get_dummies(titanic['Embarked'],drop_first=True)
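To see what drop_first buys us, peek at the result; one column per category family is enough:
#A single 'male' column encodes sex (1/True = male, 0/False = female)
sex.head()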
#Add the new columns to the dataset
titanic = pd.concat([titanic,sex,embark],axis=1)
#Drop the columns we can't use
titanic.drop(['PassengerId','Sex','Embarked','Name','Ticket'],axis=1,inplace=True)
titanic.head()
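Before modeling, it's worth a quick check (my addition) that nothing non-numeric survived the cleanup, since scikit-learn needs numeric inputs:
#Every remaining column should be numeric (int/float/bool)
titanic.dtypes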
Run Machine Learning Algorithm¶
Using the training set ONLY and splitting it ourselves (pretending we don't have two CSV files already)
Determine survival
#Assign variables
X = titanic.drop('Survived', axis=1)
y = titanic['Survived']
from sklearn.model_selection import train_test_split
#Choose the test size
#test_size = fraction of the dataset allocated for testing (0.3 = 30%)
#random_state = seed for the random shuffle, making the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)
from sklearn.linear_model import LogisticRegression
#Create the model object; added max_iter since the solver hit the default iteration limit later
logmodel = LogisticRegression(max_iter=10000)
#Fit model
logmodel.fit(X_train,y_train)
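Not part of the original walkthrough, but peeking at the fitted coefficients shows which features push the survival odds up or down:
#Pair each coefficient with its feature name
#Positive values raise the predicted odds of survival, negative values lower them
print(pd.Series(logmodel.coef_[0], index=X.columns))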
#Form predictions
predictions = logmodel.predict(X_test)
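predict() applies a 0.5 probability cutoff behind the scenes. If you want the raw probabilities instead (to tune that threshold yourself, say), predict_proba exposes them; a quick sketch:
#Column 1 holds P(Survived = 1) for each test passenger
probs = logmodel.predict_proba(X_test)[:, 1]
print(probs[:5])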
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
#Simple report
print(classification_report(y_test,predictions))
#Or print out a confusion matrix
print(confusion_matrix(y_test,predictions))
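The raw confusion matrix prints as an unlabeled array; wrapping it in a DataFrame (a small convenience I've added) makes it much easier to read:
#Rows = actual outcomes, columns = predicted outcomes
print(pd.DataFrame(confusion_matrix(y_test,predictions),
                   index=['Actual: Died','Actual: Survived'],
                   columns=['Pred: Died','Pred: Survived']))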