I've briefly worked with this dataset in my Logistic Regression tutorial, but never in its entirety. I decided to revisit it using Random Forest and submit the results to Kaggle. The dataset comes as separate training and test sets of Titanic passengers, and the goal is to predict whether each passenger survived (1) or not (0).
Titanic Dataset - Kaggle Submission
You can view my Kaggle submissions here.
Initial Imports
In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
Data
In [2]:
#Import the test and training data
df_test = pd.read_csv('test.csv')
df_train = pd.read_csv('train.csv')
In [3]:
df_test.head()
Out[3]:
   PassengerId  Pclass  Name                                          Sex      Age  SibSp  Parch  Ticket      Fare  Cabin  Embarked
0          892       3  Kelly, Mr. James                              male    34.5      0      0  330911    7.8292  NaN    Q
1          893       3  Wilkes, Mrs. James (Ellen Needs)              female  47.0      1      0  363272    7.0000  NaN    S
2          894       2  Myles, Mr. Thomas Francis                     male    62.0      0      0  240276    9.6875  NaN    Q
3          895       3  Wirz, Mr. Albert                              male    27.0      0      0  315154    8.6625  NaN    S
4          896       3  Hirvonen, Mrs. Alexander (Helga E Lindqvist)  female  22.0      1      1  3101298  12.2875  NaN    S
In [4]:
df_train.head()
Out[4]:
   PassengerId  Survived  Pclass  Name                                               Sex      Age  SibSp  Parch  Ticket               Fare  Cabin  Embarked
0            1         0       3  Braund, Mr. Owen Harris                            male    22.0      1      0  A/5 21171          7.2500  NaN    S
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0  PC 17599          71.2833  C85    C
2            3         1       3  Heikkinen, Miss. Laina                             female  26.0      0      0  STON/O2. 3101282   7.9250  NaN    S
3            4         1       1  Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0      1      0  113803            53.1000  C123   S
4            5         0       3  Allen, Mr. William Henry                           male    35.0      0      0  373450             8.0500  NaN    S
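Comparing the two previews, the training set has one column the test set lacks: Survived, the label we need to predict. A minimal sketch of how that could be confirmed in code is below. The names train_cols, test_cols and target_only are my own for illustration, and the empty frames only stand in for the real df_train and df_test.

```python
import pandas as pd

# Column names copied from the two previews above
train_cols = ["PassengerId", "Survived", "Pclass", "Name", "Sex", "Age",
              "SibSp", "Parch", "Ticket", "Fare", "Cabin", "Embarked"]
test_cols = [c for c in train_cols if c != "Survived"]

# Empty stand-in frames (assumption: the real frames have these columns)
train_stub = pd.DataFrame(columns=train_cols)
test_stub = pd.DataFrame(columns=test_cols)

# Columns present in training but absent from test
target_only = list(train_stub.columns.difference(test_stub.columns))
print(target_only)
```

With the real data loaded, the same check would be df_train.columns.difference(df_test.columns).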
Cleanse Missing Data
In [5]:
#We have some missing data here, so we're going to map out where NaN exists
#We can probably fill in the missing Age values, but Cabin likely has too many gaps to save
sns.heatmap(df_train.isnull(), yticklabels=False, cbar=False, cmap='viridis')
Out[5]:
<matplotlib.axes._subplots.AxesSubplot at...
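The heatmap gives a visual impression of where NaN values sit, but it can help to put numbers next to it. The sketch below is one way to do that, not part of the original notebook. The helper name missing_summary is hypothetical, and the small sample frame is a stand-in built from the five training rows shown earlier so the example runs without train.csv.

```python
import numpy as np
import pandas as pd

def missing_summary(df):
    """Count and percentage of NaN values per column, most missing first."""
    counts = df.isnull().sum()
    percent = (counts / len(df) * 100).round(1)
    summary = pd.DataFrame({"missing": counts, "percent": percent})
    return summary.sort_values("missing", ascending=False)

# Stand-in data: Age, Cabin and Embarked from the first five training rows
sample = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, "C123", np.nan],
    "Embarked": ["S", "C", "S", "S", "S"],
})

summary = missing_summary(sample)
print(summary)
```

In the notebook itself this would be called as missing_summary(df_train), which should make it easier to judge whether a column like Cabin is worth keeping.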