Introduction

On April 10th, 1912, the RMS Titanic departed on her maiden voyage from Southampton to New York City. After stops in Cherbourg and Queenstown to pick up additional passengers, the ship headed for New York. There were about 2,224 people aboard, including roughly 892 crew members, which was well under the ship's capacity of around 3,300. The ship was equipped with watertight compartments designed to contain flooding in the event of a breach of the hull, a technology said to make the ship "unsinkable".

The Titanic was scheduled to arrive at New York's Pier 59 on the morning of April 17th. But at around 11:40 p.m. on the night of April 14th, the ship collided with an iceberg. The collision did not puncture the hull outright; instead, it weakened the seams of the hull plates, causing them to separate. Several watertight compartments flooded, and the ship eventually sank in the Atlantic Ocean about 400 miles off the coast of Newfoundland.

Distress signals were sent out, but no ship was near enough to reach the Titanic before she sank. At around 4 a.m., the RMS Carpathia arrived on the scene in response to the distress calls. The Titanic carried only 20 lifeboats, with capacity for just 1,178 people, and approximately 1,500 people lost their lives in the disaster. About 710 survivors were taken to New York aboard the Carpathia.

So, who were the passengers that survived this tragedy? Fortunately, we have data that can give us insight into who survived and who perished. The goal is to use machine learning methods to identify patterns in the data and predict which passengers survived and which did not.

The Dataset

The Titanic dataset comes from Kaggle. The training set contains 891 rows and 12 columns, with one row per passenger. The test set contains 418 rows and 11 columns (the Survived column is withheld for the sake of the competition).

The columns of the training set are:

  • PassengerId: Kaggle passenger id
  • Survived: 1 = passenger survived, 0 = passenger did not survive
  • Pclass: Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)
  • Name: Full name of passenger including title
  • Sex: Sex of passenger
  • Age: Age of passenger
  • SibSp: Number of siblings and spouses aboard the ship
  • Parch: Number of parents and children aboard the ship
  • Ticket: Ticket number
  • Fare: Passenger fare
  • Cabin: Cabin number
  • Embarked: Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

Goal of Analysis

Using the features above, the goal is to build a model that predicts the Survived value for each passenger in the test set.

Sources:

  • https://en.wikipedia.org/wiki/RMS_Titanic
  • https://www.kaggle.com/c/titanic/data

Import Data

Load the Standard Libraries

First, load the standard Python libraries.

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

%matplotlib inline

Next, we import the training data from the train.csv file. pd.read_csv already returns a pandas DataFrame, so we save the result as df and inspect the first few rows.

# import the training data; read_csv returns a pandas DataFrame
df = pd.read_csv('train.csv')

df.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

Data Exploration

Now that the data is loaded, we can begin exploring it. The data has already been split into training and test sets. We will explore the training set and act as if the test set is not available to us at this time. This gives us a more honest measure of our model's accuracy and helps avoid overfitting.

The describe method summarizes the numerical columns in the dataset. It also helps identify columns with null values; notice the count for Age compared to the other columns.

df.describe()
PassengerId Survived Pclass Age SibSp Parch Fare
count 891.000000 891.000000 891.000000 714.000000 891.000000 891.000000 891.000000
mean 446.000000 0.383838 2.308642 29.699118 0.523008 0.381594 32.204208
std 257.353842 0.486592 0.836071 14.526497 1.102743 0.806057 49.693429
min 1.000000 0.000000 1.000000 0.420000 0.000000 0.000000 0.000000
25% 223.500000 0.000000 2.000000 20.125000 0.000000 0.000000 7.910400
50% 446.000000 0.000000 3.000000 28.000000 0.000000 0.000000 14.454200
75% 668.500000 1.000000 3.000000 38.000000 1.000000 0.000000 31.000000
max 891.000000 1.000000 3.000000 80.000000 8.000000 6.000000 512.329200

The corr method calculates pairwise correlations between the numerical columns. (In newer versions of pandas, frames with text columns require df.corr(numeric_only=True).)

c = df.corr()
c
PassengerId Survived Pclass Age SibSp Parch Fare
PassengerId 1.000000 -0.005007 -0.035144 0.036847 -0.057527 -0.001652 0.012658
Survived -0.005007 1.000000 -0.338481 -0.077221 -0.035322 0.081629 0.257307
Pclass -0.035144 -0.338481 1.000000 -0.369226 0.083081 0.018443 -0.549500
Age 0.036847 -0.077221 -0.369226 1.000000 -0.308247 -0.189119 0.096067
SibSp -0.057527 -0.035322 0.083081 -0.308247 1.000000 0.414838 0.159651
Parch -0.001652 0.081629 0.018443 -0.189119 0.414838 1.000000 0.216225
Fare 0.012658 0.257307 -0.549500 0.096067 0.159651 0.216225 1.000000
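
Since seaborn is imported but not otherwise used, one optional way to read this matrix more easily is a heatmap (a small sketch, not part of the original output):

# render the correlation matrix as an annotated heatmap
sns.heatmap(c, annot=True, fmt='.2f', cmap='coolwarm')
plt.show()
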
# Get a count of all columns including text fields
df.count()
PassengerId    891
Survived       891
Pclass         891
Name           891
Sex            891
Age            714
SibSp          891
Parch          891
Ticket         891
Fare           891
Cabin          204
Embarked       889
dtype: int64
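
An equivalent, more direct view of the missing values is df.isnull().sum(), which counts nulls per column (a small optional check):

# missing values per column: Age, Cabin, and Embarked have nulls
df.isnull().sum()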

Explore Age

# Unique values for age
df.Age.unique()
array([22.  , 38.  , 26.  , 35.  ,   nan, 54.  ,  2.  , 27.  , 14.  ,
        4.  , 58.  , 20.  , 39.  , 55.  , 31.  , 34.  , 15.  , 28.  ,
        8.  , 19.  , 40.  , 66.  , 42.  , 21.  , 18.  ,  3.  ,  7.  ,
       49.  , 29.  , 65.  , 28.5 ,  5.  , 11.  , 45.  , 17.  , 32.  ,
       16.  , 25.  ,  0.83, 30.  , 33.  , 23.  , 24.  , 46.  , 59.  ,
       71.  , 37.  , 47.  , 14.5 , 70.5 , 32.5 , 12.  ,  9.  , 36.5 ,
       51.  , 55.5 , 40.5 , 44.  ,  1.  , 61.  , 56.  , 50.  , 36.  ,
       45.5 , 20.5 , 62.  , 41.  , 52.  , 63.  , 23.5 ,  0.92, 43.  ,
       60.  , 10.  , 64.  , 13.  , 48.  ,  0.75, 53.  , 57.  , 80.  ,
       70.  , 24.5 ,  6.  ,  0.67, 30.5 ,  0.42, 34.5 , 74.  ])
# mean survival rate by age decile
df.groupby(pd.qcut(df['Age'],10))['Survived'].mean().plot(figsize=(8,6), kind='bar', color='blue', alpha=.2)

Explore Pclass

df['Pclass'].value_counts(dropna = False)
3    491
1    216
2    184
Name: Pclass, dtype: int64

df.groupby('Pclass').describe()
Age Fare ... SibSp Survived
count mean std min 25% 50% 75% max count mean ... 75% max count mean std min 25% 50% 75% max
Pclass
1 186.0 38.233441 14.802856 0.92 27.0 37.0 49.0 80.0 216.0 84.154687 ... 1.0 3.0 216.0 0.629630 0.484026 0.0 0.0 1.0 1.0 1.0
2 173.0 29.877630 14.001077 0.67 23.0 29.0 36.0 70.0 184.0 20.662183 ... 1.0 3.0 184.0 0.472826 0.500623 0.0 0.0 0.0 1.0 1.0
3 355.0 25.140620 12.495398 0.42 18.0 24.0 32.0 74.0 491.0 13.675550 ... 1.0 8.0 491.0 0.242363 0.428949 0.0 0.0 0.0 0.0 1.0

3 rows × 48 columns

Survivors by Pclass

df.groupby('Pclass')['Survived'].mean()
Pclass
1    0.629630
2    0.472826
3    0.242363
Name: Survived, dtype: float64
df.groupby('Pclass')['Survived'].mean().plot(kind='bar',width=0.8)

Explore Embarked Location

df['Embarked'].value_counts(dropna = False)
S      644
C      168
Q       77
NaN      2
Name: Embarked, dtype: int64
pd.crosstab(df['Pclass'],df['Embarked'])
Embarked C Q S
Pclass
1 85 2 127
2 17 3 164
3 66 72 353

Survivors by Embarked Location

df.groupby('Embarked')['Survived'].mean()
Embarked
C    0.553571
Q    0.389610
S    0.336957
Name: Survived, dtype: float64
df.groupby('Embarked')['Survived'].mean().plot(kind='bar',width=0.8)

Feature Engineering

Age Binning

To make the age field more informative, we can group passengers into ten equal-sized age bins (deciles) and cross-tabulate them against other features.

df['age_bin'] = pd.qcut(df['Age'],10)
pd.crosstab(df['age_bin'],df['SibSp'])
SibSp 0 1 2 3 4 5
age_bin
(0.419, 14.0] 20 24 6 7 16 4
(14.0, 19.0] 60 20 3 1 2 1
(19.0, 22.0] 55 9 3 0 0 0
(22.0, 25.0] 48 15 5 2 0 0
(25.0, 28.0] 45 14 2 0 0 0
(28.0, 31.8] 47 18 0 1 0 0
(31.8, 36.0] 63 26 1 1 0 0
(36.0, 41.0] 34 17 2 0 0 0
(41.0, 50.0] 50 26 2 0 0 0
(50.0, 80.0] 49 14 1 0 0 0
pd.crosstab(df['Parch'],df['SibSp'])
SibSp 0 1 2 3 4 5 8
Parch
0 537 123 16 2 0 0 0
1 38 57 7 7 9 0 0
2 29 19 4 7 9 5 7
3 1 3 1 0 0 0 0
4 1 3 0 0 0 0 0
5 2 3 0 0 0 0 0
6 0 1 0 0 0 0 0
Since SibSp and Parch both count family members aboard, we combine them into a single family_members feature and examine survival by family size:

df['family_members'] = df['Parch'] + df['SibSp']
df.groupby('family_members')['Survived'].agg(['mean', 'size'])
mean size
family_members
0 0.303538 537
1 0.552795 161
2 0.578431 102
3 0.724138 29
4 0.200000 15
5 0.136364 22
6 0.333333 12
7 0.000000 6
10 0.000000 7
df.groupby('age_bin')['Survived'].mean()
age_bin
(0.419, 14.0]    0.584416
(14.0, 19.0]     0.390805
(19.0, 22.0]     0.283582
(22.0, 25.0]     0.371429
(25.0, 28.0]     0.393443
(28.0, 31.8]     0.393939
(31.8, 36.0]     0.483516
(36.0, 41.0]     0.358491
(41.0, 50.0]     0.397436
(50.0, 80.0]     0.343750
Name: Survived, dtype: float64

Replacing Null Values of Age

# Save median age of train set to use with transform of test set
med_age = df["Age"].median()

# Replace null values of age with median age
df["Age_Fill_Med"] = df["Age"].fillna(med_age)

Get Title From Name

# extract the title: the token after 'Lastname, ' up to the first space
df['Title'] = df['Name'].apply(lambda s: s.split(', ', 1)[1].split(' ', 1)[0])
pd.crosstab(df['Title'],df['family_members'])
family_members 0 1 2 3 4 5 6 7 10
Title
Capt. 0 0 1 0 0 0 0 0 0
Col. 2 0 0 0 0 0 0 0 0
Don. 1 0 0 0 0 0 0 0 0
Dr. 5 0 2 0 0 0 0 0 0
Jonkheer. 1 0 0 0 0 0 0 0 0
Lady. 0 1 0 0 0 0 0 0 0
Major. 2 0 0 0 0 0 0 0 0
Master. 0 3 15 4 2 9 3 3 1
Miss. 100 27 22 10 9 4 6 1 3
Mlle. 2 0 0 0 0 0 0 0 0
Mme. 1 0 0 0 0 0 0 0 0
Mr. 397 68 35 6 1 5 1 1 3
Mrs. 20 59 27 9 3 4 2 1 0
Ms. 1 0 0 0 0 0 0 0 0
Rev. 4 2 0 0 0 0 0 0 0
Sir. 0 1 0 0 0 0 0 0 0
the 1 0 0 0 0 0 0 0 0
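
Note the stray 'the' title, which appears to come from a name such as 'the Countess. of ...' being split on its first space. For reference, a regex-based extraction (a sketch, not part of the original workflow; it captures the title without the trailing period) is more robust:

# alternative: capture everything between the comma and the first period
df['Title_regex'] = df['Name'].str.extract(r',\s*([^.]*)\.', expand=False).str.strip()
df['Title_regex'].value_counts().head()
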
df.groupby('Title').describe()
Age Age_Fill_Med ... Survived family_members
count mean std min 25% 50% 75% max count mean ... 75% max count mean std min 25% 50% 75% max
Title
Capt. 1.0 70.000000 NaN 70.00 70.000 70.0 70.00 70.0 1.0 70.000000 ... 0.00 0.0 1.0 2.000000 NaN 2.0 2.0 2.0 2.00 2.0
Col. 2.0 58.000000 2.828427 56.00 57.000 58.0 59.00 60.0 2.0 58.000000 ... 0.75 1.0 2.0 0.000000 0.000000 0.0 0.0 0.0 0.00 0.0
Don. 1.0 40.000000 NaN 40.00 40.000 40.0 40.00 40.0 1.0 40.000000 ... 0.00 0.0 1.0 0.000000 NaN 0.0 0.0 0.0 0.00 0.0
Dr. 6.0 42.000000 12.016655 23.00 35.000 46.5 49.75 54.0 7.0 40.000000 ... 1.00 1.0 7.0 0.571429 0.975900 0.0 0.0 0.0 1.00 2.0
Jonkheer. 1.0 38.000000 NaN 38.00 38.000 38.0 38.00 38.0 1.0 38.000000 ... 0.00 0.0 1.0 0.000000 NaN 0.0 0.0 0.0 0.00 0.0
Lady. 1.0 48.000000 NaN 48.00 48.000 48.0 48.00 48.0 1.0 48.000000 ... 1.00 1.0 1.0 1.000000 NaN 1.0 1.0 1.0 1.00 1.0
Major. 2.0 48.500000 4.949747 45.00 46.750 48.5 50.25 52.0 2.0 48.500000 ... 0.75 1.0 2.0 0.000000 0.000000 0.0 0.0 0.0 0.00 0.0
Master. 36.0 4.574167 3.619872 0.42 1.000 3.5 8.00 12.0 40.0 6.916750 ... 1.00 1.0 40.0 3.675000 2.092569 1.0 2.0 3.0 5.00 10.0
Miss. 146.0 21.773973 12.990292 0.75 14.125 21.0 30.00 63.0 182.0 23.005495 ... 1.00 1.0 182.0 1.263736 1.999089 0.0 0.0 0.0 2.00 10.0
Mlle. 2.0 24.000000 0.000000 24.00 24.000 24.0 24.00 24.0 2.0 24.000000 ... 1.00 1.0 2.0 0.000000 0.000000 0.0 0.0 0.0 0.00 0.0
Mme. 1.0 24.000000 NaN 24.00 24.000 24.0 24.00 24.0 1.0 24.000000 ... 1.00 1.0 1.0 0.000000 NaN 0.0 0.0 0.0 0.00 0.0
Mr. 398.0 32.368090 12.708793 11.00 23.000 30.0 39.00 80.0 517.0 31.362669 ... 0.00 1.0 517.0 0.441006 1.154239 0.0 0.0 0.0 0.00 10.0
Mrs. 108.0 35.898148 11.433628 14.00 27.750 35.0 44.00 63.0 125.0 34.824000 ... 1.00 1.0 125.0 1.528000 1.347495 0.0 1.0 1.0 2.00 7.0
Ms. 1.0 28.000000 NaN 28.00 28.000 28.0 28.00 28.0 1.0 28.000000 ... 1.00 1.0 1.0 0.000000 NaN 0.0 0.0 0.0 0.00 0.0
Rev. 6.0 43.166667 13.136463 27.00 31.500 46.5 53.25 57.0 6.0 43.166667 ... 0.00 0.0 6.0 0.333333 0.516398 0.0 0.0 0.0 0.75 1.0
Sir. 1.0 49.000000 NaN 49.00 49.000 49.0 49.00 49.0 1.0 49.000000 ... 1.00 1.0 1.0 1.000000 NaN 1.0 1.0 1.0 1.00 1.0
the 1.0 33.000000 NaN 33.00 33.000 33.0 33.00 33.0 1.0 33.000000 ... 1.00 1.0 1.0 0.000000 NaN 0.0 0.0 0.0 0.00 0.0

17 rows × 72 columns

df['Title'].value_counts()>10
Mr.           True
Miss.         True
Mrs.          True
Master.       True
Dr.          False
Rev.         False
Mlle.        False
Col.         False
Major.       False
Jonkheer.    False
Mme.         False
Sir.         False
Don.         False
Ms.          False
Capt.        False
Lady.        False
the          False
Name: Title, dtype: bool
frequencies = df['Title'].value_counts()

# group titles that appear fewer than 10 times under 'Other'
condition = frequencies < 10
mask_obs = frequencies[condition].index
mask_dict = dict.fromkeys(mask_obs, 'Other')

df['New_Title'] = df['Title'].replace(mask_dict)
# median age per consolidated title, used to fill missing ages
title_dict = df.groupby('New_Title')['Age'].median().to_dict()
idx = df['Age'].isnull()
df['Age2'] = df['Age'].copy()
df.loc[idx, 'Age2'] = df.loc[idx, 'New_Title'].map(title_dict)
df.loc[idx, ['New_Title', 'Age', 'Age2']].head()
New_Title Age Age2
5 Mr. NaN 30.0
17 Mr. NaN 30.0
19 Mrs. NaN 35.0
26 Mr. NaN 30.0
28 Miss. NaN 21.0

Replace Null Values of Fare

The training set has no missing fares, but the test set does, so we save the training median to apply there later.

fare_median = df["Fare"].median()
fare_median
14.4542
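
Since the Fare count in the describe output above was a full 891, we can verify directly that no training fares are missing (a small optional check):

# verify: number of missing fares in the training set (expected 0)
df['Fare'].isnull().sum()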

Creating Dummy Variables for Embarked and Sex

from sklearn import preprocessing

# one-hot encode Embarked and Sex, dropping the first level of each
df = pd.get_dummies(df, columns=['Embarked', 'Sex'], drop_first=True)

# LabelEncoder shown for reference only; get_dummies already handles the encoding
le = preprocessing.LabelEncoder()
le.fit(["C", "Q", "S"])
list(le.classes_)
['C', 'Q', 'S']

Fit the Model

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
y = df['Survived']

X = df[['Age_Fill_Med','Pclass','Sex_male','Fare']]

Random Forest Model and Cross Validation

# Baseline random forest
rf = RandomForestClassifier(n_estimators=20, random_state=0)

# 10-fold cross-validation, with null ages replaced by the median age
s1 = cross_val_score(rf, X, y, cv=10)

print('X: '+'{:.2f}% +- {:.2f}%'.format(100*np.mean(s1),100*np.std(s1)))
X: 80.93% +- 4.03%

A randomized hyperparameter search will try many combinations of hyperparameters and report the best ones. With a random forest classifier, three of the main hyperparameters are max_depth, max_features, and min_samples_split.

Max depth is how deep each tree in the forest can grow, measured in levels of splits. For example, setting max_depth = 3 limits each tree to three levels of splits.

Max features is the number of features considered when searching for the best split at each node. Limiting the number of features helps prevent overfitting.

Min samples split is the minimum number of records required to split a node; the model will not split a node containing fewer records than this value.
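
For illustration, a single constrained configuration of these hyperparameters might look like this (the values here are arbitrary examples, not tuned):

# example: a deliberately shallow, regularized forest
rf_shallow = RandomForestClassifier(max_depth=3, max_features=2,
                                    min_samples_split=10, n_estimators=50,
                                    random_state=0)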

from sklearn.model_selection import RandomizedSearchCV
from time import time
from scipy.stats import randint as sp_randint

def report(results, n_top=3):
    for i in range(1, n_top + 1):
        candidates = np.flatnonzero(results['rank_test_score'] == i)
        for candidate in candidates:
            print("Model with rank: {0}".format(i))
            print("Mean validation score: {0:.3f} (std: {1:.3f})".format(
                  results['mean_test_score'][candidate],
                  results['std_test_score'][candidate]))
            print("Parameters: {0}".format(results['params'][candidate]))
            print("")
            
# specify parameters and distributions to sample from
param_dist = {"max_depth": [3, None],
              "max_features": sp_randint(1,5),
              "min_samples_split": sp_randint(2, 11),
              "bootstrap": [True, False],
              "criterion": ["gini", "entropy"],
              "n_estimators": sp_randint(10,100)}

# run randomized search
n_iter_search = 30
random_search = RandomizedSearchCV(rf, param_distributions=param_dist,
                                   n_iter=n_iter_search, cv=5)

start = time()
random_search.fit(X, y)
print("RandomizedSearchCV took %.2f seconds for %d candidates"
      " parameter settings." % ((time() - start), n_iter_search))
report(random_search.cv_results_)
RandomizedSearchCV took 9.50 seconds for 30 candidates parameter settings.
Model with rank: 1
Mean validation score: 0.835 (std: 0.021)
Parameters: {'bootstrap': True, 'criterion': 'entropy', 'max_depth': None, 'max_features': 1, 'min_samples_split': 8, 'n_estimators': 74}

Model with rank: 2
Mean validation score: 0.831 (std: 0.029)
Parameters: {'bootstrap': True, 'criterion': 'gini', 'max_depth': None, 'max_features': 3, 'min_samples_split': 8, 'n_estimators': 76}

Model with rank: 3
Mean validation score: 0.824 (std: 0.020)
Parameters: {'bootstrap': True, 'criterion': 'entropy', 'max_depth': None, 'max_features': 3, 'min_samples_split': 7, 'n_estimators': 60}
# refit the best estimator on the full training set, then predict on the training data
rf = random_search.best_estimator_.fit(X, y)
rf.predict(X)
array([0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0,
       1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1,
       1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1,
       1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0,
       1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0,
       0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
       0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0,
       1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1,
       0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0,
       1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1,
       0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1,
       1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0,
       0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0,
       0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0,
       0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0,
       1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1,
       1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0,
       1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0,
       0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1,
       1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
       1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1,
       0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1,
       0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0,
       0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1,
       0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0,
       0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0,
       1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1,
       0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0,
       0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1,
       0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1,
       1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1,
       1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0], dtype=int64)
pd.DataFrame(rf.feature_importances_, index = X.columns
             , columns = ['Feature Importance']).sort_values(by = 'Feature Importance', ascending = False)
Feature Importance
Fare 0.330700
Age_Fill_Med 0.280745
Sex_male 0.280466
Pclass 0.108088
# once the test data has been transformed, run rf.predict() on the same columns used in training

Test model using the test dataset

Import Test data

Import test.csv and save it as the dataframe test_df.

# read_csv returns a DataFrame directly
test_df = pd.read_csv('test.csv')
test_df.head()
PassengerId Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 892 3 Kelly, Mr. James male 34.5 0 0 330911 7.8292 NaN Q
1 893 3 Wilkes, Mrs. James (Ellen Needs) female 47.0 1 0 363272 7.0000 NaN S
2 894 2 Myles, Mr. Thomas Francis male 62.0 0 0 240276 9.6875 NaN Q
3 895 3 Wirz, Mr. Albert male 27.0 0 0 315154 8.6625 NaN S
4 896 3 Hirvonen, Mrs. Alexander (Helga E Lindqvist) female 22.0 1 1 3101298 12.2875 NaN S
# Fill NAs with median of training set
test_df["Age_Fill_Med"] = test_df["Age"].fillna(med_age)
test_df["Fare"] = test_df["Fare"].fillna(fare_median)
# Convert Embarked and Sex columns to dummy variables
test_df = pd.get_dummies(test_df,columns=['Embarked','Sex'])
# Remove unnecessary columns
test_df.drop(['Sex_female', 'Embarked_C' ], axis = 1, inplace = True)
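
Because the columns produced by get_dummies depend on the categories present in the data, a quick sanity check (hypothetical, not in the original workflow) confirms the test frame contains the features the model expects:

# hypothetical sanity check: all model features must exist in test_df
model_features = ['Age_Fill_Med', 'Pclass', 'Sex_male', 'Fare']
assert all(col in test_df.columns for col in model_features)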
# Explore columns for model
df[['Age_Fill_Med','Pclass','Sex_male','Fare']].describe()
Age_Fill_Med Pclass Sex_male Fare
count 891.000000 891.000000 891.000000 891.000000
mean 29.361582 2.308642 0.647587 32.204208
std 13.019697 0.836071 0.477990 49.693429
min 0.420000 1.000000 0.000000 0.000000
25% 22.000000 2.000000 0.000000 7.910400
50% 28.000000 3.000000 1.000000 14.454200
75% 35.000000 3.000000 1.000000 31.000000
max 80.000000 3.000000 1.000000 512.329200
# Add survival prediction to the test_df dataframe
test_df['Survived'] = rf.predict(test_df[['Age_Fill_Med','Pclass','Sex_male','Fare']])
print('Train survival rate:  ' + str(round(df['Survived'].mean(), 4)))
print('Predicted test survival rate:  ' + str(round(test_df['Survived'].mean(), 4)))
Train survival rate:  0.3838
Predicted test survival rate:  0.3278

Create and save submission file

submission = test_df[['PassengerId', 'Survived']]
submission.head()
PassengerId Survived
0 892 0
1 893 0
2 894 0
3 895 0
4 896 1
# Save submission to csv; index=False keeps the two-column format Kaggle expects
# submission.to_csv('AR_Titanic_Submission.csv', index=False)

Conclusion

In conclusion, the most important features for predicting survival on the Titanic were fare, age, and sex, with passenger class contributing less.