Introduction

On April 10th, 1912, the RMS Titanic departed on her maiden voyage from Southampton to New York City. After stops in Cherbourg and Queenstown to pick up additional passengers, the ship headed for New York. There were about 2,224 people aboard, including roughly 892 crew members, which was well under the ship's capacity of around 3,300. The ship was equipped with watertight compartments designed to contain flooding in the event of a breach of the hull, a technology said to make the ship "unsinkable".

The Titanic was scheduled to arrive at New York's Pier 59 on the morning of April 17th. But at around 11:40 p.m. on the night of April 14th, the ship collided with an iceberg. The collision did not puncture the hull outright; instead, it weakened the seams of the hull plates, causing them to separate. Several watertight compartments flooded, and the ship eventually sank in the Atlantic Ocean about 400 miles off the coast of Newfoundland.

Distress signals were sent out, but no ship was near enough to reach the Titanic before she sank. At around 4 a.m., the RMS Carpathia arrived on the scene in response to the distress calls. The Titanic carried only 20 lifeboats, with capacity for just 1,178 people, and approximately 1,500 people lost their lives in the disaster. About 710 survivors were taken to New York aboard the Carpathia.

So, who were the passengers that survived this tragedy? Fortunately, we have data that can give us insight into who survived and who perished. The goal is to use machine learning methods to identify patterns in the data and predict which passengers survived and which did not.

The Dataset

The Titanic dataset comes from Kaggle. The training set contains 891 rows and 12 columns, with one row per passenger. The test set contains 418 rows and 11 columns (the Survived column is withheld for the sake of the competition).

The columns of the training set are:

  • PassengerId: Kaggle passenger id
  • Survived: 1 = passenger survived, 0 = passenger did not survive
  • Pclass: Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)
  • Name: Full name of passenger including title
  • Sex: Sex of passenger
  • Age: Age of passenger
  • SibSp: Number of siblings and spouses aboard the ship
  • Parch: Number of parents and children aboard the ship
  • Ticket: Ticket number
  • Fare: Passenger fare
  • Cabin: Cabin number
  • Embarked: Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

Goal of Analysis

Using the features above, the goal is to build a model that predicts the Survived value for each passenger in the test set.

Sources:

  • https://en.wikipedia.org/wiki/RMS_Titanic
  • https://www.kaggle.com/c/titanic/data

Import Data

Load the Standard Libraries

First, load the standard Python libraries.

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

%matplotlib inline

Next, we import the training data from the train.csv file. pd.read_csv already returns a pandas DataFrame, so we save the result as df and inspect the first few rows.

# import the training data; read_csv returns a pandas DataFrame
df = pd.read_csv('train.csv')

df.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

Data Exploration

Now that the data is loaded, we can begin exploring it. The data has already been split into training and test sets. We will explore the training set and act as if the test set is not available to us at this time. This gives us a more honest measure of our model's accuracy and helps avoid overfitting.

The describe method summarizes the numerical columns in the dataset. It also helps identify columns with null values; notice the count for Age compared to the other columns.

df.describe()
PassengerId Survived Pclass Age SibSp Parch Fare
count 891.000000 891.000000 891.000000 714.000000 891.000000 891.000000 891.000000
mean 446.000000 0.383838 2.308642 29.699118 0.523008 0.381594 32.204208
std 257.353842 0.486592 0.836071 14.526497 1.102743 0.806057 49.693429
min 1.000000 0.000000 1.000000 0.420000 0.000000 0.000000 0.000000
25% 223.500000 0.000000 2.000000 20.125000 0.000000 0.000000 7.910400
50% 446.000000 0.000000 3.000000 28.000000 0.000000 0.000000 14.454200
75% 668.500000 1.000000 3.000000 38.000000 1.000000 0.000000 31.000000
max 891.000000 1.000000 3.000000 80.000000 8.000000 6.000000 512.329200

The corr method calculates pairwise correlations between the numerical columns. (In newer versions of pandas, frames with text columns require df.corr(numeric_only=True).)

c = df.corr()
c
PassengerId Survived Pclass Age SibSp Parch Fare
PassengerId 1.000000 -0.005007 -0.035144 0.036847 -0.057527 -0.001652 0.012658
Survived -0.005007 1.000000 -0.338481 -0.077221 -0.035322 0.081629 0.257307
Pclass -0.035144 -0.338481 1.000000 -0.369226 0.083081 0.018443 -0.549500
Age 0.036847 -0.077221 -0.369226 1.000000 -0.308247 -0.189119 0.096067
SibSp -0.057527 -0.035322 0.083081 -0.308247 1.000000 0.414838 0.159651
Parch -0.001652 0.081629 0.018443 -0.189119 0.414838 1.000000 0.216225
Fare 0.012658 0.257307 -0.549500 0.096067 0.159651 0.216225 1.000000
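
Since seaborn is imported but not otherwise used, one optional way to read this matrix more easily is a heatmap (a small sketch, not part of the original output):

# render the correlation matrix as an annotated heatmap
sns.heatmap(c, annot=True, fmt='.2f', cmap='coolwarm')
plt.show()
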
# Get a count of all columns including text fields
df.count()
PassengerId    891
Survived       891
Pclass         891
Name           891
Sex            891
Age            714
SibSp          891
Parch          891
Ticket         891
Fare           891
Cabin          204
Embarked       889
dtype: int64
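
An equivalent, more direct view of the missing values is df.isnull().sum(), which counts nulls per column (a small optional check):

# missing values per column: Age, Cabin, and Embarked have nulls
df.isnull().sum()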

Explore Age

# Unique values for age
df.Age.unique()
array([22.  , 38.  , 26.  , 35.  ,   nan, 54.  ,  2.  , 27.  , 14.  ,
        4.  , 58.  , 20.  , 39.  , 55.  , 31.  , 34.  , 15.  , 28.  ,
        8.  , 19.  , 40.  , 66.  , 42.  , 21.  , 18.  ,  3.  ,  7.  ,
       49.  , 29.  , 65.  , 28.5 ,  5.  , 11.  , 45.  , 17.  , 32.  ,
       16.  , 25.  ,  0.83, 30.  , 33.  , 23.  , 24.  , 46.  , 59.  ,
       71.  , 37.  , 47.  , 14.5 , 70.5 , 32.5 , 12.  ,  9.  , 36.5 ,
       51.  , 55.5 , 40.5 , 44.  ,  1.  , 61.  , 56.  , 50.  , 36.  ,
       45.5 , 20.5 , 62.  , 41.  , 52.  , 63.  , 23.5 ,  0.92, 43.  ,
       60.  , 10.  , 64.  , 13.  , 48.  ,  0.75, 53.  , 57.  , 80.  ,
       70.  , 24.5 ,  6.  ,  0.67, 30.5 ,  0.42, 34.5 , 74.  ])
# mean survival rate by age decile
df.groupby(pd.qcut(df['Age'],10))['Survived'].mean().plot(figsize=(8,6), kind='bar', color='blue', alpha=.2)

Explore Pclass

df['Pclass'].value_counts(dropna = False)
3    491
1    216
2    184
Name: Pclass, dtype: int64

df.groupby('Pclass').describe()
Age Fare ... SibSp Survived
count mean std min 25% 50% 75% max count mean ... 75% max count mean std min 25% 50% 75% max
Pclass
1 186.0 38.233441 14.802856 0.92 27.0 37.0 49.0 80.0 216.0 84.154687 ... 1.0 3.0 216.0 0.629630 0.484026 0.0 0.0 1.0 1.0 1.0
2 173.0 29.877630 14.001077 0.67 23.0 29.0 36.0 70.0 184.0 20.662183 ... 1.0 3.0 184.0 0.472826 0.500623 0.0 0.0 0.0 1.0 1.0
3 355.0 25.140620 12.495398 0.42 18.0 24.0 32.0 74.0 491.0 13.675550 ... 1.0 8.0 491.0 0.242363 0.428949 0.0 0.0 0.0 0.0 1.0

3 rows × 48 columns

Survivors by Pclass

df.groupby('Pclass')['Survived'].mean()
Pclass
1    0.629630
2    0.472826
3    0.242363
Name: Survived, dtype: float64
df.groupby('Pclass')['Survived'].mean().plot(kind='bar',width=0.8)

Explore Embarked Location

df['Embarked'].value_counts(dropna = False)
S      644
C      168
Q       77
NaN      2
Name: Embarked, dtype: int64
pd.crosstab(df['Pclass'],df['Embarked'])
Embarked C Q S
Pclass
1 85 2 127
2 17 3 164
3 66 72 353

Survivors by Embarked Location

df.groupby('Embarked')['Survived'].mean()
Embarked
C    0.553571
Q    0.389610
S    0.336957
Name: Survived, dtype: float64
df.groupby('Embarked')['Survived'].mean().plot(kind='bar',width=0.8)

Feature Engineering

Age Binning

To make the age field more informative, we can group passengers into ten equal-sized age bins (deciles) and cross-tabulate them against other features.

df['age_bin'] = pd.qcut(df['Age'],10)
pd.crosstab(df['age_bin'],df['SibSp'])
SibSp 0 1 2 3 4 5
age_bin
(0.419, 14.0] 20 24 6 7 16 4
(14.0, 19.0] 60 20 3 1 2 1
(19.0, 22.0] 55 9 3 0 0 0
(22.0, 25.0] 48 15 5 2 0 0
(25.0, 28.0] 45 14 2 0 0 0
(28.0, 31.8] 47 18 0 1 0 0
(31.8, 36.0] 63 26 1 1 0 0
(36.0, 41.0] 34 17 2 0 0 0
(41.0, 50.0] 50 26 2 0 0 0
(50.0, 80.0] 49 14 1 0 0 0
pd.crosstab(df['Parch'],df['SibSp'])
SibSp 0 1 2 3 4 5 8
Parch
0 537 123 16 2 0 0 0
1 38 57 7 7 9 0 0
2 29 19 4 7 9 5 7
3 1 3 1 0 0 0 0
4 1 3 0 0 0 0 0
5 2 3 0 0 0 0 0
6 0 1 0 0 0 0 0
Since SibSp and Parch both count family members aboard, we combine them into a single family_members feature and examine survival by family size:

df['family_members'] = df['Parch'] + df['SibSp']
df.groupby('family_members')['Survived'].agg(['mean', 'size'])
mean size
family_members
0 0.303538 537
1 0.552795 161
2 0.578431 102
3 0.724138 29
4 0.200000 15
5 0.136364 22
6 0.333333 12
7 0.000000 6
10 0.000000 7
df.groupby('age_bin')['Survived'].mean()
age_bin
(0.419, 14.0]    0.584416
(14.0, 19.0]     0.390805
(19.0, 22.0]     0.283582
(22.0, 25.0]     0.371429
(25.0, 28.0]     0.393443
(28.0, 31.8]     0.393939
(31.8, 36.0]     0.483516
(36.0, 41.0]     0.358491
(41.0, 50.0]     0.397436
(50.0, 80.0]     0.343750
Name: Survived, dtype: float64

Replacing Null Values of Age

# Save median age of train set to use with transform of test set
med_age = df["Age"].median()

# Replace null values of age with median age
df["Age_Fill_Med"] = df["Age"].fillna(med_age)

Get Title From Name

# extract the title: the token after 'Lastname, ' up to the first space
df['Title'] = df['Name'].apply(lambda s: s.split(', ', 1)[1].split(' ', 1)[0])
pd.crosstab(df['Title'],df['family_members'])
family_members 0 1 2 3 4 5 6 7 10
Title
Capt. 0 0 1 0 0 0 0 0 0
Col. 2 0 0 0 0 0 0 0 0
Don. 1 0 0 0 0 0 0 0 0
Dr. 5 0 2 0 0 0 0 0 0
Jonkheer. 1 0 0 0 0 0 0 0 0
Lady. 0 1 0 0 0 0 0 0 0
Major. 2 0 0 0 0 0 0 0 0
Master. 0 3 15 4 2 9 3 3 1
Miss. 100 27 22 10 9 4 6 1 3
Mlle. 2 0 0 0 0 0 0 0 0
Mme. 1 0 0 0 0 0 0 0 0
Mr. 397 68 35 6 1 5 1 1 3
Mrs. 20 59 27 9 3 4 2 1 0
Ms. 1 0 0 0 0 0 0 0 0
Rev. 4 2 0 0 0 0 0 0 0
Sir. 0 1 0 0 0 0 0 0 0
the 1 0 0 0 0 0 0 0 0
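
Note the stray 'the' title, which appears to come from a name such as 'the Countess. of ...' being split on its first space. For reference, a regex-based extraction (a sketch, not part of the original workflow; it captures the title without the trailing period) is more robust:

# alternative: capture everything between the comma and the first period
df['Title_regex'] = df['Name'].str.extract(r',\s*([^.]*)\.', expand=False).str.strip()
df['Title_regex'].value_counts().head()
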
df.groupby('Title').describe()
Age Age_Fill_Med ... Survived family_members
count mean std min 25% 50% 75% max count mean ... 75% max count mean std min 25% 50% 75% max
Title
Capt. 1.0 70.000000 NaN 70.00 70.000 70.0 70.00 70.0 1.0 70.000000 ... 0.00 0.0 1.0 2.000000 NaN 2.0 2.0 2.0 2.00 2.0
Col. 2.0 58.000000 2.828427 56.00 57.000 58.0 59.00 60.0 2.0 58.000000 ... 0.75 1.0 2.0 0.000000 0.000000 0.0 0.0 0.0 0.00 0.0
Don. 1.0 40.000000 NaN 40.00 40.000 40.0 40.00 40.0 1.0 40.000000 ... 0.00 0.0 1.0 0.000000 NaN 0.0 0.0 0.0 0.00 0.0
Dr. 6.0 42.000000 12.016655 23.00 35.000 46.5 49.75 54.0 7.0 40.000000 ... 1.00 1.0 7.0 0.571429 0.975900 0.0 0.0 0.0 1.00 2.0
Jonkheer. 1.0 38.000000 NaN 38.00 38.000 38.0 38.00 38.0 1.0 38.000000 ... 0.00 0.0 1.0 0.000000 NaN 0.0 0.0 0.0 0.00 0.0
Lady. 1.0 48.000000 NaN 48.00 48.000 48.0 48.00 48.0 1.0 48.000000 ... 1.00 1.0 1.0 1.000000 NaN 1.0 1.0 1.0 1.00 1.0
Major. 2.0 48.500000 4.949747 45.00 46.750 48.5 50.25 52.0 2.0 48.500000 ... 0.75 1.0 2.0 0.000000 0.000000 0.0 0.0 0.0 0.00 0.0
Master. 36.0 4.574167 3.619872 0.42 1.000 3.5 8.00 12.0 40.0 6.916750 ... 1.00 1.0 40.0 3.675000 2.092569 1.0 2.0 3.0 5.00 10.0
Miss. 146.0 21.773973 12.990292 0.75 14.125 21.0 30.00 63.0 182.0 23.005495 ... 1.00 1.0 182.0 1.263736 1.999089 0.0 0.0 0.0 2.00 10.0
Mlle. 2.0 24.000000 0.000000 24.00 24.000 24.0 24.00 24.0 2.0 24.000000 ... 1.00 1.0 2.0 0.000000 0.000000 0.0 0.0 0.0 0.00 0.0
Mme. 1.0 24.000000 NaN 24.00 24.000 24.0 24.00 24.0 1.0 24.000000 ... 1.00 1.0 1.0 0.000000 NaN 0.0 0.0 0.0 0.00 0.0
Mr. 398.0 32.368090 12.708793 11.00 23.000 30.0 39.00 80.0 517.0 31.362669 ... 0.00 1.0 517.0 0.441006 1.154239 0.0 0.0 0.0 0.00 10.0
Mrs. 108.0 35.898148 11.433628 14.00 27.750 35.0 44.00 63.0 125.0 34.824000 ... 1.00 1.0 125.0 1.528000 1.347495 0.0 1.0 1.0 2.00 7.0
Ms. 1.0 28.000000 NaN 28.00 28.000 28.0 28.00 28.0 1.0 28.000000 ... 1.00 1.0 1.0 0.000000 NaN 0.0 0.0 0.0 0.00 0.0
Rev. 6.0 43.166667 13.136463 27.00 31.500 46.5 53.25 57.0 6.0 43.166667 ... 0.00 0.0 6.0 0.333333 0.516398 0.0 0.0 0.0 0.75 1.0
Sir. 1.0 49.000000 NaN 49.00 49.000 49.0 49.00 49.0 1.0 49.000000 ... 1.00 1.0 1.0 1.000000 NaN 1.0 1.0 1.0 1.00 1.0
the 1.0 33.000000 NaN 33.00 33.000 33.0 33.00 33.0 1.0 33.000000 ... 1.00 1.0 1.0 0.000000 NaN 0.0 0.0 0.0 0.00 0.0

17 rows × 72 columns

df['Title'].value_counts()>10
Mr.           True
Miss.         True
Mrs.          True
Master.       True
Dr.          False
Rev.         False
Mlle.        False
Col.         False
Major.       False
Jonkheer.    False
Mme.         False
Sir.         False
Don.         False
Ms.          False
Capt.        False
Lady.        False
the          False
Name: Title, dtype: bool
frequencies = df['Title'].value_counts()

# group titles that appear fewer than 10 times under 'Other'
condition = frequencies < 10
mask_obs = frequencies[condition].index
mask_dict = dict.fromkeys(mask_obs, 'Other')

df['New_Title'] = df['Title'].replace(mask_dict)
# median age per consolidated title, used to fill missing ages
title_dict = df.groupby('New_Title')['Age'].median().to_dict()
idx = df['Age'].isnull()
df['Age2'] = df['Age'].copy()
df.loc[idx, 'Age2'] = df.loc[idx, 'New_Title'].map(title_dict)
df.loc[idx, ['New_Title', 'Age', 'Age2']].head()
New_Title Age Age2
5 Mr. NaN 30.0
17 Mr. NaN 30.0
19 Mrs. NaN 35.0
26 Mr. NaN 30.0
28 Miss. NaN 21.0

Replace Null Values of Fare

The training set has no missing fares, but the test set does, so we save the training median to apply there later.

fare_median = df["Fare"].median()
fare_median
14.4542
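
Since the Fare count in the describe output above was a full 891, we can verify directly that no training fares are missing (a small optional check):

# verify: number of missing fares in the training set (expected 0)
df['Fare'].isnull().sum()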

Creating Dummy Variables for Embarked and Sex

from sklearn import preprocessing

# one-hot encode Embarked and Sex, dropping the first level of each
df = pd.get_dummies(df, columns=['Embarked', 'Sex'], drop_first=True)

# LabelEncoder shown for reference only; get_dummies already handles the encoding
le = preprocessing.LabelEncoder()
le.fit(["C", "Q", "S"])
list(le.classes_)
['C', 'Q', 'S']

Fit the Model

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
y = df['Survived']

X = df[['Age_Fill_Med','Pclass','Sex_male','Fare']]

Random Forest Model and Cross Validation

# Baseline random forest
rf = RandomForestClassifier(n_estimators=20, random_state=0)

# 10-fold cross-validation, with null ages replaced by the median age
s1 = cross_val_score(rf, X, y, cv=10)

print('X: '+'{:.2f}% +- {:.2f}%'.format(100*np.mean(s1),100*np.std(s1)))
X: 80.93% +- 4.03%

A randomized hyperparameter search will try many combinations of hyperparameters and report the best ones. With a random forest classifier, three of the main hyperparameters are max_depth, max_features, and min_samples_split.

Max depth is how deep each tree in the forest can grow, measured in levels of splits. For example, setting max_depth = 3 limits each tree to three levels of splits.

Max features is the number of features considered when searching for the best split at each node. Limiting the number of features helps prevent overfitting.

Min samples split is the minimum number of records required to split a node; the model will not split a node containing fewer records than this value.
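
For illustration, a single constrained configuration of these hyperparameters might look like this (the values here are arbitrary examples, not tuned):

# example: a deliberately shallow, regularized forest
rf_shallow = RandomForestClassifier(max_depth=3, max_features=2,
                                    min_samples_split=10, n_estimators=50,
                                    random_state=0)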

from sklearn.model_selection import RandomizedSearchCV
from time import time
from scipy.stats import randint as sp_randint

def report(results, n_top=3):
    for i in range(1, n_top + 1):
        candidates = np.flatnonzero(results['rank_test_score'] == i)
        for candidate in candidates:
            print("Model with rank: {0}".format(i))
            print("Mean validation score: {0:.3f} (std: {1:.3f})".format(
                  results['mean_test_score'][candidate],
                  results['std_test_score'][candidate]))
            print("Parameters: {0}".format(results['params'][candidate]))
            print("")
            
# specify parameters and distributions to sample from
param_dist = {"max_depth": [3, None],
              "max_features": sp_randint(1,5),
              "min_samples_split": sp_randint(2, 11),
              "bootstrap": [True, False],
              "criterion": ["gini", "entropy"],
              "n_estimators": sp_randint(10,100)}

# run randomized search
n_iter_search = 30
random_search = RandomizedSearchCV(rf, param_distributions=param_dist,
                                   n_iter=n_iter_search, cv=5)

start = time()
random_search.fit(X, y)
print("RandomizedSearchCV took %.2f seconds for %d candidates"
      " parameter settings." % ((time() - start), n_iter_search))
report(random_search.cv_results_)
RandomizedSearchCV took 9.50 seconds for 30 candidates parameter settings.
Model with rank: 1
Mean validation score: 0.835 (std: 0.021)
Parameters: {'bootstrap': True, 'criterion': 'entropy', 'max_depth': None, 'max_features': 1, 'min_samples_split': 8, 'n_estimators': 74}

Model with rank: 2
Mean validation score: 0.831 (std: 0.029)
Parameters: {'bootstrap': True, 'criterion': 'gini', 'max_depth': None, 'max_features': 3, 'min_samples_split': 8, 'n_estimators': 76}

Model with rank: 3
Mean validation score: 0.824 (std: 0.020)
Parameters: {'bootstrap': True, 'criterion': 'entropy', 'max_depth': None, 'max_features': 3, 'min_samples_split': 7, 'n_estimators': 60}
# refit the best estimator on the full training set, then predict on the training data
rf = random_search.best_estimator_.fit(X, y)
rf.predict(X)
array([0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0,
       1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1,
       1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1,
       1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0,
       1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0,
       0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
       0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0,
       1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1,
       0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0,
       1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1,
       0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1,
       1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0,
       0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0,
       0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0,
       0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0,
       1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1,
       1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0,
       1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0,
       0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1,
       1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
       1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1,
       0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1,
       0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0,
       0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1,
       0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0,
       0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0,
       1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1,
       0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0,
       0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1,
       0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1,
       1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1,
       1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0], dtype=int64)
pd.DataFrame(rf.feature_importances_, index = X.columns
             , columns = ['Feature Importance']).sort_values(by = 'Feature Importance', ascending = False)
Feature Importance
Fare 0.330700
Age_Fill_Med 0.280745
Sex_male 0.280466
Pclass 0.108088
# once the test data has been transformed, run rf.predict() on the same columns used in training

Test model using the test dataset

Import Test data

Import test.csv and save it as the dataframe test_df.

# read_csv returns a DataFrame directly
test_df = pd.read_csv('test.csv')
test_df.head()
PassengerId Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 892 3 Kelly, Mr. James male 34.5 0 0 330911 7.8292 NaN Q
1 893 3 Wilkes, Mrs. James (Ellen Needs) female 47.0 1 0 363272 7.0000 NaN S
2 894 2 Myles, Mr. Thomas Francis male 62.0 0 0 240276 9.6875 NaN Q
3 895 3 Wirz, Mr. Albert male 27.0 0 0 315154 8.6625 NaN S
4 896 3 Hirvonen, Mrs. Alexander (Helga E Lindqvist) female 22.0 1 1 3101298 12.2875 NaN S
# Fill NAs with median of training set
test_df["Age_Fill_Med"] = test_df["Age"].fillna(med_age)
test_df["Fare"] = test_df["Fare"].fillna(fare_median)
# Convert Embarked and Sex columns to dummy variables
test_df = pd.get_dummies(test_df,columns=['Embarked','Sex'])
# Remove unnecessary columns
test_df.drop(['Sex_female', 'Embarked_C' ], axis = 1, inplace = True)
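
Because the columns produced by get_dummies depend on the categories present in the data, a quick sanity check (hypothetical, not in the original workflow) confirms the test frame contains the features the model expects:

# hypothetical sanity check: all model features must exist in test_df
model_features = ['Age_Fill_Med', 'Pclass', 'Sex_male', 'Fare']
assert all(col in test_df.columns for col in model_features)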
# Explore columns for model
df[['Age_Fill_Med','Pclass','Sex_male','Fare']].describe()
Age_Fill_Med Pclass Sex_male Fare
count 891.000000 891.000000 891.000000 891.000000
mean 29.361582 2.308642 0.647587 32.204208
std 13.019697 0.836071 0.477990 49.693429
min 0.420000 1.000000 0.000000 0.000000
25% 22.000000 2.000000 0.000000 7.910400
50% 28.000000 3.000000 1.000000 14.454200
75% 35.000000 3.000000 1.000000 31.000000
max 80.000000 3.000000 1.000000 512.329200
# Add survival prediction to the test_df dataframe
test_df['Survived'] = rf.predict(test_df[['Age_Fill_Med','Pclass','Sex_male','Fare']])
print('Train survival rate:  ' + str(round(df['Survived'].mean(), 4)))
print('Predicted test survival rate:  ' + str(round(test_df['Survived'].mean(), 4)))
Train survival rate:  0.3838
Predicted test survival rate:  0.3278

Create and save submission file

submission = test_df[['PassengerId', 'Survived']]
submission.head()
PassengerId Survived
0 892 0
1 893 0
2 894 0
3 895 0
4 896 1
# Save submission to csv; index=False keeps the two-column format Kaggle expects
# submission.to_csv('AR_Titanic_Submission.csv', index=False)

Conclusion

In conclusion, the most important features for predicting survival on the Titanic were fare, age, and sex, with passenger class contributing less.