25 April 2016

## Context

There is a well-known “Getting Started” machine learning competition on Kaggle called Titanic: Machine Learning from Disaster. It exists so that we can experiment with the data and with different algorithms, and measure our progress against benchmarks. We are given data about the passengers of the Titanic, and our goal is to predict which passengers survived the tragedy.

There are multiple useful tutorials out there discussing the same problem and dataset [1] [2]. This tutorial is geared towards people who are already familiar with R and want to learn some machine learning concepts without dealing with too many technical details.

In part 1, we will get to know the data a little and prepare it for further analysis.

The data is given as two separate files, one for training and one for testing. Our goal is to predict the `Survived` variable for the test dataset. We will use the training set to learn from the data.

I have moved my user-defined functions to a `library.R` file to keep the code here clean. You can check it out in the GitHub repository for this project.

I’m using the `printr` package for better-looking printed output. You can install it if you like the output.

So far, we have put together the training and test data, converted the categorical variables into `factor`, and then separated the data sets. It is easier to apply the transformations this way, rather than doing the same thing twice on different data sets. Finally, we summarized the quantitative and categorical data to get some sense of the training data.
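The combine, convert, and split steps described above can be sketched roughly as follows. This is a minimal illustration, not the code from `library.R`: the two small data frames are made-up stand-ins for the Kaggle training and test files, and the names `full`, `train_clean`, and `test_clean` are hypothetical.

```r
# Toy stand-ins for the training and test files (made-up rows, not real passengers)
train <- data.frame(
  PassengerId = 1:4,
  Survived = c(0, 1, 1, 0),
  Pclass = c(3, 1, 3, 2),
  Sex = c("male", "female", "female", "male"),
  Embarked = c("S", "C", "S", "Q"),
  stringsAsFactors = FALSE
)
test <- data.frame(
  PassengerId = 5:6,
  Pclass = c(2, 3),
  Sex = c("female", "male"),
  Embarked = c("S", "S"),
  stringsAsFactors = FALSE
)

# The test set has no Survived column, so add it as NA before stacking
test$Survived <- NA
full <- rbind(train, test[, names(train)])

# Convert the categorical variables to factors once, on the combined data
full$Survived <- factor(full$Survived, levels = c(0, 1))
full$Pclass <- factor(full$Pclass, levels = 1:3, labels = paste0("Class.", 1:3))
full$Sex <- factor(full$Sex)
full$Embarked <- factor(full$Embarked)

# Split back into the two sets using the original row count
n_train <- nrow(train)
train_clean <- full[seq_len(n_train), ]
test_clean <- full[-seq_len(n_train), ]

summary(train_clean)
```

Converting on the combined data also guarantees that both sets share the same factor levels, which matters later when a model fitted on one set is used to predict on the other.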

### Variable Definitions

The following definitions are given at the competition website:

Variable Definitions

• survival: Survival (0 = No; 1 = Yes)
• Pclass: Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
• name: Name
• sex: Sex
• age: Age
• sibsp: Number of Siblings/Spouses Aboard
• parch: Number of Parents/Children Aboard
• ticket: Ticket Number
• fare: Passenger Fare
• cabin: Cabin
• embarked: Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)

Special Notes

• Pclass is a proxy for socio-economic status (SES): 1st ~ Upper; 2nd ~ Middle; 3rd ~ Lower

• Age is in Years; Fractional if Age less than One (1). If the Age is Estimated, it is in the form xx.5

• With respect to the family relation variables (i.e. sibsp and parch) some relations were ignored. The following are the definitions used for sibsp and parch.

• Sibling: Brother, Sister, Stepbrother, or Stepsister of Passenger Aboard Titanic
• Spouse: Husband or Wife of Passenger Aboard Titanic (Mistresses and Fiances Ignored)
• Parent: Mother or Father of Passenger Aboard Titanic
• Child: Son, Daughter, Stepson, or Stepdaughter of Passenger Aboard Titanic

• Other family relatives excluded from this study include cousins, nephews/nieces, aunts/uncles, and in-laws.
• Some children travelled only with a nanny, therefore parch=0 for them.
• As well, some travelled with very close friends or neighbors in a village, however, the definitions do not support such relations.

### Findings

• Most people travelled in Class 3.
• Most passengers were male.
• For the passengers with `Age` specified, 50% are older than 28. We have a lot of missing values in the `Age` variable.
• Most passengers did not travel with their spouse or siblings on board.
• Most passengers did not have their parents or children on board.
• 75% of the passengers paid less than \$31 for fare, and the maximum fare paid was \$512.
• Most passengers boarded at the port coded “S” (Southampton).

## Exploratory Data Analysis (EDA)

The next logical step is to do some exploratory analysis to get more familiar with the data at hand. Let’s take a look at different groups and their survival rate.

| female | male |
|--------|-------|
| 35.24 | 64.76 |

| Class.1 | Class.2 | Class.3 |
|---------|---------|---------|
| 24.24 | 20.65 | 55.11 |

The majority of passengers are men (65%), and passengers are split across three classes: 1 (24%), 2 (21%), and 3 (55%).

### Survival Rate

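| Survived | 0 | 1 |
|----------|-------|-------|
| Count | 549 | 342 |
| Percent | 61.62 | 38.38 |

| Sex/Survived | 0 | 1 |
|--------------|-------|-------|
| female | 25.80 | 74.20 |
| male | 81.11 | 18.89 |

| Pclass/Survived | 0 | 1 |
|-----------------|-------|-------|
| Class.1 | 37.04 | 62.96 |
| Class.2 | 52.72 | 47.28 |
| Class.3 | 75.76 | 24.24 |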

Most people didn’t survive (62%), but the survival rate is not the same across groups. Females had a higher chance of survival: 74%, compared to 19% for males. Class 1 passengers had a 63% chance of survival, compared to 47% and 24% for classes 2 and 3, respectively.
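Row percentages like the ones above can be computed with base R's `table()` and `prop.table()`. The snippet below is only a sketch on a made-up data frame (`toy` is a hypothetical name and the rows are not real passengers), so its numbers will not match the real tables.

```r
# Made-up data: 4 females (3 survived) and 6 males (1 survived)
toy <- data.frame(
  Sex = c("female", "female", "female", "female",
          "male", "male", "male", "male", "male", "male"),
  Survived = c(1, 1, 1, 0, 0, 0, 0, 0, 1, 0)
)

# Cross-tabulate the counts, then convert each row to percentages
counts <- table(toy$Sex, toy$Survived, dnn = c("Sex", "Survived"))
row_pct <- round(prop.table(counts, margin = 1) * 100, 2)
row_pct
```

With `margin = 1` each row sums to 100, so every row reads as "of the passengers in this group, what percentage did or did not survive".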

We see that the survival rate differs across classes, but we’re not sure yet whether this is simply the result of different proportions of females in the different passenger classes. Let’s check if this is the case.

| Pclass/Sex | female | male |
|------------|-----------|-----------|
| 1 | 0.4351852 | 0.5648148 |
| 2 | 0.4130435 | 0.5869565 |
| 3 | 0.2932790 | 0.7067210 |

The classes with a better survival rate have a higher proportion of females.

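| Pclass | Sex | Survived | Freq |
|--------|--------|----------|-------|
| 1 | female | 0 | 3.19 |
| 1 | female | 1 | 96.81 |
| 1 | male | 0 | 63.11 |
| 1 | male | 1 | 36.89 |
| 2 | female | 0 | 7.89 |
| 2 | female | 1 | 92.11 |
| 2 | male | 0 | 84.26 |
| 2 | male | 1 | 15.74 |
| 3 | female | 0 | 50.00 |
| 3 | female | 1 | 50.00 |
| 3 | male | 0 | 86.46 |
| 3 | male | 1 | 13.54 |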

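The `Freq` [3] column shows the survival percentage for passengers of the same sex and passenger class. Within every class, women survived at a much higher rate than men, so part of the difference in survival between classes may be related to the percentage of women in each class. Class still matters on its own, though: 50% of class 3 women survived, compared to more than 90% of women in classes 1 and 2. Male passengers were at a disadvantage across the board.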
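A three-way breakdown of this kind can be sketched with `table()` and `prop.table()` using two margins. The data frame `toy3` below is hypothetical and its rows are made up, so the numbers are for illustration only.

```r
# Made-up data: two classes, two passengers per class/sex combination
toy3 <- data.frame(
  Pclass = c(1, 1, 1, 1, 3, 3, 3, 3),
  Sex = c("female", "female", "male", "male",
          "female", "female", "male", "male"),
  Survived = c(1, 1, 1, 0, 1, 0, 0, 0)
)

counts3 <- table(toy3$Pclass, toy3$Sex, toy3$Survived,
                 dnn = c("Pclass", "Sex", "Survived"))

# margin = c(1, 2): percentages sum to 100 within each Pclass/Sex combination
pct3 <- round(prop.table(counts3, margin = c(1, 2)) * 100, 2)

ftable(pct3)                   # compact nested layout
long <- as.data.frame(pct3)    # long layout; the value column is named "Freq"
long
```

Fixing both `Pclass` and `Sex` as margins is what makes each pair of rows comparable: every percentage is conditional on class and sex together, which separates the two effects.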

### Data Partitioning

The data is partitioned into 68% for the training data set and 32% for the test data set.

## Graphical Analysis

Although more than 10% of the values of the `Age` variable are missing in the training data, we can still see some patterns regarding the effect of a passenger’s age on survival. Specifically, kids and teenagers had a better chance of survival, while the elderly were at a disadvantage.
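The same pattern can be checked numerically by grouping passengers into age bands and computing the survival rate in each band. The sketch below uses made-up ages and outcomes, and the band cutoffs (12, 19, 59) are arbitrary assumptions chosen for illustration, not values taken from the analysis above.

```r
# Made-up ages (with missing values) and survival outcomes
age <- c(4, 15, 22, 35, NA, 67, 8, NA, 45, 71)
survived <- c(1, 1, 0, 1, 0, 0, 1, 1, 0, 0)

# Share of passengers with a missing age
missing_share <- mean(is.na(age))

# Hypothetical age bands: (0,12], (12,19], (19,59], (59,Inf]
age_group <- cut(age,
                 breaks = c(0, 12, 19, 59, Inf),
                 labels = c("child", "teen", "adult", "elderly"))

# Survival rate per band; passengers with a missing age fall out of the grouping
rate <- tapply(survived, age_group, mean)
rate
```

Passengers with a missing `Age` are silently excluded here, so any rate computed this way describes only the passengers whose age is known.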

## Statistical Analysis

We saw that fare may have some effect on the survival rate. Let’s see if the effect is real.

| Survived | Pclass.factor | Fare |
|----------|---------------|----------|
| 0 | Class.1 | 64.68401 |
| 1 | Class.1 | 95.60803 |
| 0 | Class.2 | 19.41233 |
| 1 | Class.2 | 22.05570 |
| 0 | Class.3 | 13.66936 |
| 1 | Class.3 | 13.69489 |

So, we see that in class 1, the difference in fare between passengers who survived and those who didn’t is statistically significant. Maybe the rich found a way to buy lifeboats :).
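One common way to check a difference in group means like this is Welch's two-sample t-test, which is the default behaviour of `t.test()` in R. Whether that is the test behind the result above is an assumption. The sketch below runs it on simulated fares, not on the real Titanic values, and the group sizes and distribution parameters are arbitrary.

```r
set.seed(42)

# Simulated class 1 fares (NOT real data): 100 non-survivors, 100 survivors
class1 <- data.frame(
  Survived = factor(rep(c(0, 1), each = 100)),
  Fare = c(rlnorm(100, meanlog = 4.0, sdlog = 0.6),
           rlnorm(100, meanlog = 4.4, sdlog = 0.6))
)

# Group means, in the same spirit as the summary table above
aggregate(Fare ~ Survived, data = class1, FUN = mean)

# Welch two-sample t-test comparing mean fare between the two groups
result <- t.test(Fare ~ Survived, data = class1)
result$p.value
```

A small p-value suggests that the gap in mean fare is unlikely to be due to chance alone. Fares are strongly right-skewed, so a rank-based alternative such as `wilcox.test()` is also worth considering.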

In part 2, we will start doing machine learning and submit our first prediction to Kaggle!

## References and Footnotes

1. http://trevorstephens.com/post/72916401642/titanic-getting-started-with-r

2. https://campus.datacamp.com/courses/kaggle-r-tutorial-on-machine-learning/chapter-1-raising-anchor?ex=1

3. I know it’s not a proper name for the column. It is the default column name used by the `printr` package when the table is represented as a data frame.