25 April 2016

Tutorial Table of Contents


Context

There is a famous “Getting Started” machine learning competition on Kaggle, called Titanic: Machine Learning from Disaster. It is just there for us to experiment with the data and the different algorithms and to measure our progress against benchmarks. We are given the data about passengers of Titanic. Our goal is to predict which passengers survived the tragedy.

There are multiple useful tutorials out there discussing the same problem and dataset 1 2. This tutorial is geared towards people who are already familiar with R willing to learn some machine learning concepts, without dealing with too much technical details.

In part 1, we will know the data a little bit and prepare it for further analysis.

Loading Data & Initial Analysis

Data is given as two separate files for training and test. Our goal is to predict Survived variable for the test dataset. We will use the training set to learn from the data.

I have moved my user-defined functions to library.R file to keep the code clean here. You can check it out at the GitHub repository for this project.

1
source("library.R")

I’m using the printr package for a better-looking print output. You can download it if you liked the output.

1
library(printr)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
# Load data
train.data <- read.csv("input/train.csv", stringsAsFactors = F)
test.data <- read.csv("input/test.csv", stringsAsFactors = F)

# Keep the outcome in a separate variable
train.outcome <- train.data$Survived
train.outcome <- factor(train.outcome)

# Remove "Survived" column
train.data <- train.data[,-2]

train.len <- nrow(train.data)
test.len <- nrow(test.data)

# Combine training & testing data
full.data <- rbind(train.data, test.data)

# Create factor version of categorical vars
full.data$Pclass.factor <- factor(full.data$Pclass)
levels(full.data$Pclass.factor) <- paste0("Class.", levels(full.data$Pclass.factor))

full.data$Sex.factor <- factor(full.data$Sex)
full.data$Embarked.factor <- factor(full.data$Embarked)
1
str(full.data)
## 'data.frame':	1309 obs. of  14 variables:
##  $ PassengerId    : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Pclass         : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Name           : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
##  $ Sex            : chr  "male" "female" "female" "female" ...
##  $ Age            : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp          : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch          : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket         : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
##  $ Fare           : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin          : chr  "" "C85" "" "C123" ...
##  $ Embarked       : chr  "S" "C" "S" "S" ...
##  $ Pclass.factor  : Factor w/ 3 levels "Class.1","Class.2",..: 3 1 3 1 3 3 1 3 3 2 ...
##  $ Sex.factor     : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
##  $ Embarked.factor: Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
1
2
3
4
5
6
# Split data back into training and testing
train.data <- full.data[1:train.len, ]
test.data <- full.data[(train.len+1):nrow(full.data), ]

# Check the summary of quantitative and categorical variables
sapply(train.data[, -c(1, 3, 8, 10)], summary)
## $Pclass
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   3.000   2.309   3.000   3.000 
## 
## $Sex
##    Length     Class      Mode 
##       891 character character 
## 
## $Age
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.42   20.12   28.00   29.70   38.00   80.00     177 
## 
## $SibSp
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   0.523   1.000   8.000 
## 
## $Parch
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.3816  0.0000  6.0000 
## 
## $Fare
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    7.91   14.45   32.20   31.00  512.30 
## 
## $Embarked
##    Length     Class      Mode 
##       891 character character 
## 
## $Pclass.factor
## Class.1 Class.2 Class.3 
##     216     184     491 
## 
## $Sex.factor
## female   male 
##    314    577 
## 
## $Embarked.factor
##       C   Q   S 
##   2 168  77 644
1
2
# Add the outcome column
train.data$Survived <- train.outcome

So far, we have put together the training and test data, converted the categorical variables into factor, and then separated the data sets. It is easier to apply the transformations this way, rather than doing the same thing twice on different data sets. Finally, we summarized the quantitative and categorical data to get some sense of the training data.

Variable Definitions

The following definitions are given at the competition website:

Variable Definitions

  • survival: Survival (0 = No; 1 = Yes)
  • Pclass: Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
  • name: Name
  • sex: Sex
  • age: Age
  • sibsp: Number of Siblings/Spouses Aboard
  • parch: Number of Parents/Children Aboard
  • ticket: Ticket Number
  • fare: Passenger Fare
  • cabin: Cabin
  • embarked: Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)

Special Notes

  • Pclass is a proxy for socio-economic status (SES): 1st ~ Upper; 2nd ~ Middle; 3rd ~ Lower

  • Age is in Years; Fractional if Age less than One (1). If the Age is Estimated, it is in the form xx.5

  • With respect to the family relation variables (i.e. sibsp and parch) some relations were ignored. The following are the definitions used for sibsp and parch.

  • Sibling: Brother, Sister, Stepbrother, or Stepsister of Passenger Aboard Titanic
  • Spouse: Husband or Wife of Passenger Aboard Titanic (Mistresses and Fiances Ignored)
  • Parent: Mother or Father of Passenger Aboard Titanic
  • Child: Son, Daughter, Stepson, or Stepdaughter of Passenger Aboard Titanic

  • Other family relatives excluded from this study include cousins, nephews/nieces, aunts/uncles, and in-laws.
  • Some children travelled only with a nanny, therefore parch=0 for them.
  • As well, some travelled with very close friends or neighbors in a village, however, the definitions do not support such relations.

Findings

  • Most people travelled in Class 3.
  • Most passengers were male.
  • For the passengers with Age specified, 50% are older than 28. We have a lot of missing values in Age variable.
  • Most passengers did not travel with their spouse or siblings on board.
  • Most passengers do not have their parents or children on board.
  • 75% of the passengers paid less than $31 for fare, and the maximum fare paid was $512.
  • Most passengers came on board on the port specified by “S” (Southampton).

Exploratory Data Analysis (EDA)

The next logical step is to do some exploratory analysis to get more familiar with the data at hand. Let’s take a look at different groups and their survival rate.

1
2
3
4
prop.sex <- prop.table(table(train.data$Sex.factor, useNA = "ifany"))
prop.class <-  prop.table(table(train.data$Pclass.factor, useNA = "ifany"))

percent(prop.sex, digits = 2)
female male
35.24 64.76
1
percent(prop.class, digits = 2)
Class.1 Class.2 Class.3
24.24 20.65 55.11

Majority of passengers are men (65%) and passengers have 3 different classes, 1 (24%), 2 (21%), or 3 (55%).

Survival Rate

1
table(train.data$Survived, useNA = "ifany")
0 1
549 342
1
2
prop.survived <- prop.table(table(train.data$Survived, useNA = "ifany"))
percent(prop.survived, digits = 2)
0 1
61.62 38.38
1
2
3
4
prop.sex.survived <- prop.table(table(Sex= train.data$Sex.factor, Survived= train.data$Survived), margin = 1)
prop.class.survived <- prop.table(table(Pclass= train.data$Pclass.factor, Survived= train.data$Survived), margin = 1)

percent(prop.sex.survived, digits = 2)
Sex/Survived 0 1
female 25.80 74.20
male 81.11 18.89
1
percent(prop.class.survived, digits = 2)
Pclass/Survived 0 1
Class.1 37.04 62.96
Class.2 52.72 47.28
Class.3 75.76 24.24

Most people didn’t survive (62%). But the survival rate is not the same across different groups. Females had higher chance of survival, 74% as compared to 19% for men. Class 1 passengers had 63% chance of survival, compared to 47% and 24% for class 2 and 3, respectively.

We see that survival rate is different across different classes, but we’re not sure yet if this is the result of different proportion of females across different passenger classes. Let’s check if this is the case.

1
prop.table(table(Pclass= train.data$Pclass, Sex= train.data$Sex), margin = 1)
Pclass/Sex female male
1 0.4351852 0.5648148
2 0.4130435 0.5869565
3 0.2932790 0.7067210

The classes with a better survival rate have higher proportion of females.

1
2
prop.class.sex.survived <- prop.table(table(Pclass= train.data$Pclass, Sex= train.data$Sex.factor, Survived= train.data$Survived), margin = 1:2)
percent(prop.class.sex.survived, digits = 2)
Pclass Sex Survived Freq
1 female 0 3.19
    1 96.81
  male 0 63.11
    1 36.89
2 female 0 7.89
    1 92.11
  male 0 84.26
    1 15.74
3 female 0 50.00
    1 50.00
  male 0 86.46
    1 13.54

The Freq 3 column shows survival proportion of people with the same gender and passenger class. The survival rate in different classes may have some relationship to percentage of women in those classes. Obviously, the male passengers have a disadvantage across the board.

Data Partitioning

1
percent(train.len/(train.len + test.len), digits = 2)
## [1] 68.07
1
percent(test.len/(train.len + test.len), digits = 2)
## [1] 31.93

The data is partitioned into 68% for training dataset and 32% for test data set.

Graphical Analysis

1
library(ggplot2)
1
2
3
ggplot(train.data, aes(Sex.factor, fill= Survived)) +
    geom_bar(stat = "bin", position = "stack") +
    ggtitle("Women have higher chance of survival")

plot of chunk ga-2

1
2
3
ggplot(train.data, aes(x=Pclass.factor, fill= Survived)) +
    geom_bar(stat = "bin", position = "stack") +
    ggtitle("In class 1 most people survived and in class 3 most did not")

plot of chunk ga-2

1
2
3
4
ggplot(train.data, aes(x=Pclass.factor, fill= Survived))+
    geom_bar(stat = "bin", position = "stack") +
    facet_wrap(~Sex.factor) +
    ggtitle("Most women survived across passenger classes")

plot of chunk ga-2

1
2
3
4
ggplot(train.data, aes(x= Pclass.factor, y= Age, color= Survived)) +
    geom_violin() +
    geom_jitter(alpha= .5) +
    ggtitle("Age Matters!")
## Warning: Removed 177 rows containing non-finite values
## (stat_ydensity).
## Warning: Removed 177 rows containing missing values (geom_point).

plot of chunk ga-2

Although we have more that 10% missing values for Age variable in the training data, we can still see some patterns regarding the effect of passenger’s age on survival. Specifically, kids and teenagers had a better chance of survival while elderly were at a disadvantage.

1
2
3
4
ggplot(train.data, aes(x= Sex.factor, y= Age, color= Survived)) +
    geom_jitter() +
    geom_violin() +
    facet_wrap(~Pclass.factor)
## Warning: Removed 30 rows containing non-finite values (stat_ydensity).
## Warning: Removed 11 rows containing non-finite values (stat_ydensity).
## Warning: Removed 136 rows containing non-finite values
## (stat_ydensity).
## Warning: Removed 30 rows containing missing values (geom_point).
## Warning: Removed 11 rows containing missing values (geom_point).
## Warning: Removed 136 rows containing missing values (geom_point).

plot of chunk ga-3

1
2
3
4
ggplot(train.data, aes(y=Fare, x= Survived)) +
    geom_boxplot() +
    facet_wrap(~Pclass.factor, ncol = 3, scales = "free") +
    ggtitle("People who survived paid higher fare on average")

plot of chunk ga-3

Statistical Analysis

We saw that fare may have some effect on the survival rate. Let’s see if the effect is real.

1
aggregate(Fare ~ Survived + Pclass.factor, data=train.data, mean)
Survived Pclass.factor Fare
0 Class.1 64.68401
1 Class.1 95.60803
0 Class.2 19.41233
1 Class.2 22.05570
0 Class.3 13.66936
1 Class.3 13.69489
1
t.test(Fare ~ Survived, data= train.data, subset = train.data$Pclass.factor == "Class.1")$conf.int
## [1] -50.58826 -11.25978
## attr(,"conf.level")
## [1] 0.95
1
t.test(Fare ~ Survived, data= train.data, subset = train.data$Pclass.factor == "Class.2")$conf.int
## [1] -6.475511  1.188767
## attr(,"conf.level")
## [1] 0.95
1
t.test(Fare ~ Survived, data= train.data, subset = train.data$Pclass.factor == "Class.3")$conf.int
## [1] -2.319980  2.268933
## attr(,"conf.level")
## [1] 0.95

So, we see that in class 1, the difference in fare is statistiscally significant between passengers who survived and who didn’t . Maybe the rich found a way to buy lifeboats :).

In part 2, we will start doing machine learning and submit our first prediction to Kaggle!

References and Footnotes

  1. http://trevorstephens.com/post/72916401642/titanic-getting-started-with-r 

  2. https://campus.datacamp.com/courses/kaggle-r-tutorial-on-machine-learning/chapter-1-raising-anchor?ex=1 

  3. I know it’s not a proper name for the column. It is the default name of the column used by printr package when the table is represented as a data frame. 



blog comments powered by Disqus