There is a famous “Getting Started” machine learning competition on Kaggle, called Titanic: Machine Learning from Disaster. It is there for us to experiment with the data and different algorithms, and to measure our progress against benchmarks. We are given data about the passengers of the Titanic. Our goal is to predict which passengers survived the tragedy.
There are multiple useful tutorials out there discussing the same problem and dataset [1][2]. This tutorial is geared towards people who are already familiar with R and want to learn some machine learning concepts without dealing with too many technical details.
In part 1, we will get to know the data a little and prepare it for further analysis.
Loading Data & Initial Analysis
The data is given as two separate files, one for training and one for testing. Our goal is to predict the Survived variable for the test dataset. We will use the training set to learn from the data.
I have moved my user-defined functions to a library.R file to keep the code clean here. You can check it out at the GitHub repository for this project.
source("library.R")
I’m using the printr package for better-looking print output. You can download it if you like the output.
# Load data
train.data <- read.csv("input/train.csv", stringsAsFactors = FALSE)
test.data <- read.csv("input/test.csv", stringsAsFactors = FALSE)

# Keep the outcome in a separate variable
train.outcome <- train.data$Survived
train.outcome <- factor(train.outcome)

# Remove "Survived" column
train.data <- train.data[, -2]

train.len <- nrow(train.data)
test.len <- nrow(test.data)

# Combine training & testing data
full.data <- rbind(train.data, test.data)

# Create factor version of categorical vars
full.data$Pclass.factor <- factor(full.data$Pclass)
levels(full.data$Pclass.factor) <- paste0("Class.", levels(full.data$Pclass.factor))
full.data$Sex.factor <- factor(full.data$Sex)
full.data$Embarked.factor <- factor(full.data$Embarked)
str(full.data)
# Split data back into training and testing
train.data <- full.data[1:train.len, ]
test.data <- full.data[(train.len + 1):nrow(full.data), ]

# Check the summary of quantitative and categorical variables
sapply(train.data[, -c(1, 3, 8, 10)], summary)
# Add the outcome column
train.data$Survived <- train.outcome
So far, we have combined the training and test data, converted the categorical variables into factors, and then separated the data sets again. It is easier to apply the transformations this way than to do the same thing twice on different data sets. Finally, we summarized the quantitative and categorical variables to get some sense of the training data.
Variable Definitions
The following definitions are given at the competition website:
embarked: Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
Special Notes
Pclass is a proxy for socio-economic status (SES): 1st ~ Upper; 2nd ~ Middle; 3rd ~ Lower
Age is in Years; Fractional if Age less than One (1). If the Age is Estimated, it is in the form xx.5
With respect to the family relation variables (i.e. sibsp and parch) some relations were ignored. The following are the definitions used for sibsp and parch.
Sibling: Brother, Sister, Stepbrother, or Stepsister of Passenger Aboard Titanic
Spouse: Husband or Wife of Passenger Aboard Titanic (Mistresses and Fiances Ignored)
Parent: Mother or Father of Passenger Aboard Titanic
Child: Son, Daughter, Stepson, or Stepdaughter of Passenger Aboard Titanic
Other family relatives excluded from this study include cousins, nephews/nieces, aunts/uncles, and in-laws.
Some children travelled only with a nanny, therefore parch=0 for them.
As well, some travelled with very close friends or neighbors in a village, however, the definitions do not support such relations.
Findings
Most people travelled in Class 3.
Most passengers were male.
For the passengers with Age specified, 50% are older than 28. We have a lot of missing values in the Age variable.
Most passengers did not travel with their spouse or siblings on board.
Most passengers did not have their parents or children on board.
75% of the passengers paid less than $31 for fare, and the maximum fare paid was $512.
Most passengers boarded at the port specified by “S” (Southampton).
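The missing values mentioned above can also be counted directly instead of being read off the summary. The following is a minimal sketch: `count.missing` is a hypothetical helper name, and `toy` is made-up illustration data, not the Kaggle file.

```r
# Hypothetical helper: number and share of missing values per column.
count.missing <- function(data) {
  n.missing <- colSums(is.na(data))
  data.frame(
    variable = names(n.missing),
    missing = as.integer(n.missing),
    share = as.numeric(n.missing) / nrow(data),
    row.names = NULL,
    stringsAsFactors = FALSE
  )
}

# Toy rows for illustration only (not the Kaggle data)
toy <- data.frame(
  Age = c(22, NA, 35, NA),
  Fare = c(7.25, 71.28, 8.05, 8.46)
)
missing.summary <- count.missing(toy)
print(missing.summary)
```

On the tutorial's data the call would be count.missing(train.data). Note that blank fields in character columns are read by read.csv as empty strings rather than NA, so they are not counted here unless na.strings is set when loading.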
Exploratory Data Analysis (EDA)
The next logical step is to do some exploratory analysis to get more familiar with the data at hand. Let’s take a look at different groups and their survival rate.
Most people didn’t survive (62%). But the survival rate is not the same across different groups. Females had a higher chance of survival: 74%, compared to 19% for men. Class 1 passengers had a 63% chance of survival, compared to 47% and 24% for classes 2 and 3, respectively.
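Group-wise rates like these can be computed with table and prop.table. The sketch below is one way to do it, not necessarily how the numbers above were produced: `survival.rate` is a hypothetical helper, and `toy` is made-up illustration data.

```r
# Hypothetical helper: share of survivors within each level of a grouping variable.
# `survived` is coded 0/1 (numeric or factor), as in the Kaggle training file.
survival.rate <- function(survived, group) {
  counts <- table(group, survived)
  # margin = 1 makes each row (group) sum to 1
  prop.table(counts, margin = 1)[, "1"]
}

# Toy rows for illustration only (not the Kaggle data)
toy <- data.frame(
  Sex = c("female", "female", "male", "male", "male", "female"),
  Survived = c(1, 1, 0, 0, 1, 0)
)
overall <- prop.table(table(toy$Survived))
by.sex <- survival.rate(toy$Survived, toy$Sex)
print(round(overall, 2))
print(round(by.sex, 2))
```

With the tutorial's objects, the equivalent calls would be survival.rate(train.data$Survived, train.data$Sex.factor) and survival.rate(train.data$Survived, train.data$Pclass.factor).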
We see that the survival rate differs across passenger classes, but we’re not sure yet whether this is the result of different proportions of females in those classes. Let’s check if this is the case.
The Freq column [3] shows the survival proportion of people with the same gender and passenger class. The survival rate in different classes may have some relationship to the percentage of women in those classes. Clearly, male passengers were at a disadvantage across the board.
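A table of survival proportions by gender and class can be built as sketched below. This is an assumption about the approach, since the original table code is not shown here, and `toy` is made-up illustration data. Converting a table with as.data.frame names the value column Freq by default.

```r
# Toy rows for illustration only (not the Kaggle data)
toy <- data.frame(
  Sex = c("female", "female", "male", "male", "male", "male", "female", "male"),
  Pclass = c(1, 3, 1, 3, 3, 1, 3, 3),
  Survived = c(1, 0, 0, 0, 1, 1, 1, 0)
)

counts <- table(Sex = toy$Sex, Pclass = toy$Pclass, Survived = toy$Survived)
# Proportions within each Sex x Pclass cell, so each cell's 0/1 shares sum to 1
within.cell <- prop.table(counts, margin = c(1, 2))
# as.data.frame() on a table names the value column "Freq" by default
rates <- as.data.frame(within.cell)
# Keep only the rows giving the share of survivors
survivors <- rates[rates$Survived == "1", ]
print(survivors)
```

Comparing the Freq values for the same class across the two genders separates the effect of gender from the effect of class.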
Data Partitioning
percent(train.len/(train.len+test.len),digits=2)
percent(test.len/(train.len+test.len),digits=2)
The data is partitioned into 68% for the training set and 32% for the test set.
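The percent function used above comes from library.R and is not shown in this post. A minimal stand-in, assuming it simply formats a proportion as a percentage string, might look like the sketch below; the real helper may behave differently. The row counts are those of the Kaggle training and test files.

```r
# Hypothetical stand-in for the percent() helper from library.R (the real one is not shown)
percent <- function(x, digits = 2) {
  paste0(formatC(100 * x, format = "f", digits = digits), "%")
}

# Row counts of the Kaggle training and test files
train.len <- 891
test.len <- 418
train.share <- percent(train.len / (train.len + test.len), digits = 2)
test.share <- percent(test.len / (train.len + test.len), digits = 2)
print(c(train = train.share, test = test.share))
```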
Graphical Analysis
library(ggplot2)
# geom_bar() counts rows per category by default; stat = "bin" fails on a discrete x in current ggplot2
ggplot(train.data, aes(Sex.factor, fill = Survived)) +
  geom_bar(position = "stack") +
  ggtitle("Women have higher chance of survival")
# Survival counts stacked within each passenger class
ggplot(train.data, aes(x = Pclass.factor, fill = Survived)) +
  geom_bar(position = "stack") +
  ggtitle("In class 1 most people survived and in class 3 most did not")
# Same plot, split into one panel per gender
ggplot(train.data, aes(x = Pclass.factor, fill = Survived)) +
  geom_bar(position = "stack") +
  facet_wrap(~Sex.factor) +
  ggtitle("Most women survived across passenger classes")
Although more than 10% of the Age values are missing in the training data, we can still see some patterns in the effect of a passenger’s age on survival. Specifically, kids and teenagers had a better chance of survival, while the elderly were at a disadvantage.
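The same age pattern can be checked in tabular form by binning Age and computing the survival rate per band. This is a sketch only: `survival.by.age` is a hypothetical helper, the band edges are an assumption chosen for illustration, and the toy vectors are made-up values, not the Kaggle data.

```r
# Hypothetical helper: survival rate per age band, ignoring rows with missing Age.
# The band edges are an assumption chosen for illustration.
survival.by.age <- function(age, survived,
                            breaks = c(0, 12, 19, 59, Inf),
                            labels = c("child", "teen", "adult", "elderly")) {
  known <- !is.na(age)
  band <- cut(age[known], breaks = breaks, labels = labels)
  # survived is coded 0/1, so the mean within a band is the survival rate
  tapply(as.numeric(as.character(survived[known])), band, mean)
}

# Toy values for illustration only (not the Kaggle data)
toy.age <- c(4, 15, 30, 45, 70, NA, 8, 65)
toy.survived <- c(1, 1, 0, 1, 0, 1, 1, 0)
age.rates <- survival.by.age(toy.age, toy.survived)
print(age.rates)
```

On the tutorial's data the call would be survival.by.age(train.data$Age, train.data$Survived). Rows with missing Age are dropped, so the rates describe only passengers whose age is known.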
ggplot(train.data, aes(y = Fare, x = Survived)) +
  geom_boxplot() +
  facet_wrap(~Pclass.factor, ncol = 3, scales = "free") +
  ggtitle("People who survived paid higher fare on average")
Statistical Analysis
We saw that fare may have some effect on the survival rate. Let’s see if the effect is real.
So, we see that in class 1, the difference in fare between passengers who survived and those who didn’t is statistically significant. Maybe the rich found a way to buy lifeboats :).
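The test behind this result is not shown in this post. One plausible way to run such a comparison, assuming a Welch two-sample t-test on fares within a single class, is sketched below. `fare.test` is a hypothetical helper and `toy` is synthetic illustration data, so the printed p-value says nothing about the real passengers.

```r
# Hypothetical helper: Welch two-sample t-test of Fare between survivors and
# non-survivors within one passenger class.
fare.test <- function(data, pclass) {
  in.class <- data[data$Pclass == pclass, ]
  t.test(Fare ~ Survived, data = in.class)
}

# Synthetic rows for illustration only (not the Kaggle data)
toy <- data.frame(
  Pclass = c(1, 1, 1, 1, 1, 1, 3, 3),
  Survived = c(0, 0, 0, 1, 1, 1, 0, 1),
  Fare = c(30, 35, 40, 80, 90, 100, 8, 7)
)
result <- fare.test(toy, 1)
print(result$p.value)
```

On the tutorial's data the call would be fare.test(train.data, 1). Fares are strongly right-skewed, so a rank-based test such as wilcox.test(Fare ~ Survived, data = ...) is a reasonable cross-check.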
In part 2, we will start doing machine learning and submit our first prediction to Kaggle!
[3] I know it’s not a proper name for the column. It is the default column name used by the printr package when the table is represented as a data frame.