There is a famous “Getting Started” machine learning competition on Kaggle, called Titanic: Machine Learning from Disaster. It is there for us to experiment with the data and different algorithms and to measure our progress against benchmarks. We are given data about the passengers of the Titanic, and our goal is to predict which passengers survived the tragedy.
There are multiple useful tutorials out there discussing the same problem and dataset. This tutorial is geared towards people who are already familiar with R and willing to learn some machine learning concepts without dealing with too many technical details.
In part 1, we will get to know the data a little and prepare it for further analysis.
Loading Data & Initial Analysis
The data is given as two separate files for training and test. Our goal is to predict the Survived variable for the test dataset. We will use the training set to learn from the data.
I have moved my user-defined functions to the library.R file to keep the code here clean. You can check it out in the GitHub repository for this project.
I’m using the printr package for better-looking printed output. You can install it if you like how the output looks.
# Load data
train.data <- read.csv("input/train.csv", stringsAsFactors = F)
test.data  <- read.csv("input/test.csv", stringsAsFactors = F)

# Keep the outcome in a separate variable
train.outcome <- train.data$Survived
train.outcome <- factor(train.outcome)

# Remove "Survived" column
train.data <- train.data[, -2]
train.len  <- nrow(train.data)
test.len   <- nrow(test.data)

# Combine training & testing data
full.data <- rbind(train.data, test.data)

# Create factor version of categorical vars

# Split data back into training and testing
train.data <- full.data[1:train.len, ]
test.data  <- full.data[(train.len + 1):nrow(full.data), ]

# Check the summary of quantitative and categorical variables

# Add the outcome column
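The factor-conversion, summary, and outcome steps are elided above. A minimal sketch of what they might look like, using a small hypothetical data frame in place of full.data; the Sex.factor and Pclass.factor column names match those used in the plots later:

```r
# Toy stand-in for full.data (hypothetical values, for illustration only)
full.data <- data.frame(
  Pclass = c(1, 3, 2, 3),
  Sex    = c("female", "male", "female", "male"),
  stringsAsFactors = FALSE
)

# Create factor versions of the categorical variables
full.data$Sex.factor    <- factor(full.data$Sex)
full.data$Pclass.factor <- factor(full.data$Pclass)

# Split back into training and testing (train.len is hypothetical here)
train.len  <- 2
train.data <- full.data[1:train.len, ]
test.data  <- full.data[(train.len + 1):nrow(full.data), ]

# Add the outcome column back to the training data
train.outcome <- factor(c(1, 0), levels = c(0, 1))
train.data$Survived <- train.outcome

# Summary of quantitative and categorical variables
summary(train.data)
```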
So far, we have put together the training and test data, converted the categorical variables into factors, and then separated the data sets again. It is easier to apply the transformations this way, rather than doing the same thing twice on two different data sets. Finally, we summarized the quantitative and categorical variables to get some sense of the training data.
The following definitions are given at the competition website:
Most people didn’t survive (62%), but the survival rate is not the same across different groups. Females had a higher chance of survival, 74% compared to 19% for males. Class 1 passengers had a 63% chance of survival, compared to 47% and 24% for classes 2 and 3, respectively.
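Rates like these can be computed with table and prop.table. A sketch on a toy data frame (the percentages quoted above come from the real train.data, not this toy sample):

```r
# Toy sample; Survived and Sex mimic the real columns
toy <- data.frame(
  Survived = factor(c(0, 1, 1, 0, 0, 1)),
  Sex      = c("male", "female", "female", "male", "male", "female")
)

# Overall survival proportion
prop.table(table(toy$Survived))

# Survival proportion within each sex: margin = 1 makes each row sum to 1
prop.table(table(toy$Sex, toy$Survived), margin = 1)
```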
We see that the survival rate differs across classes, but we’re not sure yet whether this is the result of different proportions of females in the different passenger classes. Let’s check if this is the case.
The Freq3 column shows the survival proportion of people with the same gender and passenger class. The survival rate in the different classes may be partly explained by the percentage of women in those classes. Obviously, the male passengers are at a disadvantage across the board.
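A Freq3-style table of within-group survival proportions can be built from a three-way table, normalizing over the Sex × Pclass margins. A sketch on toy data (the table in the post is computed from the real train.data):

```r
# Toy sample with the three variables of interest
toy <- data.frame(
  Survived = factor(c(1, 0, 1, 0, 0, 0, 1, 1)),
  Sex      = c("female", "male", "female", "male",
               "male", "female", "female", "male"),
  Pclass   = factor(c(1, 1, 3, 3, 1, 3, 1, 3))
)

# Proportion surviving within each (Sex, Pclass) cell:
# margin = c(1, 2) makes the proportions in each cell sum to 1
tab <- prop.table(table(toy$Sex, toy$Pclass, toy$Survived),
                  margin = c(1, 2))
as.data.frame(tab)
```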
The data is partitioned into 68% for the training set and 32% for the test set.
ggplot(train.data, aes(x = Sex.factor, fill = Survived)) +
  geom_bar(position = "stack") +
  ggtitle("Women have higher chance of survival")

ggplot(train.data, aes(x = Pclass.factor, fill = Survived)) +
  geom_bar(position = "stack") +
  ggtitle("In class 1 most people survived and in class 3 most did not")

ggplot(train.data, aes(x = Pclass.factor, fill = Survived)) +
  geom_bar(position = "stack") +
  facet_wrap(~ Sex.factor) +
  ggtitle("Most women survived across passenger classes")
Although more than 10% of the Age values are missing in the training data, we can still see some patterns in the effect of a passenger’s age on survival. Specifically, kids and teenagers had a better chance of survival, while the elderly were at a disadvantage.
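One way to check the missing-age share and the age pattern is to compute the survival proportion within coarse age bands. A sketch on toy data; the band cut points (18 and 60) are an assumption, not the post's:

```r
# Toy sample with some missing ages
toy <- data.frame(
  Age      = c(4, 15, 22, 35, 60, NA, 71, NA),
  Survived = factor(c(1, 1, 1, 0, 0, 0, 0, 1))
)

# Share of missing Age values
mean(is.na(toy$Age))

# Survival proportion within coarse age bands (NA ages are dropped)
toy$Age.group <- cut(toy$Age, breaks = c(0, 18, 60, Inf),
                     labels = c("child/teen", "adult", "elderly"))
prop.table(table(toy$Age.group, toy$Survived), margin = 1)
```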