02 May 2016

## Machine Learning

In part 1 of this tutorial, we analyzed the data and prepared it for machine learning. Now, we are ready for some action.

### Quick Intro

We are interested in predicting an outcome (response) variable, given the other features (predictors) of our data points. This is called supervised learning, there is a set of oucomes in the training data to learn from and there will be a result for each given data point at the end. This is in contrast to unsupervised learning, where there is no response variable and the observations are grouped together based on a measure of similarity.

The response variable is categorical in this case. There are two classes of passengers, people who survived and people who did not. This type of problems is called classification, compared to regression problems where we predict a quantitative response.

### Decision Trees

Back to the data, we see that only 38.38% of people survived. So, if we start off by classifying everybody as `Survived= 0`, we will be wrong by 38%. Moreover, if we partition by gender and label females as `Survived= 1` and males as `Survived= 0`, we will be wrong by 25.8 for women and 18.89 for men. Overall, our error will be 21.32%, which is way better that the previous 38%.

This is exactly the way decision trees work. In each round, the algorithm picks the best variable that improves the result in the best way, partitions data based on that vairable, and at the end labels data by the majority in each leaf. We’ll see the example below.

`caret` is the umbrella package for machine learning using R. Different groups have developed different machine learning algorithms, where the signature of the methods are different. It means that it makes it hard to switch from one algorithm to the other. `caret` package solves this problem by unifying the interface for the main functions. `rpart` is one of the packages implementing the decision trees in R. We have trained the model to predict `Survived` using `Sex.factor` and `Pclass.factor` using `train.data` and instructed `caret` to use decision trees implemented by `rpart`. Then, predicted `Survived` using the same training data. Here, the `Survived` label in the data is ignored and we are given an array of labels as the result.

Using the `confusionMatrix` function, we can compare our predictions to actual classes of the data. This is only possible when we know the labels in advance, i.e. training data. The important measure for us is Accuracy, which is 78.68% here. This is the percentage of the cases we got right. Note this is 1 - 21.32% we calculated before.

Although we have taken the passenger class into account, the result is not any better than just considering the gender. We have suggested `Sex.factor` and `Pclass.factor` as the predictors for `Survival`, but `Pclass.factor` didn’t get picked up by `rpart` as it didn’t add any new information to the decision tree.

Next, we’ve plotted the decision tree using the `plot.decision.tree` function from `library.R` to see the current situation. Different colors here represent the predicted class, which is the class of majority in that specific node. At the root, we have the complete data set and different branches represent partitioned data based on the given conditions along the way. In this example, numbers at the right of the node represent proportion of the data with `Survived= 0` and the numbers at left are proportion who didn’t survive.

Now, let’s add `Fare` to the mix of predictors and see if we can get anything out of it. At the first glance, we have a higher accuracy rate. This may be a good sign. The decision tree have found a mix of `Pclass.factor= 3` and `Fare >= 23` conditions to provide a more accurate prediction. We used to predict that women survive with a 74% chance. Now, we can be more accurate. Women have a 95% chance of survival if they are not in class 3. For the women in class 3, the survival rate is 11% if the fare was more than \$23, and 59% otherwise. Let’s see if it makes sense. So, the decision tree discovered a special condition that makes a difference in survival. It would not be easy for us to discover on our own.

### Prediction

We are almost ready for our first prediction, but we’ve got a little problem.

Checking the `Fare` predictor, we see that it has a missing value in the test data. Since our decision tree needs to know this value, it cannot predict `Survive` for this data point. So, let’s impute the missing value with the median for the `Fare`.

Then, we have predicted the `Survive` class using `get.solution` function from `library.R`. It uses `predict` function and the given decision tree to predict the outcome for the given test data and builds the data frame the way Kaggle expects. Check the code below.

And finally, we write the prediction to a CSV file to be submitted to Kaggle.

When I submitted this file to Kaggle, I got a score of .78469. It is right above the benchmark titled “Gender, Price, and Class Based Model” (0.7799).

### Dealing With Missing Data

We already know that age can be a good predictor for survival. We also know that the decision tree algorithm cannot predict the outcome using the predictors with missing values. From our numerical analysis, we know there are 177 instances with `Age` missing. So, we need to impute the missing values.

Although we can still use the median age as the imputed value, it would not an optimal strategy. It is highly unlikely that everybody with a missing value for age is actually of the same age. We need to use a better estimate if possible.

We can use the average age of similar people as the missing value for age. The K-Nearest Neighbors (KNN) algorithm does this for us. We will use the value for `Pclass, Age, SibSp, Parch, Fare` to find similar nodes (neighbors) and use their mean age as the imputed age. We use the default value of 5 for `k`.

`preProcess` method of `caret` package is used to impute the missing values with `method= "knnImpute"`. Because `preProcess` centers and scales the data in the process of `knnImpute`, we need to convert them back to the original scale to be used as valid age values. The `Age` variable is updated to the imputed values for the people with missing `Age`. The differences still exist in the survival rate for people in different age ranges. Let’s check the decision tree of `Sex`, `Pclass`, and `Age` instead of `Fare`. We have a more detailed tree this time. Age especially matters among class 3 females, who survied by 50%. In this group, people over 38 survived by 8%, 21-38 year by 64%, 5.5-21 years by 32% and less than 5.5 by 73%.

We have an increase in accuracy for training data compared to using `Sex`, `Pclass`, and `Fare` from 81.03 to 81.71, but when submitted to Kaggle the score was .74641 as opposed to .78469 we had before. What is going on here?

We will look into this issue in part 3.