Logistic Regression Tutorial in R: Titanic Case Study
If you are new to R, don't worry: it is a simple language to work with, and its many packages make our life easier. If you don't know R, just visit our Basic R Tutorial.
The Problem we will solve
Here we will predict the chances of survival of passengers on the Titanic based on variables such as gender, age and class of service.
- We will find which variables predict survival best
- Next, we will build a predictive model that outputs the estimated chance of survival for each passenger
Dataset
We will be working with the Titanic dataset. Please use these links to download the data set.
Test Dataset – Download
Train Dataset – Download
Importing the dataset
First, we will import the data into R. We will use the read.csv() command to import the dataset.
This command imports a .csv (comma-separated values) file. It takes several arguments, which are explained below.
file.choose() opens a dialog that lets us select the file we want to import.
header is a logical value which indicates whether the file contains the variable names on its first line.
We have some missing values in our dataset which need to be encoded as NA (not available) using the na.strings = c("") argument, so that empty fields are read as NA. Further missing value treatment is shown later in the tutorial.
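As a rough sketch of the import step, the snippet below reads a few made-up rows from an inline string instead of a real file, so it runs without any download. The rows and the name titanic_train are illustrative assumptions, not the actual dataset; with the real file you would pass file.choose() (or a path) as the first argument instead of text =.

```r
# A tiny stand-in for the Titanic training file (made-up rows, not the real data).
csv_text <- "PassengerId,Survived,Pclass,Sex,Age,Cabin
1,0,3,male,22,
2,1,1,female,38,C85
3,1,3,female,,
4,0,2,male,35,"

# With a real file: read.csv(file.choose(), header = TRUE, na.strings = c(""))
titanic_train <- read.csv(
  text = csv_text,
  header = TRUE,        # the first line holds the column names
  na.strings = c("")    # read empty fields as NA
)

# Inspect the structure and count the NA values per column.
str(titanic_train)
print(colSums(is.na(titanic_train)))
```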
Once the data is successfully imported, you should see the data like this:
Data Cleaning Process
A visual plot of the missing values helps us understand the dataset better. It is important to visualize the data and look for patterns and relationships before we start modelling; many people visualize the data too late.
We will plot it using the missmap() function provided by the Amelia package.
This will give a plot of the missing values in the given dataset.
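A minimal sketch of this check, on a made-up data frame rather than the real data: it counts the missing values per column with base R and draws the missmap() plot only if the Amelia package happens to be installed. The data frame contents are assumptions for illustration.

```r
# Made-up rows standing in for the imported data (not the real Titanic file).
titanic_train <- data.frame(
  Survived = c(0, 1, 1, 0, 0),
  Age      = c(22, 38, NA, 35, NA),
  Cabin    = c(NA, "C85", NA, NA, NA)
)

# Count and share of missing values per column, using base R only.
missing_count <- colSums(is.na(titanic_train))
missing_share <- round(colMeans(is.na(titanic_train)), 2)
print(missing_count)
print(missing_share)

# Draw the missingness map only when the Amelia package is available.
if (requireNamespace("Amelia", quietly = TRUE)) {
  Amelia::missmap(titanic_train)
}
```

The numeric counts are a useful companion to the plot, since they show exactly how many values each column is missing.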
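The plot shows plenty of missing values in the Cabin variable, so we will not use it for prediction. We will keep columns 2, 3, 5, 6, 7, 8, 10 and 12. Assuming the standard Kaggle Titanic training file, these are Survived (the response variable) followed by Pclass, Sex, Age, SibSp, Parch, Fare and Embarked as candidate independent variables. They have few or no missing values; the gaps in Age are treated in the next section.
So we will make a subset of these columns and use it as a new dataset, using the subset() function.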
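The column selection can be sketched as follows. The data frame is made up, and its 12-column layout and column names are assumed to match the standard Kaggle Titanic training file; titanic_data is an illustrative name.

```r
# Made-up data frame with the assumed 12-column layout (not the real data).
titanic_train <- data.frame(
  PassengerId = 1:3, Survived = c(0, 1, 1), Pclass = c(3, 1, 3),
  Name = c("A", "B", "C"), Sex = c("male", "female", "female"),
  Age = c(22, 38, NA), SibSp = c(1, 1, 0), Parch = c(0, 0, 0),
  Ticket = c("T1", "T2", "T3"), Fare = c(7.25, 71.28, 7.92),
  Cabin = c(NA, "C85", NA), Embarked = c("S", "C", "S")
)

# Keep columns 2, 3, 5, 6, 7, 8, 10 and 12 by position.
titanic_data <- subset(titanic_train, select = c(2, 3, 5, 6, 7, 8, 10, 12))
print(names(titanic_data))
```

Selecting by name, for example select = c(Survived, Pclass, Sex), is usually safer than selecting by position if the column order might change.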
When the subset is made it should look like this
Handling missing values
We can handle missing values by replacing them with the mean, median or mode of the column, or with a global constant. We will use the mean.
The $ symbol selects a particular column of the dataset. Here we select the Age column.
is.na() returns TRUE for every missing value in the Age column; we use it as an index and assign the mean of the non-missing ages to those positions.
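A short sketch of mean imputation on a made-up Age column (the values are illustrative, not taken from the dataset):

```r
# Made-up ages with two missing values.
titanic_data <- data.frame(Age = c(22, 38, NA, 35, NA))

# Mean of the observed ages; na.rm = TRUE skips the NA entries.
age_mean <- mean(titanic_data$Age, na.rm = TRUE)

# is.na() marks the missing positions; the assignment fills them with the mean.
titanic_data$Age[is.na(titanic_data$Age)] <- age_mean
print(titanic_data$Age)
```

Without na.rm = TRUE, mean() would itself return NA and nothing would be fixed.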
We will plot the missing-value map again to check the result.
The new plot should show no missing values; if any are left, we will handle them later.
Model Fitting
Now we will split our data into two parts: training and testing. The training dataset is the part of our data that is used to fit, or train, the prediction model. After training the model, we use the rest of the data, called the testing dataset or simply test data, to check the accuracy of the model we have built. In this case we take rows 1 to 800 as the training dataset and the remaining rows as the testing dataset.
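The row-based split can be sketched like this. The placeholder data frame has 891 rows, which is the size of the standard Kaggle training file and an assumption here; train and test are illustrative names.

```r
# Placeholder data with 891 rows (assumed size; check nrow() on your own data).
titanic_data <- data.frame(id = 1:891)

# Rows 1 to 800 for training, the remaining rows for testing.
train <- titanic_data[1:800, , drop = FALSE]
test  <- titanic_data[801:nrow(titanic_data), , drop = FALSE]

print(c(train = nrow(train), test = nrow(test)))
```

A split by row position assumes the rows are in no meaningful order; if they are sorted in some way, a random split with sample() would be a safer choice.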
We will now fit our model using the glm() function.
glm (generalized linear model) is a function that fits a model from a symbolic description, the formula, which is passed as an argument together with the error distribution (family) and the data.
The form of glm function is:
glm(formula,family=familytype(link=linkfunction),data=)
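To make the template concrete, here is a hedged sketch that fits a logistic regression of Survived on Pclass and Sex. The data is simulated purely for illustration (it is not the Titanic data), and the names sim and model are assumptions.

```r
set.seed(1)
n <- 200

# Simulated passengers: class 1 to 3 and gender, drawn at random.
sim <- data.frame(
  Pclass = sample(1:3, n, replace = TRUE),
  Sex    = factor(sample(c("female", "male"), n, replace = TRUE))
)

# Simulated outcome: survival is less likely for males and for higher class numbers.
p <- plogis(2.5 - 0.8 * sim$Pclass - 2 * (sim$Sex == "male"))
sim$Survived <- rbinom(n, 1, p)

# Logistic regression: binomial family with the logit link.
model <- glm(Survived ~ Pclass + Sex,
             family = binomial(link = "logit"),
             data = sim)
print(summary(model))
```

The binomial family with the logit link is what makes this a logistic regression; logit is also the default link for binomial(), so binomial alone would give the same fit.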
We will get the following results from the above model.
We built the formula on the basis of the Pclass and Sex of the passengers. The Deviance Residuals section summarizes how far the model's predictions are from the observed responses; it reports the minimum, first quartile, median, third quartile and maximum of the residuals.
Coefficients gives the estimated coefficients and their estimated standard errors, together with the z value and the p-value of each.
Number of Fisher Scoring iterations is the number of iterations the fitting algorithm needed to converge.
When we study the coefficients, we observe that the variables class and gender (Pclass and Sex) are statistically significant (with p-value < 0.001).
Most authors refer to statistically significant as P < 0.05 and statistically highly significant as P < 0.001. A p-value below 0.001 means that, if the variable truly had no effect, a result at least this extreme would occur less than one time in a thousand.
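The coefficient table and p-values can also be pulled out of the fitted model programmatically. The sketch below does this on simulated data (not the Titanic data); sim, model and coef_table are illustrative names.

```r
set.seed(2)
n <- 300

# Simulated data, for illustration only.
sim <- data.frame(
  Pclass = sample(1:3, n, replace = TRUE),
  Sex    = factor(sample(c("female", "male"), n, replace = TRUE))
)
sim$Survived <- rbinom(n, 1, plogis(2.5 - 0.8 * sim$Pclass - 2 * (sim$Sex == "male")))

model <- glm(Survived ~ Pclass + Sex, family = binomial(link = "logit"), data = sim)

# Matrix with the estimate, standard error, z value and p-value per coefficient.
coef_table <- coef(summary(model))
print(round(coef_table, 4))

# Names of the coefficients whose p-value is below the usual 0.05 threshold.
p_values <- coef_table[, "Pr(>|z|)"]
print(names(p_values)[p_values < 0.05])

# Number of Fisher scoring iterations the fit needed.
print(model$iter)
```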
Next, we will fit a glm model that includes the interaction between class and gender.
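One way to write such an interaction model is with the * operator in the formula, which expands to both main effects plus their interaction. This is a sketch on simulated data, and the exact formula used for the tutorial's own results is an assumption here.

```r
set.seed(3)
n <- 300

# Simulated data, for illustration only (not the Titanic data).
sim <- data.frame(
  Pclass = sample(1:3, n, replace = TRUE),
  Sex    = factor(sample(c("female", "male"), n, replace = TRUE))
)
sim$Survived <- rbinom(n, 1, plogis(2.5 - 0.8 * sim$Pclass - 2 * (sim$Sex == "male")))

# Pclass * Sex is shorthand for Pclass + Sex + Pclass:Sex.
model_int <- glm(Survived ~ Pclass * Sex,
                 family = binomial(link = "logit"),
                 data = sim)
print(names(coef(model_int)))
```

The extra Pclass:Sexmale coefficient measures how much the effect of class differs between male and female passengers.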
This gives the new results:
We will now make predictions for the test dataset using the predict() function. First we import the test dataset, which looks like this:
The result, titanic_predict, holds the predicted probabilities of survival of the passengers. (For a glm, predict() returns probabilities only when called with type = "response"; the default is the log-odds scale.) We can see the first 6 predictions using the head() function.
We can see all the probabilities by typing titanic_predict at the console.
This is how we can predict the probability of survival of the passengers. We can visualize the predictions as a histogram using the command hist(titanic_predict).
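The whole prediction step can be sketched end to end as follows. The data is simulated for illustration, the split sizes are arbitrary, and apart from titanic_predict the object names are assumptions.

```r
set.seed(42)
n <- 300

# Simulated data, for illustration only (not the Titanic data).
sim <- data.frame(
  Pclass = sample(1:3, n, replace = TRUE),
  Sex    = factor(sample(c("female", "male"), n, replace = TRUE))
)
sim$Survived <- rbinom(n, 1, plogis(2.5 - 0.8 * sim$Pclass - 2 * (sim$Sex == "male")))

# Fit on the first 250 rows, predict on the remaining 50.
train <- sim[1:250, ]
test  <- sim[251:n, ]
model <- glm(Survived ~ Pclass + Sex, family = binomial(link = "logit"), data = train)

# type = "response" returns probabilities instead of log-odds.
titanic_predict <- predict(model, newdata = test, type = "response")

print(head(titanic_predict))   # first 6 predicted probabilities
hist(titanic_predict,
     main = "Predicted survival probabilities",
     xlab = "Probability of survival")
```

To turn the probabilities into yes/no predictions, a common choice is a 0.5 cutoff, for example ifelse(titanic_predict > 0.5, 1, 0).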