Prediction of Kaggle Titanic Survival

serispoorthi
Feb 25, 2022
2 min read

Overview

This is the Titanic ML competition. Using machine learning, we need to create a model that predicts which passengers survived the Titanic shipwreck.

Once you are ready to start competing, click the "Join Competition" button to create an account and gain access to the competition data.

The sinking of the Titanic is one of the most infamous shipwrecks in history.

In this challenge, we are asked to build a predictive model that answers the question “What sorts of people were more likely to survive?” using passenger data (i.e., name, age, gender, socio-economic class, etc.).

In this competition, we use the Titanic passenger data (name, age, price of ticket, etc.) to try to predict who survived and who died. The data comes in three files: (1) train.csv, (2) test.csv, and (3) gender_submission.csv. The first two contain similar passenger information such as name, age, gender, and socio-economic class.

train.csv contains the details of a subset of the passengers on board (891 to be exact) and, importantly, reveals whether they survived or not, also known as the “ground truth”.

The values in the "Survived" column show whether each passenger survived or not:

if it's a "1", the passenger survived.
if it's a "0", the passenger died.

The test.csv dataset contains similar information but does not disclose the “ground truth” for each passenger. Note that test.csv does not have a "Survived" column - this information is hidden and we need to predict these outcomes.

Using the patterns found in the train.csv data, we need to predict whether the other 418 passengers on board (found in test.csv) survived.

The gender_submission.csv file is provided as an example that shows how to structure your predictions. It predicts that all female passengers survived and all male passengers died. Your own submission should have the same structure as this file:

a "PassengerId" column containing the IDs of each passenger from test.csv.
a "Survived" column (that you will create!) with a "1" for the rows where you think the passenger survived, and a "0" where you predict that the passenger died.
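
As a minimal sketch of that structure, the snippet below builds a two-column submission using the same rule as gender_submission.csv (female = 1, male = 0). The rows in `test_data` are made up for illustration and stand in for the real test.csv; the variable names are assumptions, not part of the competition.

```python
import pandas as pd

# Toy stand-in for test.csv: made-up rows, not real passenger records.
# The real file has 418 passengers and more columns.
test_data = pd.DataFrame({
    "PassengerId": [892, 893, 894, 895],
    "Sex": ["male", "female", "male", "female"],
})

# Same rule as gender_submission.csv: females survive (1), males do not (0).
submission = pd.DataFrame({
    "PassengerId": test_data["PassengerId"],
    "Survived": (test_data["Sex"] == "female").astype(int),
})

# Show the CSV text that would be uploaded (no index column).
print(submission.to_csv(index=False))
```

Passing `index=False` matters here: without it pandas would write an extra index column, and the file would no longer have exactly the two columns described above.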

Loading data into Pandas dataframe

The CSV file contains information about the people on board the Titanic. PassengerId is just an identification number given to each passenger, and the Survived column tells us whether they survived the sinking (1) or not (0).

Code
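
The original code for this step is not shown here, so the following is only a sketch of how the data could be loaded with pandas. To keep it runnable without the Kaggle files, it reads a few made-up rows from a string; with the real data you would pass the path to train.csv instead. Only a handful of columns are included, and the name `train_data` is an assumption.

```python
import io

import pandas as pd

# Toy stand-in for train.csv: made-up rows, not real passenger records.
# Only a few of the file's columns are included here.
SAMPLE_CSV = """PassengerId,Survived,Pclass,Sex,Age,Fare
1,0,3,male,22,7.25
2,1,1,female,38,71.28
3,1,3,female,26,7.93
4,0,2,male,35,13.00
"""

# With the real data this would be: pd.read_csv("train.csv")
train_data = pd.read_csv(io.StringIO(SAMPLE_CSV))

print(train_data.head())   # first rows of the DataFrame
print(train_data.shape)    # (number of rows, number of columns)
```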

Percentage of men and women who survived
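
One way to compute these percentages is to group by the Sex column and take the mean of Survived, since the column only holds 0 and 1. The sketch below uses a small made-up table rather than the real train.csv, so its numbers are illustrative only.

```python
import pandas as pd

# Toy data, made up for illustration; the real train.csv has 891 rows.
train_data = pd.DataFrame({
    "Sex": ["female", "female", "female", "female",
            "male", "male", "male", "male", "male"],
    "Survived": [1, 1, 1, 0,
                 0, 0, 0, 0, 1],
})

# Survived is 0/1, so the mean within each group is the share of survivors.
survival_rate = train_data.groupby("Sex")["Survived"].mean() * 100

print(survival_rate.round(1))
```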

So our hypothesis, that women were more likely to survive than men, is largely borne out: almost 75% of the women survived the sinking, while only 19% of the men did. If we used just the Sex column for our prediction, we could already expect roughly 60% to 70% accuracy or better. We will use a Random Forest Classifier as our machine learning model. Once the model is trained on train.csv, we use it to predict survival for the passengers in test.csv. We save these predictions in a submissions.csv file and submit the results to Kaggle.
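
The train, predict, and save steps could look roughly like the sketch below, using scikit-learn's `RandomForestClassifier`. The feature list, the hyperparameters, and the toy rows are all assumptions chosen for illustration; the post does not say which columns or settings were actually used.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Toy stand-ins for train.csv and test.csv (made-up rows).
train_data = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5, 6, 7, 8],
    "Pclass": [3, 1, 3, 1, 3, 2, 1, 2],
    "Sex": ["male", "female", "female", "female",
            "male", "male", "male", "female"],
    "Survived": [0, 1, 1, 1, 0, 0, 0, 1],
})
test_data = pd.DataFrame({
    "PassengerId": [892, 893, 894],
    "Pclass": [3, 3, 2],
    "Sex": ["male", "female", "male"],
})

# Assumed feature set; one-hot encode the text column so the model gets numbers.
features = ["Pclass", "Sex"]
X_train = pd.get_dummies(train_data[features])
X_test = pd.get_dummies(test_data[features])
# Keep the test columns identical to the training columns.
X_test = X_test.reindex(columns=X_train.columns, fill_value=0)
y_train = train_data["Survived"]

# Assumed hyperparameters; random_state makes the run repeatable.
model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
model.fit(X_train, y_train)          # learn from the labelled passengers
predictions = model.predict(X_test)  # predict for the unlabelled passengers

# Save in the two-column format the competition expects.
output = pd.DataFrame({
    "PassengerId": test_data["PassengerId"],
    "Survived": predictions,
})
output.to_csv("submissions.csv", index=False)
```

Note that the model is fitted only on the training data; the test data is passed to `predict`, never to `fit`.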

Final Submissions

With the above analysis, we get a score of 0.77511, which means our model correctly predicted 77.511% of the entries. This is a definite improvement on our hypothesis.
