
Dataiku-Titanic

This is my take on Kaggle's Titanic challenge, built with Dataiku's Machine Learning platform.

Kaggle’s Titanic Challenge on Dataiku!

Introduction

Titanic was under the command of Capt. Edward Smith, who also went down with the ship. The ocean liner carried some of the wealthiest people in the world, as well as hundreds of emigrants from Great Britain and Ireland, Scandinavia and elsewhere throughout Europe, who were seeking a new life in the United States. The first-class accommodation was designed to be the pinnacle of comfort and luxury, with a gymnasium, swimming pool, libraries, high-class restaurants and opulent cabins.

Installation

This repository holds my personal projects related to Machine Learning, EDA, Python Jupyter Notebooks, and a couple of visualizations, all based on standard files exported from the Dataiku platform. Most of the datasets I've been working with were downloaded from Kaggle. Installation is pretty straightforward: simply download the whole set as a single project ZIP file. Everything has been flattened into plain text files and no SQL dump is involved, so there should be no missing system dependency issues.

Data Flow

Titanic Dataflow

This is how I built the data flow, and at the end of it I apply both the Logistic Regression and the Decision Tree model to tackle the Machine Learning challenge.
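Dataiku's visual ML handles this training step, but purely as a reference, here is a minimal scikit-learn sketch of the same two models. The file name train_prepared.csv and the feature list are assumptions for illustration, not the exact outputs of the flow.

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

train = pd.read_csv('train_prepared.csv')      # hypothetical export of the prepared dataset
features = ['Pclass', 'Sex', 'Title', 'Fare']  # illustrative subset of the engineered columns
X, y = train[features], train['Survived']

# train and compare the two models used at the end of the flow
for model in (LogisticRegression(max_iter=1000), DecisionTreeClassifier(max_depth=5)):
    scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
    print(type(model).__name__, round(scores.mean(), 3))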

Features Handling

Titanic Features Handling

Many of the dataset's features have been transformed with the one-hot encoding method, so that the ML algorithm can understand and decipher them better: their categorical values become numeric dummy columns.
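For comparison outside Dataiku, the same idea can be sketched in pandas; the Embarked column is just one example of a categorical feature from this dataset.

import pandas as pd

train = pd.read_csv('train.csv')
# turn the categorical Embarked column into 0/1 dummy columns (one-hot encoding)
embarked_dummies = pd.get_dummies(train['Embarked'], prefix='Embarked')
train = pd.concat([train.drop(columns='Embarked'), embarked_dummies], axis=1)
print(train.filter(like='Embarked').head())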

Initial Dataset Features

Titanic Initial Dataset Handling

Data Dictionary

By default, the initial dataset from Kaggle's challenge page gives you the features listed above. We'll try to optimize them into something much more Machine-Learning-friendly, and this is how I did it.
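If you want to inspect those initial features yourself in a notebook, a quick look (a sketch, assuming the raw train.csv from Kaggle) could be:

import pandas as pd

train = pd.read_csv('train.csv')  # raw training file from Kaggle
train.info()                      # column names, dtypes, and missing-value counts
train.head()                      # first few rows of the original features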

Name Column

This is what you would normally do in Python to extract information from the Name column. The objective is to pull out the Title, so that it can be turned into something more categorical that the Machine Learning algorithm can understand better.

train_test_data = [train, test]  # combine the train and test DataFrames
for dataset in train_test_data:
    # extract the honorific (e.g. "Mr", "Mrs") that precedes the "." in each name
    dataset['Title'] = dataset['Name'].str.extract(r'([A-Za-z]+)\.', expand=False)

Here's the equivalent method in Dataiku, producing a similar output through its data manipulation recipe canvas.

Cleaning The Name Column

Name Remapping

Now let's map those Title values to categorical numbers.

You can do the same in Python with the following code:

title_mapping = {"Mr": 0, "Miss": 1, "Mrs": 2,
                 "Master": 3, "Dr": 3, "Rev": 3, "Col": 3, "Major": 3, "Mlle": 3, "Countess": 3,
                 "Ms": 3, "Lady": 3, "Jonkheer": 3, "Don": 3, "Dona": 3, "Mme": 3, "Capt": 3, "Sir": 3}
for dataset in train_test_data:
    dataset['Title'] = dataset['Title'].map(title_mapping)
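As a quick sanity check (not part of the original recipe, and assuming the train DataFrame from the snippets above), you can look at how the passengers spread across the new numeric Title codes:

# 0 = Mr, 1 = Miss, 2 = Mrs, 3 = rare titles
print(train['Title'].value_counts())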

In Dataiku, it would look something like this.

Name Mapping

Sex Column

Now let's move on to the next column, Sex. Gender data typically comes as 'male' or 'female' values, but the categorical requirement still means converting them to numerical form. In Python, you could do this:

sex_mapping = {"male": 0, "female": 1}
for dataset in train_test_data:
    dataset['Sex'] = dataset['Sex'].map(sex_mapping)

Whereas in the Prepare recipe, you might do something like this to achieve a similar output.

![Gender Mapping](/Dataiku-Titanic/images/mapping-gender.png)

Jupyter Notebooks

Disclaimer

And please remember, this is only a weekend pet project that I'm doing purely out of personal interest.