Building ML Pipelines Using Scikit-learn and Hyper-Parameter Tuning

Data scientists often build machine learning pipelines that involve preprocessing (imputing null values, transforming features, creating new features), modeling, and hyper-parameter tuning. Many transformations have to be applied before modeling, in a particular order. Scikit-learn provides the Pipeline class to perform those transformations in one go.

Pipeline serves multiple purposes here (from documentation):
  • Convenience and encapsulation: You only have to call fit and predict once on your data to fit a whole sequence of estimators.

  • Joint parameter selection: You can grid search over parameters of all estimators in the pipeline at once (hyper-parameter tuning/optimization).

  • Safety: Pipelines help avoid leaking statistics from your test data into the trained model in cross-validation, by ensuring that the same samples are used to train the transformers and predictors.

In this article, I will show you how to:
  • Build a complete pipeline using scikit-learn’s Pipeline module
  • Create custom transformers
  • Tune hyper-parameters across the whole pipeline
For the entire analysis, I am using the Titanic dataset, since most people are already familiar with it. Let’s start by loading all the required libraries:

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler, RobustScaler
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.metrics import f1_score, accuracy_score

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

import xgboost
from xgboost import XGBClassifier

from sklearn.model_selection import GridSearchCV

Load the Titanic dataset and split it into train and test sets:

# Load data
def load_data(PATH):
    data = pd.read_csv(PATH)
    return data

titanic_data = load_data('https://raw.githubusercontent.com/mattdelhey/kaggle-titanic/master/Data/train.csv')

titanic_data.shape # (891, 11)

titanic_data.columns # ['survived', 'pclass', 'name', 'sex', 'age', 'sibsp', 'parch', 'ticket', 'fare', 'cabin', 'embarked']

null_value_count = titanic_data.isnull().sum()
features_with_null_values = null_value_count[null_value_count != 0].index
features_with_null_values # ['age', 'cabin', 'embarked']

# Split data 80:20
train_data, test_data = train_test_split(titanic_data, test_size=0.2, random_state=42)

X_train = train_data.drop(columns=["survived"])
y_train = train_data["survived"]

X_test = test_data.drop(columns=["survived"])
y_test = test_data["survived"]

There are 11 columns in total (10 features plus the target). We can drop the high-cardinality features (['name', 'ticket', 'cabin']) for this analysis, which leaves 7 features: 5 numerical ('pclass', 'age', 'sibsp', 'parch', 'fare') and 2 categorical ('sex', 'embarked'). Numerical and categorical features need different preprocessing, so let’s build a separate pipeline for each.
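
A quick, optional check confirms why these columns are dropped. This is an illustrative snippet, not part of the pipeline itself:

# Nearly every passenger has a distinct name/ticket, and cabin is mostly null,
# so these columns add little signal without heavy feature engineering
X_train[["name", "ticket", "cabin"]].nunique()
X_train["cabin"].isnull().mean()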

Preprocessing Numerical features:
  • Impute null values with median
  • Create new features. We can create 'family_count' by adding 'sibsp' (No. of siblings/spouses) and 'parch' (No. of parents/children)
  • Feature Scaling
Scikit-learn provides a lot of transformers out of the box. For custom processing, e.g. creating new features, we can write our own transformers. Since scikit-learn relies on duck typing (it only checks for the presence of the required methods), creating a custom transformer is as simple as implementing fit, transform and fit_transform in a class.

For the fit() function, you can just return self. The main processing code goes into the transform() function. fit_transform() comes for free if we add TransformerMixin as a base class. We can also add BaseEstimator as a base class, which automatically provides get_params() and set_params() (these are what GridSearchCV uses to tune the pipeline later).

class CreateNewFeatures(BaseEstimator, TransformerMixin):
    def __init__(self, indices):
        # indices: positions of 'sibsp' and 'parch' in the numerical column order
        self.indices = indices
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        """
        Create a new feature 'family_count' by adding 'sibsp' and 'parch'
        """
        sibsp = self.indices[0]
        parch = self.indices[1]

        # X is a NumPy array here (the output of SimpleImputer), so columns are
        # indexed by position; np.c_ appends the new column to the array
        family_count = X[:, sibsp] + X[:, parch]
        X = np.c_[X, family_count]
        return X
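
To see exactly what this transformer does, here is a quick check on a tiny made-up array (illustrative values only). Inside the pipeline, the transformer receives a NumPy array from SimpleImputer, so columns are indexed by position:

# Columns in numerical order: pclass, age, sibsp, parch, fare
demo = np.array([[3, 22.0, 1, 0, 7.25],
                 [1, 38.0, 1, 2, 71.28]])
CreateNewFeatures(indices=[2, 3]).fit_transform(demo)
# appends a 'family_count' column: 1 + 0 = 1 and 1 + 2 = 3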

Let’s build a pipeline for processing numerical features. It’s very easy.

# Positions of 'sibsp' and 'parch' within numerical_columns
# ('pclass', 'age', 'sibsp', 'parch', 'fare')
family_count_indices = [2, 3]

numerical_pipeline = Pipeline([
    ('numerical_imputer', SimpleImputer(strategy='median')),
    ('create_new_features', CreateNewFeatures(family_count_indices)),
    ('feature_scaling', StandardScaler())
])

Preprocessing Categorical features:
  • Impute null values with mode
  • One hot encoding
categorical_pipeline = Pipeline([
    ('categorical_imputer', SimpleImputer(strategy='most_frequent')),
    ('categorical_encoder', OneHotEncoder())
])
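
One caveat worth knowing: with default settings, OneHotEncoder raises an error if the test set contains a category it never saw during fit. If that can happen with your data, a common variation (not used in the rest of this article; the variable name below is just for illustration) is:

# Ignore categories unseen during fit instead of raising an error
categorical_pipeline_safe = Pipeline([
    ('categorical_imputer', SimpleImputer(strategy='most_frequent')),
    ('categorical_encoder', OneHotEncoder(handle_unknown='ignore'))
])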

Column Transformer:

So far we have defined separate transformations for numerical and categorical features. Scikit-learn’s ColumnTransformer lets us apply different transformations to different columns in a single step. Let’s see how:

drop_columns = ["name", "ticket", "cabin"]
numerical_columns = X_train.drop(columns=drop_columns).select_dtypes(exclude = "object").columns
categorical_columns = X_train.drop(columns=drop_columns).select_dtypes(include = "object").columns

# Column Transformer
column_pipeline = ColumnTransformer([
    ("numerical_pipeline", numerical_pipeline, numerical_columns),
    ("categorical_pipeline", categorical_pipeline, categorical_columns)
])

Each tuple takes three inputs: (name, transformer/pipeline, the columns the transformation should be applied to). With ColumnTransformer, the numerical and categorical transformations are applied in parallel on the same dataset. Now let’s create the full pipeline, which first drops the unnecessary features and then transforms the rest.

class DropFeatures(BaseEstimator, TransformerMixin):
    """
    Drop features
    """
    def __init__(self, drop_columns):
        self.drop_columns = drop_columns
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        """
        Drop the given columns (X is still a DataFrame at this point)
        """
        X = X.drop(columns=self.drop_columns)
        return X
    
drop_columns = ["name", "ticket", "cabin"]

# Full pipeline
full_pipeline = Pipeline([
    ('drop_features', DropFeatures(drop_columns)),
    ('column_transformer', column_pipeline)
])

Hurray! We have created the pipeline for preprocessing the input features.

NOTE: While building a pipeline, all estimators except the last one must be transformers (i.e., implement transform). The output of each step is passed as input to the next step in the pipeline.
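
The step names we chose also let us look inside the pipeline at any time, which will matter for hyper-parameter tuning later. A small illustrative check:

# Access individual steps by the names given when building the pipeline
full_pipeline.named_steps["drop_features"]
full_pipeline.named_steps["column_transformer"]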

Now, I will show you the different ways of using the pipelines:
  1. Use pipeline for preprocessing features only
  2. Include modeling in the pipeline
  3. Hyper-parameter tuning
1. Use pipeline for preprocessing features only

Here, we use the pipeline only to preprocess the features and then train a model on the processed data.

# Transform input data
X_train_processed = full_pipeline.fit_transform(X_train)

# Train data using XGBoost
model = xgboost.XGBClassifier(max_depth=4)
model.fit(X_train_processed, y_train)

We use the same pipeline to transform the test data and predict using the trained model.

# Transform test data using full_pipeline
X_test_processed = full_pipeline.transform(X_test)

# Predict on the processed data
y_pred = model.predict(X_test_processed)

# Evaluate on test data
accuracy_score(y_test, y_pred), f1_score(y_test, y_pred) # (0.815, 0.762)

2. Include modeling in the pipeline

In this case, we include the model itself (e.g. DecisionTreeClassifier()) in the pipeline by adding it as a final step after full_pipeline.

# Add DecisionTreeClassifier to the end of processing pipeline
pipeline_modeling = Pipeline([
    ('preprocessing', full_pipeline),
    ('model', DecisionTreeClassifier())
])

Using the pipeline object, we can directly fit and predict on the data.

# Fit data
pipeline_modeling.fit(X_train, y_train)

# Predict on new data
y_pred = pipeline_modeling.predict(X_test)

# Score on new data. Returns the accuracy score
pipeline_modeling.score(X_test, y_test) # 0.782

3. Hyper-parameter tuning

The most interesting part of the article.

Before diving into hyper-parameter tuning, let’s appreciate the power of pipelines here. The first advantage is that we can tune any parameter of any step in the pipeline. The second is that we can also swap entire steps: for example, the categorical pipeline uses OneHotEncoder(), but we could try OrdinalEncoder() instead and let the search choose between them, along with the other parameters.

Let’s define the parameter grid. GridSearchCV accepts either a single dictionary or a list of dictionaries, each describing one family of candidates to try. The parameter names are built from the step names used when creating the pipeline, joined by double underscores (__) to reach down to the parameter you want to tune.
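
If you are ever unsure of the exact key, you can list every valid parameter name first (a quick illustrative check; the keys in the comment are just examples of the naming scheme). The parameter grid that follows shows the full spelled-out names.

# Every tunable parameter of the pipeline, including nested steps, e.g.
# 'model__max_depth' and
# 'preprocessing__column_transformer__numerical_pipeline__numerical_imputer__strategy'
sorted(pipeline_modeling.get_params().keys())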

parameter_grid = [
    {
        "preprocessing__column_transformer__numerical_pipeline__numerical_imputer__strategy": ['median', 'mean'],
        "preprocessing__column_transformer__numerical_pipeline__feature_scaling": [StandardScaler(), RobustScaler()],
        "model": [DecisionTreeClassifier()],
        "model__criterion": ["gini", "entropy"],
        "model__max_depth": [10, 20]
    },
    {
        "preprocessing__column_transformer__categorical_pipeline__categorical_encoder": [OneHotEncoder(), OrdinalEncoder()],
        "model": [RandomForestClassifier()],
        "model__max_depth": [10, 15, 25],
        "model__n_estimators": [100, 200],
        "model__bootstrap": [True, False]
    },
    {
        "model": [XGBClassifier()],
        "model__n_estimators": [10, 50, 100],
        "model__learning_rate": [0.01, 0.1, 1],
        "model__max_depth": [3, 6, 9],
        "model__min_child_weight": [1, 3]
    }
]

Pass parameter_grid to GridSearchCV when initializing it, then call fit() to search for the optimal parameters.

# Initialize grid search
grid_search = GridSearchCV(pipeline_modeling, parameter_grid, cv=5, verbose=0)
# Fit data
grid_search.fit(X_train, y_train)

# Get best estimator
grid_search.best_estimator_

# Score on new data
grid_search.score(X_test, y_test) # 0.821

There is a slight improvement over the earlier scores after the grid search. In this way, we can easily try out different transformations and models and select the best pipeline.
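
It is also worth inspecting what the search actually picked; a short illustrative follow-up:

# Winning combination of preprocessing choices and model
grid_search.best_params_

# Full cross-validation results, best candidates first
pd.DataFrame(grid_search.cv_results_).sort_values("rank_test_score").head()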

TIP: We can build pipelines with different transformations and save them for later reuse, adding new transformations and functions as we go. This is very handy in competitions as well as in personal projects.
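
For example, a fitted pipeline can be persisted with joblib and reloaded later to predict directly on raw data; a minimal sketch (the file name is illustrative):

import joblib

# Save the best fitted pipeline (preprocessing + model) to disk
joblib.dump(grid_search.best_estimator_, "titanic_pipeline.joblib")

# Later: reload and predict on raw, unprocessed rows
reloaded_pipeline = joblib.load("titanic_pipeline.joblib")
reloaded_pipeline.predict(X_test)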

Here is the full code for your reference:

# Import libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler, RobustScaler
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.metrics import f1_score, accuracy_score

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

import xgboost
from xgboost import XGBClassifier

from sklearn.model_selection import GridSearchCV

# Load data
def load_data(PATH):
    data = pd.read_csv(PATH)
    return data

titanic_data = load_data('https://raw.githubusercontent.com/mattdelhey/kaggle-titanic/master/Data/train.csv')

# Split data 80:20
train_data, test_data = train_test_split(titanic_data, test_size=0.2, random_state=42)

X_train = train_data.drop(columns=["survived"])
y_train = train_data["survived"]

X_test = test_data.drop(columns=["survived"])
y_test = test_data["survived"]

# Create new features
class CreateNewFeatures(BaseEstimator, TransformerMixin):
    def __init__(self, indices):
        self.indices = indices
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        """
        Create a new feature 'family_count' by adding 'sibsp' and 'parch'
        """
        sibsp = self.indices[0]
        parch = self.indices[1]
        family_count = X[:, sibsp] + X[:, parch]
        X = np.c_[X, family_count]
        return X

# Pipeline for processing numerical features
family_count_indices = [2, 3]
numerical_pipeline = Pipeline([
    ('numerical_imputer', SimpleImputer(strategy='median')),
    ('create_new_features', CreateNewFeatures(family_count_indices)),
    ('feature_scaling', StandardScaler())
])

# Pipeline for processing categorical features
categorical_pipeline = Pipeline([
    ('categorical_imputer', SimpleImputer(strategy='most_frequent')),
    ('categorical_encoder', OneHotEncoder())
])

drop_columns = ["name", "ticket", "cabin"]
numerical_columns = X_train.drop(columns=drop_columns).select_dtypes(exclude = "object").columns
categorical_columns = X_train.drop(columns=drop_columns).select_dtypes(include = "object").columns

# Column Transformer
column_pipeline = ColumnTransformer([
    ("numerical_pipeline", numerical_pipeline, numerical_columns),
    ("categorical_pipeline", categorical_pipeline, categorical_columns)
])
    
class DropFeatures(BaseEstimator, TransformerMixin):
    """
    Drop features
    """
    def __init__(self, drop_columns):
        self.drop_columns = drop_columns
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        """
        Drop the given columns (X is still a DataFrame at this point)
        """
        X = X.drop(columns=self.drop_columns)
        return X
    
drop_columns = ["name", "ticket", "cabin"]

# Full pipeline
full_pipeline = Pipeline([
    ('drop_features', DropFeatures(drop_columns)),
    ('column_transformer', column_pipeline)
])

# 1. Use pipeline for preprocessing features only

# Transform the input data
X_train_processed = full_pipeline.fit_transform(X_train)

# Train data using XGBoost
model = xgboost.XGBClassifier(max_depth=4)
model.fit(X_train_processed, y_train)

# Transform the data using full_pipeline
X_test_processed = full_pipeline.transform(X_test)

# Predict on the processed data
y_pred = model.predict(X_test_processed)

# Evaluate on test data
accuracy_score(y_test, y_pred), f1_score(y_test, y_pred) #(0.815, 0.762)

# 2. Include modeling in the pipeline

pipeline_modeling = Pipeline([
    ('preprocessing', full_pipeline),
    ('model', DecisionTreeClassifier())
])

# Fit data
pipeline_modeling.fit(X_train, y_train)

# Predict on new data
y_pred = pipeline_modeling.predict(X_test)

# Score on the new data. Returns the accuracy score
pipeline_modeling.score(X_test, y_test) # 0.782

# 3. Hyper-parameter tuning

# Parameter grid
parameter_grid = [
    {
        "preprocessing__column_transformer__numerical_pipeline__numerical_imputer__strategy": ['median', 'mean'],
        "preprocessing__column_transformer__numerical_pipeline__feature_scaling": [StandardScaler(), RobustScaler()],
        "model": [DecisionTreeClassifier()],
        "model__criterion": ["gini", "entropy"],
        "model__max_depth": [10, 20]
    },
    {
        "preprocessing__column_transformer__categorical_pipeline__categorical_encoder": [OneHotEncoder(), OrdinalEncoder()],
        "model": [RandomForestClassifier()],
        "model__max_depth": [10, 15, 25],
        "model__n_estimators": [100, 200],
        "model__bootstrap": [True, False]
    },
    {
        "model": [XGBClassifier()],
        "model__n_estimators": [10, 50, 100],
        "model__learning_rate": [0.01, 0.1, 1],
        "model__max_depth": [3, 6, 9],
        "model__min_child_weight": [1, 3]
    }
]

# Initialize grid search
grid_search = GridSearchCV(pipeline_modeling, parameter_grid, cv=5, verbose=0)
# Fit data
grid_search.fit(X_train, y_train)

# Get best estimator
grid_search.best_estimator_

# Score on new data
grid_search.score(X_test, y_test) # 0.821

That’s it for this post. Thanks for reading till the end. I hope this helps you write better code and achieve good ranks in competitions.

Thank you so much for reading my blog and supporting me. Stay tuned for my next article. If you want to receive email updates, don’t forget to subscribe to my blog.

Follow me here:
GitHub: https://github.com/Abhishekmamidi123
LinkedIn: https://www.linkedin.com/in/abhishekmamidi
Kaggle: https://www.kaggle.com/abhishekmamidi
If you are looking for any specific blog, please do comment in the comment section below.
