Learn how to build flexible machine learning pipelines in scikit-learn

Published 11/7/2017 4:30:00 PM
Filed under Machine Learning

Setting up a machine learning algorithm involves more than the algorithm itself. You need to preprocess the data so it fits the algorithm, and it's this preprocessing pipeline that often requires a lot of work. Building a flexible pipeline is key. Here's how you can build one in Python.

scikit-learn, a very clever toolkit

There are many machine learning packages for Python. One of them is scikit-learn, which you import as sklearn. This toolkit contains many machine learning algorithms and preprocessing tools.

sklearn is the go-to toolkit when you've got something to do with machine learning. It contains a lot of useful things like:

  • Feature extraction tools for turning raw data into features that can be learned from.
  • Preprocessing tools to clean up data or enrich it with additional information.
  • Supervised machine learning algorithms to predict values or classify data.
  • Unsupervised machine learning algorithms to structure data and find patterns.
  • Pipelines to combine the various tools together into a single piece of code.

All of these tools are great for building machine learning applications. But in this post I want to spend some time specifically on the pipeline.

What is a pipeline?

A pipeline in sklearn is a set of chained algorithms to extract features, preprocess them and then train or use a machine learning algorithm.

This is how you create a pipeline in sklearn:

from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_extraction.text import CountVectorizer

pipeline = Pipeline(steps=[
  ('vectorize', CountVectorizer()),       # turn raw text into word count vectors
  ('classify', DecisionTreeClassifier())  # learn to classify the vectorized text
])

Each pipeline has a number of steps, defined as a list of tuples. The first element of each tuple is the name of the step in the pipeline. The second element is the transformer itself.

The final step in the pipeline is the estimator. The estimator can be a classifier, regression algorithm, a neural network or even some unsupervised algorithm.

To train the estimator at the end of the pipeline you have to call the method fit on the pipeline and provide the data to train on.

raw_features = ['Hello world', 'Machine learning is awesome']
raw_labels = [0, 1]

pipeline.fit(raw_features, raw_labels)

Because you're using a pipeline you can put in raw data, which gets preprocessed automatically by the pipeline before running it through the estimator for training.

When you've trained the estimator in the pipeline you can use it to predict an outcome through the predict method.

sample_sentence = ['Hi world']

outcome = pipeline.predict(sample_sentence)

When predicting an outcome the pipeline preprocesses the data before running it through the estimator to predict the outcome.

That's the power of using a pipeline. You don't have to worry about differences in preprocessing between training and prediction. It is automatically done for you.

Building your own custom transformers

All components in sklearn can be used in a pipeline. However, you may need a preprocessing step that isn't already included in the package.

No problem, you can build your own pipeline components:

from sklearn.base import TransformerMixin

class MyCustomStep(TransformerMixin):
  def transform(self, X, **kwargs):
    # Transform the input features and return the result
    return X

  def fit(self, X, y=None, **kwargs):
    # Learn whatever this transformer needs from the training data
    return self

A pipeline component is defined as a class derived from TransformerMixin with two important methods:

  • fit - Uses the input data to train the transformer
  • transform - Takes the input features and transforms them

The fit method is used to train the transformer. Components such as the CountVectorizer use this method to set up the internal mapping from words to vector elements. It gets both the features and the expected output.

The transform method only gets the features that need to be transformed. It returns the transformed features.
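
To make this concrete, here's a minimal sketch of a custom transformer that strips punctuation from text before it reaches the CountVectorizer. The StripPunctuation class and the step name are made up for this example:

import string

from sklearn.base import TransformerMixin

class StripPunctuation(TransformerMixin):
  def fit(self, X, y=None, **kwargs):
    # Nothing to learn for this transformer
    return self

  def transform(self, X, **kwargs):
    # Remove all punctuation characters from every sentence
    table = str.maketrans('', '', string.punctuation)
    return [sentence.translate(table) for sentence in X]

pipeline = Pipeline(steps=[
  ('strip_punctuation', StripPunctuation()),
  ('vectorize', CountVectorizer()),
  ('classify', DecisionTreeClassifier())
])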

Note: The transformers in the pipeline are not allowed to remove records from the input dataset or add records to it. If you need such a feature, you should apply that kind of transformation outside the pipeline.
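
For example, you could filter records before calling fit on the pipeline. Here's a minimal sketch that reuses raw_features and raw_labels from earlier; the empty-text filter is just an example:

# Drop empty sentences before they reach the pipeline, since a
# transformer inside the pipeline may not remove records.
pairs = [(text, label) for text, label in zip(raw_features, raw_labels) if text.strip()]
features, labels = zip(*pairs)

pipeline.fit(list(features), list(labels))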

Do more with the pipeline using grid search

The pipeline is a useful construction for building machine learning software. And with the custom components you can extend its functionality well beyond what's included in the sklearn package.

There's one more cool trick that I want to show you. You can use the pipeline to perform what's called a grid search. With a grid search algorithm you can let the computer automatically discover the optimal hyperparameters for your algorithm.

Hyperparameters are the parameters that you set before training a model. For example, the learning rate for a gradient descent algorithm or the number of neurons in a neural network layer.

With grid search you can specify several variations of these parameters and let the computer train models for these parameter sets to come up with the best options for your algorithm.

Here's what a grid search looks like in sklearn:

from sklearn.model_selection import GridSearchCV

param_grid = [
    { 'classify__max_depth': [5, 10, 15, 20] }
]

# features and labels are your full training set
grid_search = GridSearchCV(pipeline, param_grid=param_grid)
grid_search.fit(features, labels)

First you need to define the set of parameters that you want to try out in the grid search. Each entry is built as follows: you prefix the parameter with the name of the step in the pipeline, append two underscores, and then name the parameter on the component you want to modify. Finally, you add a list of values that you want to test.

Notice that param_grid is a list of dictionaries, so you can test more than one scenario with different parameter sets if you want, as shown below.
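
For example, a param_grid with two scenarios could look like this. The vectorize__lowercase entry targets the lowercase option of the CountVectorizer in the pipeline above:

param_grid = [
    { 'classify__max_depth': [5, 10, 15, 20] },
    { 'classify__max_depth': [5, 10],
      'vectorize__lowercase': [True, False] }
]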

After you've defined the parameters for the grid search, you feed them together with the pipeline to a new instance of the GridSearchCV class.

When you call fit on grid_search it will kick off the search process. After this is done you can look at the results by retrieving the property cv_results_ on the grid_search object.

The cv_results_ property contains a dictionary with the following interesting members:

  • rank_test_score - Contains the ranking for each model in the order that they were executed. The index of the model with rank 1 is the best model.
  • params - Contains the parameters used for each model in the order that they were executed. If you know the best model you can find the parameters for the pipeline at the same index.
  • mean_test_score - Contains the test scores for each model in the order that they were executed. You can take a look here to find out how well the models did.
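
For example, here's a small sketch of how you could look up the winning parameter set once the search has finished, assuming the grid_search object from above:

import numpy as np

results = grid_search.cv_results_

# The model ranked 1 is the best model
best_index = np.argmin(results['rank_test_score'])

print(results['params'][best_index])           # parameters of the best model
print(results['mean_test_score'][best_index])  # its mean test score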

Please note that if you've built your own transformer components, you need to implement one additional method to support the grid search algorithm:

from sklearn.base import TransformerMixin

class MyCustomStep(TransformerMixin):
  def transform(self, X, **kwargs):
    # Transform the input features and return the result
    return X

  def fit(self, X, y=None, **kwargs):
    # Learn whatever this transformer needs from the training data
    return self

  def get_params(self, deep=True):
    # Return the parameters of this component as a dictionary
    return { }

The get_params method returns the parameters of your pipeline component as a dictionary. If you have any parameters that you want the grid search to modify, this is the place to expose them.
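
In practice, the easiest way to get a correct get_params, together with its counterpart set_params that grid search uses to apply the values, is to derive from BaseEstimator and declare your parameters in __init__. A minimal sketch, where the threshold parameter is a made-up example:

from sklearn.base import BaseEstimator, TransformerMixin

class MyConfigurableStep(BaseEstimator, TransformerMixin):
  def __init__(self, threshold=0.5):
    # Parameters declared in __init__ are picked up automatically by the
    # get_params/set_params implementations inherited from BaseEstimator.
    self.threshold = threshold

  def fit(self, X, y=None, **kwargs):
    return self

  def transform(self, X, **kwargs):
    # Hypothetical transformation: binarize each value against the threshold
    return [[1 if value >= self.threshold else 0 for value in row] for row in X]

A grid search could then vary the parameter with an entry such as 'mystep__threshold': [0.25, 0.5, 0.75], assuming the step is registered as mystep in the pipeline.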

Conclusion

I think you will agree with me that the grid search functionality is a nice add-on, and a must if you're trying to find ways to make your machine learning models better.

The pipeline in sklearn is an important construct for using the machine learning algorithms from this package in an ergonomic fashion.

If you haven't done so already, you should definitely take a look at scikit-learn. I can especially recommend the pipeline functionality, since it makes building machine learning code a lot more straightforward.