When working on machine learning problems I find that my productivity sometimes suffers, because I have to spend a lot of time on tedious bits that could probably be automated away. Over time I have found quite a few tools that help me get better results faster. Here are three of them.
Pandas profiling
Pandas is one of the most widely used libraries for loading and processing data in Python. It has a great set of features to perform various statistical operations on your data.
One of these methods is the describe method, which gives you a compact summary of your data in the terminal or inside your Python notebook.
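For example (my_data.csv here is a placeholder for your own dataset):

import pandas as pd

df = pd.read_csv('my_data.csv')  # placeholder dataset
df.describe()  # count, mean, std, min, quartiles, and max per numeric column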
This method is very basic, maybe a little too basic for anyone who's serious about machine learning.
There is an alternative, called Pandas profiling. This library generates a complete report for your dataset, which includes:
- Basic data type information (which columns contain what)
- Descriptive statistics (mean, standard deviation, etc.)
- Quantile statistics (tells you about how your data is distributed)
- Histograms for your data (again, for visualizing distributions)
- Correlations (lets you see what's related)
Using this library is simple:
import pandas as pd
import pandas_profiling

df = pd.read_csv('my_data.csv')

# Renders the full report inline when run in a notebook.
pandas_profiling.ProfileReport(df)
This outputs an HTML report containing all the information mentioned above.
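If you're not working in a notebook, you can also write the report to disk. A minimal sketch (the keyword argument name varies between pandas-profiling versions, so I pass the path positionally):

profile = pandas_profiling.ProfileReport(df)
profile.to_file('report.html')  # writes the same report as a standalone HTML file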
For me, this tool saves a lot of time. Normally I'd spend quite a while typing in all the commands to get the various statistics; now a single one achieves the same result.
Download the tool here: https://github.com/pandas-profiling/pandas-profiling
FeatureTools
Another thing that takes a lot of time when building a model is feature engineering. On an average project you will spend about 80% of your time extracting and transforming data into a proper machine learning dataset.
You can, however, reduce this time by quite a bit if you use the right tools for the job. One such tool is the FeatureTools library.
This tool automatically gathers up features from a bunch of tables and transforms these features into a proper machine learning dataset.
It works like this:
import featuretools as ft

# Each entity maps a name to (dataframe, index column).
entities = {
    'customers': (customers_df, 'customer_id'),
    'orders': (orders_df, 'order_id')
}

# Each relationship is (parent entity, parent key, child entity, child key).
relationships = [
    ('customers', 'customer_id', 'orders', 'customer_id')
]

# Deep feature synthesis: builds one row per instance of the target entity.
feature_matrix, feature_defs = ft.dfs(
    entities=entities,
    relationships=relationships,
    target_entity='customers')
First we define the entities in our database, which in our case are customers and orders. Next we define how orders are related to customers. The customer is the parent entity (one) and orders is the child entity (many).
We then ask FeatureTools to build a dataset for us, choosing customers as the target entity. This tells FeatureTools to produce a dataset with one row per customer, with features engineered from the orders related to each customer.
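To get a feel for the output, you can inspect the results. The feature names in the comment below are hypothetical; they depend on the columns in your actual dataframes:

print(feature_defs)           # e.g. [<Feature: COUNT(orders)>, <Feature: SUM(orders.amount)>, ...]
print(feature_matrix.head())  # one row per customer, one column per generated feature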
As you can see, FeatureTools is especially useful when you have a large database with many tables and you need to extract a machine learning dataset from that database. It also works great on temporal data.
The sample I've shown is rather basic; there's loads more you can do. For example, you can specify exactly which feature engineering primitives the tool should use, such as sum, count, and mean.
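Here's a sketch of that, limiting deep feature synthesis to a few aggregation primitives (note that in FeatureTools the average is the mean primitive):

feature_matrix, feature_defs = ft.dfs(
    entities=entities,
    relationships=relationships,
    target_entity='customers',
    agg_primitives=['sum', 'count', 'mean'])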
Please note that FeatureTools doesn't handle normalization, scaling, and other operations that you would still need to fix common issues with your data.
Still, it's a great tool that's quite extensible and has saved me a lot of time.
You can download it here: https://www.featuretools.com/
LIME
If you're an active machine learning engineer, you may have noticed that more customers than ever ask how a model arrives at a specific decision. They no longer blindly trust the model.
This can be quite a challenge as most models are hard to explain to a customer. But there is a solution.
LIME (Local Interpretable Model-agnostic Explanations) is a tool that lets you explain a decision made by your classification model, whether it's a decision tree, a random forest, or even a neural network.
For example, if you have a neural network that predicts a label for an image, you can explain one of its predictions with the following code:
from lime.lime_image import LimeImageExplainer

explainer = LimeImageExplainer()

# image is a numpy array; keras_model.predict returns class probabilities.
explanation = explainer.explain_instance(
    image,
    keras_model.predict,
    top_labels=5,
    hide_color=0,
    num_samples=1000)
import matplotlib.pyplot as plt
from skimage.segmentation import mark_boundaries
# 295 is the class index we want explained (one of the top labels above).
temp, mask = explanation.get_image_and_mask(
    295,
    positive_only=True,
    num_features=5,
    hide_rest=False)
plt.imshow(mark_boundaries(temp / 2 + 0.5, mask))
First we create a new explainer for our image classifier. Then we let it explain the classification for a specific image. This produces an explanation object that we can visualize.
The visualization is done using matplotlib. Using the mark_boundaries method we can highlight the edges of the area that explains why our image was classified the way it was.
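Rather than hardcoding a class index like 295, you can also ask the explanation object for the labels it covered:

# top_labels is ordered by prediction score; take the most likely class.
top_label = explanation.top_labels[0]
temp, mask = explanation.get_image_and_mask(
    top_label,
    positive_only=True,
    num_features=5,
    hide_rest=False)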
You can find more about LIME here: https://github.com/marcotcr/lime
Give them a shot!
Python has such a wealth of tools that it is hard not to find a good package for your machine learning problem. Did you find the above list useful? Let me know!