Last week I taught a group of colleagues about machine learning. The goal of
the training was to open up the black box and learn more about what you can do
with machine learning. We also covered some of the discussions that arise when
you start to use machine learning.
There are a lot of things you need to think about when you start to apply machine
learning. It's not impossibly hard, but there is a lot to keep in mind.
In this post I will discuss 5 tips that can help to improve your machine learning
solution.
Tip 1: Visualize before you start
When you start to work on a new piece of software that uses machine learning,
you typically focus on the problem and on an algorithm that can learn a model
to solve it.
It is pretty easy to get lost in finding the right algorithm and model for your
problem.
It may sound easy: use a binary classifier when you have to predict whether something is positive or negative.
But before you dive into training the classifier, consider the data. Visualize
the data first and check what it looks like.
Is the data good enough to train a binary classifier with the algorithm you have
in mind? Do you have enough data? Questions like these are important.
Using the wrong data makes a model that is essentially wrong to begin with even worse. So before you start to program, check your data!
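A first look at the data does not need a plotting library. The sketch below is one possible way to do it with only the Python standard library: it reports how the labels are balanced and prints a crude text histogram of one feature. The helper names (class_balance, text_histogram) and the toy data are made up for illustration; in practice you would point this at your own dataset or use a real plotting tool.

```python
from collections import Counter


def class_balance(labels):
    """Return the fraction of samples per class label."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


def text_histogram(values, bins=5, width=40):
    """Build a crude text histogram of numeric values, one line per bin."""
    low, high = min(values), max(values)
    step = (high - low) / bins or 1.0  # avoid a zero step when all values are equal
    counts = [0] * bins
    for value in values:
        # Clamp so the maximum value lands in the last bin.
        index = min(int((value - low) / step), bins - 1)
        counts[index] += 1
    peak = max(counts)
    lines = []
    for i, count in enumerate(counts):
        bar = "#" * round(width * count / peak)
        lines.append(f"{low + i * step:8.2f} | {bar} ({count})")
    return lines


# Hypothetical toy data: binary labels and a single numeric feature.
labels = [1, 0, 0, 0, 0, 0, 0, 0, 1, 0]
feature = [0.1, 0.4, 0.35, 0.5, 0.45, 0.42, 0.38, 0.41, 9.7, 0.44]

balance = class_balance(labels)
histogram = text_histogram(feature)

print(balance)
print("\n".join(histogram))
```

In this toy data only 20% of the samples are positive, and one feature value sits far away from the rest. Both are things you want to know about before you pick an algorithm.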
Tip 2: Use the simplest model possible
When you have the right data and you know what you can do with it to build a
model for your problem, there's the challenge of picking a model that is
simple, yet good enough to fit the problem you're trying to solve.
Almost every machine learning problem can be represented using a neural network,
but that doesn't mean you should. Neural networks are hard to train
and maintain. Most of the time a simpler model is better.
So when you work on a machine learning problem, pick a model that is simple to
understand and work with. But beware: don't make it too simple, or you will be
jumping to the wrong conclusions.
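One way to find out whether a model is too simple or just simple enough is to compare it against an even simpler baseline. The sketch below assumes a binary problem with a single numeric feature. It compares a majority-class baseline with a one-feature threshold rule. The function names and the toy data are invented for this example and are not taken from any library.

```python
from collections import Counter


def majority_baseline(train_labels):
    """Return a predictor that always answers with the most common training label."""
    most_common = Counter(train_labels).most_common(1)[0][0]
    return lambda x: most_common


def fit_stump(train_x, train_y):
    """Fit a one-feature threshold rule: predict 1 when x >= threshold."""
    best_threshold, best_correct = None, -1
    for threshold in sorted(set(train_x)):
        correct = sum((x >= threshold) == bool(y) for x, y in zip(train_x, train_y))
        if correct > best_correct:
            best_threshold, best_correct = threshold, correct
    return lambda x: int(x >= best_threshold)


def accuracy(predict, xs, ys):
    """Fraction of samples for which the prediction matches the label."""
    return sum(predict(x) == y for x, y in zip(xs, ys)) / len(ys)


# Hypothetical toy data, split into a training and a test set.
train_x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
train_y = [0, 0, 0, 0, 0, 1, 1, 1]
test_x = [1.5, 2.5, 5.5, 6.5, 7.5]
test_y = [0, 0, 0, 1, 1]

baseline = majority_baseline(train_y)
stump = fit_stump(train_x, train_y)

baseline_accuracy = accuracy(baseline, test_x, test_y)
stump_accuracy = accuracy(stump, test_x, test_y)

print("baseline:", baseline_accuracy)
print("stump:   ", stump_accuracy)
```

If a more complex model cannot clearly beat a rule like this on held-out data, the extra complexity is probably not paying for itself.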
Tip 3: Prefer multiple models over one big model
A simple model is usually the best choice when you want to solve a problem. But
sometimes you need a more complex model.
Some problems cannot be expressed in a single model. When you get to this point,
you are still better off using simple models.
However, instead of one model, try to split the problem into multiple separate
subproblems and use multiple models to build a solution for those problems.
Simpler models are usually easier to tune, and when one model is wrong you can replace it with another model to improve the situation.
Problems in a solution with multiple machine learning algorithms are easier to
isolate and solve, because there are fewer factors involved.
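To make the idea of splitting a problem a bit more concrete, here is a minimal sketch of a hypothetical ticket router built from two small models instead of one three-way classifier. Everything in it is an assumption made for illustration: the routing scenario, the TicketRouter class, and the keyword rules that stand in for trained models.

```python
from typing import Callable, Iterable

# A "model" here is anything that maps a text to a yes/no answer.
Model = Callable[[str], bool]


def keyword_model(keywords: Iterable[str]) -> Model:
    """Stand-in for a trained classifier: answers True if any keyword occurs."""
    vocabulary = {keyword.lower() for keyword in keywords}

    def predict(text: str) -> bool:
        return any(word in vocabulary for word in text.lower().split())

    return predict


class TicketRouter:
    """Route a ticket using two small models that can be replaced independently."""

    def __init__(self, is_complaint: Model, is_urgent: Model) -> None:
        self.is_complaint = is_complaint
        self.is_urgent = is_urgent

    def route(self, text: str) -> str:
        # Subproblem 1: is this a complaint at all?
        if not self.is_complaint(text):
            return "general"
        # Subproblem 2: only asked for complaints.
        return "urgent" if self.is_urgent(text) else "complaints"


router = TicketRouter(
    is_complaint=keyword_model(["broken", "refund", "late"]),
    is_urgent=keyword_model(["today", "immediately"]),
)

first = router.route("my order is broken and I need it today")
second = router.route("where can I find the manual")

# Swap out only the urgency model; the complaint model stays untouched.
router.is_urgent = lambda text: False
third = router.route("my order is broken and I need it today")

print(first, second, third)
```

When the routing goes wrong, you can test each model separately and replace only the one that misbehaves.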
Tip 4: When it comes to validation, math is good, but experimentation is better
Validation of machine learning models is usually done by applying math. You can get quite far with this strategy. But it is not the end goal.
Make sure that when you have a model for your problem, you check the predictions of your model with real people.
Are you getting the results you expected to get? Are users of your solution happy with the predictions?
Math gives you a good sense of direction, but the final check should always be done by a human. It is too easy to get the outcome wrong, even with a good validation method.
There are so many directions to go in with machine learning that, at the beginning of the design process, there's no real way to know whether the solution you are working on is a good one. The math can be quite right in this situation, but the users of your solution will be the only ones who catch a wrong direction.
Tip 5: All models are wrong, but some are useful
Remember, all models are wrong, but some are useful. And because all models are wrong, you can never be sure that your software is doing the right thing.
Make your model more useful. Spend some time thinking about how wrong is acceptable for you. Sometimes it's perfectly acceptable that your binary classifier has a high precision and a low recall, even if the theory suggests that you should aim for a balance between the two.
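The trade-off between precision and recall is easy to see with a few numbers. The sketch below assumes a classifier that outputs a score between 0 and 1, and shows what happens when you move the decision threshold. The scores and labels are made-up toy values, and precision_recall is a helper written for this example.

```python
def precision_recall(y_true, scores, threshold):
    """Compute precision and recall when predicting positive for score >= threshold."""
    predicted = [score >= threshold for score in scores]
    true_pos = sum(1 for p, t in zip(predicted, y_true) if p and t == 1)
    false_pos = sum(1 for p, t in zip(predicted, y_true) if p and t == 0)
    false_neg = sum(1 for p, t in zip(predicted, y_true) if not p and t == 1)
    # Fall back to 0.0 when a ratio is undefined (nothing predicted or no positives).
    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    return precision, recall


# Hypothetical toy data: true labels and the scores a classifier assigned.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.95, 0.9, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1]

balanced = precision_recall(y_true, scores, threshold=0.5)
strict = precision_recall(y_true, scores, threshold=0.8)

print("threshold 0.5:", balanced)
print("threshold 0.8:", strict)
```

With the stricter threshold every positive prediction is correct, but half of the actual positives are missed. Whether that is acceptable depends on what a missed positive costs compared to a false alarm.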
Talk to your customer. Explain what the options are and what you think they should do. This will be the most important thing when you design a machine learning solution.