Machine learning tools come in many shapes and sizes. One of the trends in the
past few years has been the data lake. But what is a data lake, how does it
relate to machine learning and why should you care?
What is a data lake?
Metaphorically a data lake is a giant pool of data from which you can draw insights.
In practice it's a combination of two things: a scalable data repository that can hold
data in any shape, and a tool that is capable of processing the data in that repository.
Both of these components of a data lake need to be scalable. You will be storing
terabytes or even more in that thing, and the tool you put on top of this mountain of data
must be capable of processing such huge amounts. Both also need to be flexible, since you
will be storing several kinds of data; not every record you gather has the same shape.
Well that's cool, but what does that get me?
Having a repository to store all the data in your company is one thing, and having a good
tool to process that data is another. But on their own they don't get you one penny.
First, you need a way to feed data into the repository, which means you need to build
integrations between your software systems and the data lake.
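To make that a bit more concrete: if your lake sits on HDFS, such an integration can be as
simple as a scheduled job that pushes exported files into the cluster. The sketch below is
a minimal, hypothetical example; the export file, the target path and the use of Hadoop's
hdfs dfs command line tool are assumptions about your setup.

    # A minimal sketch of feeding an exported file into an HDFS-based data lake.
    # Paths are purely illustrative; assumes the Hadoop command line tools are installed.
    import subprocess

    local_export = "/exports/orders-2016-05-01.csv"  # hypothetical nightly export
    lake_path = "/lake/sales/2016/"                  # hypothetical folder in the lake

    # 'hdfs dfs -put' copies a local file into HDFS.
    subprocess.run(["hdfs", "dfs", "-put", local_export, lake_path], check=True)

In a real setup you would schedule something like this (or a proper ingestion tool) to run
whenever new data is exported.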
Second, you need something to do with that data. Data stored in a data lake is just that:
a giant pool of data, totally worthless unless you do something with it. This means
you need a business problem you want to solve and people who know how to operate
the tool that is going to process the data in the data lake.
What does a data lake look like exactly?
Since we're talking about a scalable system that is capable of storing and processing data,
you will be looking at something like a scalable database or distributed file system
to store the data in your lake.
When you ask Microsoft about a data lake they will point you towards Azure Data Lake,
which is in fact an implementation of Apache Hadoop with a pretty interface.
Others will do exactly the same: point you to Hadoop and tell you that it is the tool
for building a data lake.
Hadoop is indeed a scalable big data solution. It contains a filesystem provider
called HDFS, a scalable filesystem that you can run across multiple servers,
linking their disks together. The data is replicated and partitioned, so you can be
sure that it is available and reasonably safe on there.
To process the data stored in HDFS you can write MapReduce jobs on top of Hadoop.
This enables you to process huge amounts of stored data in a reasonable amount
of time. Basically you write a program that splits the data into multiple sets,
which are processed by separate servers (the map part). The results are sent to a
central server afterwards and combined to get the complete picture (the reduce part).
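To make the map and reduce parts a bit more concrete, here is a minimal sketch of a Hadoop
Streaming job written in Python. It assumes the lake holds plain text lines of the form
order_id,product,amount; the script names and the data format are made up for illustration.

    #!/usr/bin/env python3
    # mapper.py -- Hadoop Streaming mapper (sketch).
    # Hadoop feeds each server a slice of the input on stdin;
    # the mapper emits "key<TAB>value" pairs, here: product and amount.
    import sys

    for line in sys.stdin:
        order_id, product, amount = line.rstrip("\n").split(",")
        print(f"{product}\t{amount}")

    #!/usr/bin/env python3
    # reducer.py -- Hadoop Streaming reducer (sketch).
    # Hadoop sorts the mapper output by key, so all amounts for one product arrive together.
    import sys

    current_product, total = None, 0.0
    for line in sys.stdin:
        product, amount = line.rstrip("\n").split("\t")
        if product != current_product:
            if current_product is not None:
                print(f"{current_product}\t{total}")
            current_product, total = product, 0.0
        total += float(amount)

    if current_product is not None:
        print(f"{current_product}\t{total}")

You would hand both scripts to Hadoop Streaming, something along the lines of
hadoop jar hadoop-streaming.jar -input /lake/sales -output /lake/sales-totals
-mapper mapper.py -reducer reducer.py (the exact jar location depends on your installation),
and Hadoop distributes the map work and sorts the intermediate results before the reduce step.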
As you can see, Hadoop is a pretty good fit if you think of a data lake as a repository to
store huge amounts of data combined with a tool that is capable of processing that data.
Should I be using a data lake?
Whether you should build a data lake depends on the scenario you want to implement.
As I said before, a big pool of data is rubbish. You need something to do with that data.
In fact, the goal you are trying to achieve dictates whether you should
even build a data lake. Some data shouldn't be dumped in a data lake and processed later.
For example, if you are working with telemetry data such as temperature readings, you are better
off building a stream processing solution. This kind of data requires real-time processing to
be of any use, unless you are working on a climate change measurement solution.
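To show the difference in mindset, here is a toy sketch of stream processing in plain Python
(not an actual streaming framework): every reading is handled the moment it arrives instead of
being parked in a lake for later. The sensor readings and the alert threshold are made up for
illustration.

    # A toy sketch of stream processing: react to each telemetry reading as it arrives.
    # The readings, the threshold and the alert are purely illustrative.
    def handle_reading(reading, threshold=30.0):
        if reading["temperature"] > threshold:
            print(f"ALERT: sensor {reading['sensor']} reports {reading['temperature']} degrees")

    # In a real system the readings would come from a message queue or socket;
    # here we simply simulate a small stream.
    stream = [
        {"sensor": "s1", "temperature": 21.5},
        {"sensor": "s2", "temperature": 31.2},
        {"sensor": "s1", "temperature": 22.0},
    ]

    for reading in stream:
        handle_reading(reading)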
A data lake is only useful when you want to crunch numbers on data that is older. For example,
sales orders from last year are perfect for a data lake solution. They don't change and there is
no immediate need to act on them.
If you want to build a machine learning solution as part of an intelligent system (one where you
actively control some process using your machine learning model), then store that data in a way
that is easier to access.
Hadoop is really slow; we are talking minutes before you get a result. If you need faster results,
use a different tool. Cassandra, for example, is way better at getting data out fast, and Apache
Spark is way better at processing data in memory.
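As an illustration of that last point, here is a minimal PySpark sketch that crunches last year's
sales orders in memory. The file path and the column names (product, amount) are assumptions,
not something taken from a real system.

    # A minimal PySpark sketch: total sales per product, computed in memory.
    # Assumes CSV files with a header row and columns order_id, product, amount.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("sales-per-product").getOrCreate()

    orders = (
        spark.read
        .option("header", True)
        .option("inferSchema", True)
        .csv("hdfs:///lake/sales/2016/*.csv")  # hypothetical location in the lake
    )

    totals = orders.groupBy("product").agg(F.sum("amount").alias("total_amount"))
    totals.show()

    spark.stop()

The aggregation itself is the same kind of work as the MapReduce example above, but Spark keeps
the intermediate data in memory, which is why it typically answers a lot faster than a plain
Hadoop job.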
Final thoughts
I personally feel that data lakes have a use, but don't look at them as the universal solution
to all your big data needs.
My top tip: Look at the scenario and gather the data specifically required for that scenario.
This keeps the data manageable. Be very wary when someone starts to talk about a data lake.
It's not for everyone.
Cheers!