In the past, machine learning was exclusively for the pythonistas and data scientists who knew Python or R and had studied statistics. Not anymore: it has become a lot easier for developers to use machine learning, and that goes for .NET developers too.
In April this year, Microsoft launched the new ML.NET library: a library of classic machine learning algorithms implemented in C#. This means that if you are programming in C#, you now have access to a wide range of machine learning algorithms that you can use in your application.
What makes ML.NET so interesting?
Typically, when you want to use machine learning in your application, you have to build something in Python or R. But that means that if you are writing C#, you need to switch to another language to build your machine learning solution.
I personally love the idea of using just one language in a project. It makes it easier for developers on the team to take on any job that we throw at them. Also, switching languages all the time is hard and annoying. I do this a lot, and I can assure you: you have to retrain your muscle memory every time you do it.
ML.NET may be young, but it already has a full set of machine learning algorithms. Even cooler: if someone on the team has built a deep learning model in TensorFlow, you can load that into ML.NET as well.
You still need to learn about the different machine learning problems and how to approach them, but at least you can do that from C#.
Getting started with ML.NET
So how do you start? ML.NET is available as a NuGet package, which you can add to your project in two ways:
dotnet add package Microsoft.ML
Or right-click your project, choose Manage NuGet Packages, and search for `Microsoft.ML`.
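If you prefer editing the project file by hand, the package reference ends up looking something like this (0.6.0 was the current version at the time of writing; use whatever is latest):

<ItemGroup>
  <PackageReference Include="Microsoft.ML" Version="0.6.0" />
</ItemGroup>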
Once the package is added you can start building your machine learning pipeline.
ML.NET uses the pipeline pattern as a way to express the process of training a model. When you build a machine learning solution you typically have these steps in your program:
- Load data from a data source
- Preprocess the data so that it is in a format the algorithm understands
- Use the preprocessed data to train the model
This looks like a pipeline, and that's exactly why ML.NET uses this pattern. Let's take a look at how to build a training pipeline.
The first step is to build a data source for the pipeline. There are a number of loaders you can use; one of them is the `TextLoader`.
var env = new LocalEnvironment();

// Map columns from the CSV file (by zero-based index) to named, typed fields.
var reader = TextLoader.CreateReader(env,
    ctx => (
        lotArea: ctx.LoadFloat(4),
        type: ctx.LoadText(15),
        style: ctx.LoadText(16),
        quality: ctx.LoadFloat(17),
        condition: ctx.LoadFloat(18),
        price: ctx.LoadFloat(80)),
    new MultiFileSource("data/train.csv"), hasHeader: true, separator: ',');
When you call `CreateReader` you need to provide a lambda that tells the loader how to get data from the CSV file into a C# type. You can go strongly typed, or just use C# 7 tuples like in the sample above.
Once you have a data source, you need to convert the data so that it fits the algorithm. Most machine learning algorithms need floating-point input, which we don't yet have for every column at this point in the code. So we need to extend the pipeline.
var estimator = reader.MakeNewEstimator()
    .Append(row => (
        price: row.price,
        style: row.style.OneHotEncoding(), // string values become numeric vectors
        condition: row.condition,
        quality: row.quality,
        lotArea: row.lotArea,
        type: row.type.OneHotEncoding()
    ));
We take the reader and append a new pipeline component to it. This component takes the incoming data from the reader and transforms it. Some properties you can simply copy, as they already have the right format; others you have to change.
For example, take a look at the style property. In my dataset it is encoded as a string representing different styles of homes, but the machine learning algorithm expects a number. So we tell the pipeline we want our `style` property to be one-hot encoded. The output then contains numbers representing the different values in my dataset.
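To make that concrete, here is a hand-rolled sketch of what one-hot encoding does conceptually. This is for illustration only, with made-up style values; the pipeline component does all of this for you:

using System;
using System.Linq;

// Each distinct string value gets its own column in the output vector.
var styles = new[] { "1Story", "2Story", "SFoyer" }; // hypothetical values

float[] OneHot(string value) =>
    styles.Select(s => s == value ? 1f : 0f).ToArray();

// OneHot("2Story") -> [0, 1, 0]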
Now that we have the values encoded, we need to make sure we have a target value to train on and a set of features to feed into the machine learning algorithm. For this you need another step in the pipeline:
// Append returns a new estimator, so capture the result in a new variable.
var featurized = estimator.Append(row => (
    price: row.price,
    features: row.condition.ConcatWith(
        row.quality, row.style, row.type, row.lotArea).Normalize()
));
The price is copied to the output of the pipeline step. The features, however, are made by concatenating the different properties from the input into a single vector. Each column in the output vector `features` represents a property we loaded earlier.
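Conceptually, the `features` vector for a single house then looks something like this (the one-hot columns expand to one column per distinct value in the dataset):

// features = [ condition, quality,
//              style_0 .. style_k,   // one-hot encoded style
//              type_0 .. type_m,     // one-hot encoded type
//              lotArea ]
//
// Normalize() then rescales the columns so that a large-valued column
// like lotArea doesn't drown out the small-valued ones during training.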
Now on to the final step: adding the component that predicts the price of a house.
var regressionContext = new RegressionContext(env);
var pipeline = featurized.Append(row => (
    price: row.price,
    predictedPrice: regressionContext.Trainers.Sdca(
        row.price,
        row.features,
        loss: new SquaredLoss())
));
To predict the price of a house we use a regression context. The regression context contains logic to build machine learning models that are used for regression purposes.
The `predictedPrice` is produced by the `Sdca` trainer that you initialize in this step, and you need to copy the original price to the output as well. To train the model you feed in the expected price and the features from the previous step. You also need to provide a loss function, which is used during the optimization of the machine learning model.
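The squared loss itself is nothing exotic. Conceptually it boils down to this little function (a sketch of the math, not the actual ML.NET implementation):

// The squared loss penalizes large errors much more heavily than small ones.
float SquaredLoss(float predicted, float actual)
{
    var error = predicted - actual;
    return error * error;
}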
To be able to evaluate the model you need to return the expected value and the predicted value in this step. We'll get back to this later on.
Now that you have a pipeline, you can train it using the `Fit` method.
var data = reader.Read(new MultiFileSource("data/train.csv"));
var model = pipeline.Fit(data);
First we load the data using the reader, then we feed it into the estimator pipeline we built earlier. This produces a trained model.
If you want to know how well the model performs, you need to use the evaluation logic provided by the regression context.
var metrics = regressionContext.Evaluate(model.Transform(data),
row => row.price, row => row.predictedPrice);
The `Evaluate` method needs the real price and the predicted price. You provide these by calling the `Transform` method on the trained model, and then pointing `Evaluate` at the fields for the real value and the predicted value.
The output of the `Evaluate` method is a set of metrics, containing the error rate for your model and other relevant metrics.
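For example, you can print a few of them to get a feel for how far off the model is. I'm assuming here that the metrics object exposes properties like `Rms` and `RSquared`; double-check the names against the version you're using:

// Lower RMS means smaller prediction errors; R-squared closer to 1 is better.
Console.WriteLine($"RMS: {metrics.Rms}");
Console.WriteLine($"R-squared: {metrics.RSquared}");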
Once a model is trained you can use it to make predictions. When making a prediction you still need the preprocessing steps from earlier; the only difference is that the data doesn't come from a data source like a text file, but from your application.
class House
{
public float condition;
public float quality;
public string style;
public string type;
public float lotArea;
public float price;
}
class PredictedPrice
{
public float predictedPrice;
}
var predictor = model.AsDynamic.MakePredictionFunction<House, PredictedPrice>(env);
You create a prediction function by calling `AsDynamic` and then `MakePredictionFunction` with the type information for the input and the output.
The properties on both of these types should match the fields you use in your pipeline. In my case, I need to provide all the properties I used to train the model. Additionally, I created a `PredictedPrice` class with a `predictedPrice` property that matches the `predictedPrice` field I created at the end of the pipeline.
The predictor can now be used to make a prediction:
var prediction = predictor.Predict(new House
{
lotArea = 8450,
type = "1Fam",
style = "2Story",
quality = 7,
condition = 5
});
Console.WriteLine($"Predicted price: {prediction.predictedPrice}");
That's it, a strongly typed machine learning pipeline to predict housing prices.
Get adventuring, try it out!
This is the part where I have to warn you about the state of ML.NET. It is still in preview; version 0.6 is on NuGet at the moment. That means you can use it, but things will change and break.
It's probably not the right choice for your next production project, but it's great for experiments and for projects that are just starting out today.
So get adventuring and try it out!