Deploy a modern data warehouse with resource manager templates in minutes

Published 7/10/2019 5:42:59 AM
Filed under Data

Want to quickly try out a modern data warehouse scenario on Azure? Use these templates to get your modern data warehouse set up in minutes.

In the past few weeks I've had a few people ask me for help deploying a modern data warehouse in Azure. I could of course explain exactly how to build one. But since I have a limited number of keystrokes left in my life, I decided to create a resource manager template that you can use to deploy a modern data warehouse. The template follows the Azure reference architecture for a modern data warehouse.

The idea is that you can take the template I created and modify it to your needs. You can even take it into Azure DevOps and deploy it using a release pipeline if you want.

You can get the template from my GitHub account: https://github.com/wmeints/modern-datawarehouse

Here's how I created the template for deploying a modern data warehouse.

Solution architecture

As I mentioned at the start of this post, my template is based on the reference architecture that Microsoft created for a modern data warehouse.

[Figure: Modern data warehouse reference architecture. A modern data warehouse lets you bring together all your data at any scale easily, and get insights through analytical dashboards, operational reports, or advanced analytics for all your users.]

The solution is based entirely on the principle of using a data lake as the central location where all corporate data is stored. Around that, we have several other components that support the use of the data lake.
Data is ingested using a data factory that copies data from sources into the data lake and then applies additional transformations as needed. Typically, you'll use the data factory to implement ELT (Extract, Load, Transform) processes.
Once the data is in the data lake, you can perform additional analysis on the data using Azure Databricks. This tool is also useful in case you need to transform data as part of a data pipeline defined in the data factory. It is a little expensive, but I can't live without its high usability and flexibility.

At the right of the solution, you'll find the SQL data warehouse. I don't use this as I don't need it for my customers. However, it's good to know that it still has a place in the world. You can use the SQL data warehouse to store data in a more structured manner than you would in the data lake.
You can use the data lake as a source of integrated, corporate data for analysis purposes. In contrast, you should use a SQL data warehouse as a source of prepackaged data for reporting. In a sense, it's a model of some of the data in the lake.
You can move data that you've stored in the SQL data warehouse into analysis services for even more powerful modeling options. From there you can take it into PowerBI or another reporting tool that you're using.

I haven't included PowerBI in my template as it depends on many tools for which most people don't have the licenses. You can, of course, add this bit yourself.
Let's take a look at the resource manager template itself first. After that, we'll talk about deploying a modern data warehouse using the resource manager templates.

Creating the resource manager template

Given that we have the reference architecture defined, we could deploy the resources in Azure by hand.

However, that still takes a couple of hours to set up. Also, I can imagine that setting up this many components can be quite hard. So, I figured, why not turn it into a resource manager template?

The template I've created contains one resource group that you can give a name. In this resource group, you'll get the following resources:

  • Azure Data Factory
  • Azure Data Lake Storage Gen2
  • Azure Databricks

I purposefully left out the SQL data warehouse and PowerBI stuff. You can add those to the template if you wish, but I rarely use them and figure that it would make the solution rather more complicated than necessary.

Let's explore a few highlights of the resource manager template so you'll understand a little more of what I've created. First, we'll look at the general structure of the template. After that, we'll explore naming conventions.

Structuring the templates

You can find the resource manager template for the resource group in the root of the repository as azuredeploy.json.

The structure of the template looks like this:

├── shared
│   └── naming-convention.json
├── resources
│   ├── databricks-workspace.json
│   ├── data-lake.json
│   └── data-factory.json
├── parameters.json
├── azuredeploy.json
├── README.md
└── LICENSE

The shared folder contains general-purpose templates. In the case of the modern data warehouse templates, it contains a template that can generate a resource name based on conventions.

In addition to the shared folder there's the resources folder that contains all the resources that need to be deployed. Each file contains a single resource with configurable parameters that are provided by the root template.

Finally, in the root folder, you'll find the azuredeploy.json file, which implements the root template. You'll also find the parameters.json file, which serves as a sample set of parameters that you can feed into the template when deploying to Azure.

Now that you have an idea of the template structure, let's take a closer look at the naming conventions template.

Using naming conventions

When creating resources in Azure, it's generally a good idea to follow some convention for the names of the resources. Since I'm generating resources that are used by the whole company, I'm using the following convention:

[company-name][resource-type][environment][instance-number]

I've created a separate template that generates names according to this convention. It looks like this:

{
    "$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
    "contentVersion": "1.0.1.0",
    "parameters": {
        "companyName": {
            "type": "string",
            "metadata": {
                "description": "The name of the company you're deploying for"
            }
        },
        "environment": {
            "type": "string",
            "allowedValues": ["dev","test","acc","prod"],
            "metadata": {
                "description": "The short name of the environment that you're deploying"
            }
        },
        "resourceTypeAbbreviation": {
            "type": "string",
            "allowedValues": [
                "adf",
                "dl",
                "kv",
                "sa",
                "bricks"
            ],
            "metadata": {
                "description": "The resource type you're deploying"
            }
        },
        "instanceNumber": {
            "type": "int",
            "defaultValue": 1,
            "metadata": {
                "description": "The instance number of the resource that will be used as a suffix"
            }
        }
    },
    "resources": [],
    "outputs": {
        "resourceName": {
            "type": "string",
            "value": "[concat(toLower(parameters('companyName')), toLower(parameters('resourceTypeAbbreviation')), toLower(parameters('environment')), padLeft(string(parameters('instanceNumber')),2,'0'))]"
        }
    }
}

You'll need to supply three values to the template: company name, environment, and resource type abbreviation. The instance number is optional. If you don't specify one, it defaults to 1.

Notice that the resource collection for the template is empty. Instead, this template generates an output variable that defines a conventional resource name.
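To get a feel for the names this produces before deploying anything, the output expression can be mirrored in a few lines of Python. This is only a sketch of the same string logic, not part of the repository: the function name resource_name is made up for illustration, the company name "Contoso" is a placeholder, and the checks compare exact lowercase values, which may be stricter than what the resource manager does.

```python
# Allowed values copied from the naming-convention.json parameters above.
ALLOWED_ENVIRONMENTS = ("dev", "test", "acc", "prod")
ALLOWED_ABBREVIATIONS = ("adf", "dl", "kv", "sa", "bricks")


def resource_name(company_name: str, abbreviation: str, environment: str,
                  instance_number: int = 1) -> str:
    """Mirror the template's output expression (hypothetical helper)."""
    if environment not in ALLOWED_ENVIRONMENTS:
        raise ValueError(f"unknown environment: {environment!r}")
    if abbreviation not in ALLOWED_ABBREVIATIONS:
        raise ValueError(f"unknown resource type: {abbreviation!r}")
    # padLeft(string(n), 2, '0') pads the number with zeros to two characters.
    suffix = str(instance_number).rjust(2, "0")
    # concat(toLower(...), toLower(...), toLower(...), suffix)
    return company_name.lower() + abbreviation.lower() + environment.lower() + suffix


print(resource_name("Contoso", "adf", "dev"))  # contosoadfdev01
```

Because the instance number defaults to 1, leaving it out gives you the 01 suffix.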

To use the resource template, you'll need to include a reference to the naming-convention.json file in the root template:

{
    "name": "dataFactoryName",
    "apiVersion": "2018-05-01",
    "type": "Microsoft.Resources/deployments",
    "location": "[parameters('location')]",
    "properties": {
        "mode": "Incremental",
        "templateLink": {
            "uri": "[variables('namingConventionTemplate')]",
            "contentVersion": "1.0.1.0"
        },
        "parameters": {
            "companyName": {
                "value": "[parameters('companyName')]"
            },
            "resourceTypeAbbreviation": {
                "value": "adf"
            },
            "environment": {
                "value": "[parameters('environment')]"
            }
        }
    }
},

In the fragment above, we're generating a name for the data factory resource. When you run the template, it generates the resource name.

To use the generated name, refer to it from the resource that needs it:

{
    "name": "dataFactoryTemplate",
    "type": "Microsoft.Resources/deployments",
    "apiVersion": "2018-05-01",
    "properties": {
        "mode": "Incremental",
        "templateLink": {
            "uri": "[variables('dataFactoryTemplate')]",
            "contentVersion": "1.0.0.0"
        },
        "parameters": {
            "dataFactoryName": {
                "value": "[reference('dataFactoryName').outputs.resourceName.value]"
            },
            "location": {
                "value": "[parameters('location')]"
            }
        }
    }
}

In this data factory template reference, we provide a parameter called dataFactoryName. The value for this parameter is [reference('dataFactoryName').outputs.resourceName.value]. In other words, we request a reference to the dataFactoryName resource that we deployed earlier and grab the resourceName output from the template.

You can generate names for multiple resources by referencing the same naming-convention.json template multiple times, each time under a different deployment name.
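Since every naming deployment has the same shape, a small script can show what the repeated fragments look like. The sketch below builds the fragment shown earlier for each resource type and prints the result as JSON. The deployment names dataLakeName and databricksName are assumptions made for this example (only dataFactoryName appears above), and the helper naming_deployment is hypothetical.

```python
import json


def naming_deployment(deployment_name: str, abbreviation: str) -> dict:
    """Build one nested deployment that calls naming-convention.json."""
    return {
        "name": deployment_name,
        "apiVersion": "2018-05-01",
        "type": "Microsoft.Resources/deployments",
        "location": "[parameters('location')]",
        "properties": {
            "mode": "Incremental",
            "templateLink": {
                "uri": "[variables('namingConventionTemplate')]",
                "contentVersion": "1.0.1.0",
            },
            "parameters": {
                "companyName": {"value": "[parameters('companyName')]"},
                "resourceTypeAbbreviation": {"value": abbreviation},
                "environment": {"value": "[parameters('environment')]"},
            },
        },
    }


# Deployment name and abbreviation per resource. Only the first pair comes
# from the fragment above; the other two names are assumed for illustration.
NAMES = [
    ("dataFactoryName", "adf"),
    ("dataLakeName", "dl"),
    ("databricksName", "bricks"),
]

deployments = [naming_deployment(name, abbr) for name, abbr in NAMES]
print(json.dumps(deployments, indent=4))
```

Each deployment name must be unique within the template, because that name is what reference() uses to find the right resourceName output.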

Deploying the resource templates

In the previous section, we talked about the structure of the resource manager templates. Now it's time to talk about deploying resources using the resource manager templates.

There are multiple ways in which you can deploy the resources that I've defined in the resource manager templates.

If you have the Azure CLI, you can use the following commands:

az group create -n <group-name> -l <location>
az group deployment create -g <group-name> --template-uri https://raw.githubusercontent.com/wmeints/modern-datawarehouse/master/azuredeploy.json --parameters @parameters.json

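This creates a new resource group and then deploys the resource manager template into that resource group. On recent versions of the Azure CLI, the second command is spelled az deployment group create; the arguments stay the same.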
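If you'd rather drive the two commands from a script, something along these lines could work. It is a rough sketch that assumes the az executable is on your PATH and that you are already logged in. The function names, the resource group name, and the location in the usage note are placeholders, not something the repository provides.

```python
import subprocess

TEMPLATE_URI = (
    "https://raw.githubusercontent.com/wmeints/"
    "modern-datawarehouse/master/azuredeploy.json"
)


def build_commands(group_name: str, location: str,
                   parameters_file: str = "parameters.json") -> list:
    """Return the two az invocations from the post as argument lists."""
    return [
        ["az", "group", "create", "-n", group_name, "-l", location],
        # Newer CLI versions use "az deployment group create" instead.
        ["az", "group", "deployment", "create", "-g", group_name,
         "--template-uri", TEMPLATE_URI,
         "--parameters", "@" + parameters_file],
    ]


def deploy(group_name: str, location: str) -> None:
    """Run both commands in order and stop at the first failure."""
    for command in build_commands(group_name, location):
        subprocess.run(command, check=True)
```

Calling deploy("mdw-dev-rg", "westeurope") would then create the group and start the deployment. Passing check=True makes the script stop if creating the resource group fails, so the deployment never runs against a group that doesn't exist.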

You'll need to have a parameters.json file that looks like this:

{
    "$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentParameters.json#",
    "contentVersion": "1.0.0.0",
    "parameters": {
        "companyName": {
            "value": "<company-name>"
        },
        "environment": {
            "value": "dev"
        }
    }
}

When you run the commands, Azure creates the new resource group and the resource manager deploys the necessary resources into it. The deployment takes a few minutes, so grab a coffee when you want to try this out.
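When you keep one parameters file per environment, generating the file is less error-prone than editing it by hand. Here is a minimal sketch that writes the same structure as the sample above. It assumes the root template accepts the same environment values as the naming convention template, and the helper names build_parameters and write_parameters are made up for this example.

```python
import json

# Assumed to match the allowedValues in naming-convention.json.
ALLOWED_ENVIRONMENTS = ("dev", "test", "acc", "prod")

PARAMETERS_SCHEMA = (
    "https://schema.management.azure.com/schemas/"
    "2015-01-01/deploymentParameters.json#"
)


def build_parameters(company_name: str, environment: str) -> dict:
    """Build the parameters document for one environment."""
    if environment not in ALLOWED_ENVIRONMENTS:
        raise ValueError(f"unknown environment: {environment!r}")
    return {
        "$schema": PARAMETERS_SCHEMA,
        "contentVersion": "1.0.0.0",
        "parameters": {
            "companyName": {"value": company_name},
            "environment": {"value": environment},
        },
    }


def write_parameters(path: str, company_name: str, environment: str) -> None:
    """Write the parameters document to disk as indented JSON."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(build_parameters(company_name, environment), handle, indent=4)
```

For example, write_parameters("parameters.json", "contoso", "dev") produces a file you can pass to the deployment command with --parameters @parameters.json.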

Summary

In the previous sections, we've talked about setting up a modern data warehouse using resource manager templates.

We've seen how vital automation is when you want to create a new environment in Azure and how naming conventions help you keep a clean environment.

I hope you enjoyed this. Please open an issue in the GitHub repo if you run into any problems. I'm happy to help!