Apologies for the overly acronym-laden title as I was trying to keep it concise but descriptive. And we all know that adding technologies to your repertoire means adding more acronyms.
My coworker Levi and I are working on a project where we copy data from an on-premises SQL Server 2014 database and land it in Azure Data Lake Store. Then we use Polybase to get the data into Azure SQL Data Warehouse and build a dimensional model. I’ve done a couple of small projects before with Azure Data Factory, but nothing as large as this one. We had 173 tables that we needed to copy to ADLS. Then we needed to set up incremental loads for 95 of those tables going forward.
My Azure Data Factory is made up of the following components:
- Gateway – Allows ADF to retrieve data from an on premises data source
- Linked Services – define the connection string and other connection properties for each source and destination
- Datasets – Define a pointer to the data you want to process, sometimes defining the schema of the input and output data
- Pipelines – combine the data sets and activities and define an execution schedule
Each of these objects is defined in a JSON file. Defining data sets and copy activities in JSON gets very tedious, especially when you need to do this for 100+ tables. Tedium usually indicates a repeatable pattern. If there is a repeatable pattern you can probably automate it. The gateway and linked services are one-time setup activities that weren’t worth automating for this project, but the datasets and pipelines definitely were.
In order to automate the generation of datasets and pipelines, we need a little help with some metadata. We had the client help us fill out an Excel spreadsheet that listed each table in our source database and the following characteristics relevant to the load to Azure:
- Frequency (daily or hourly)
- Changes Only (incremental or full load)
- Changed Time Column (datetime column used for incremental loads)
That list plus the metadata we retrieved from SQL server for each table (column names and data types) were all we needed to automate the creation of the ADF datasets and pipelines with BimlScript.
This post will show how we built the data sets. The following post will show the pipelines with the copy activities.
First we need to generate the input datasets coming from SQL Server. We added some properties at the top and embedded some code nuggets to handle the values that are specific to each table.
Next we need the output datasets for Azure Data Lake Store. We use the same three properties in generating each dataset- schema, table, frequency- and we add one more for scope.
Now we just need another BimlScript file that calls these two files. We broke our pipelines up into daily versus hourly and incremental versus full loads.
We used a helper code file and a separate environments file, which I’m glossing over so we can focus on the Biml for the ADF assets. You can see that we read in the inputs from Excel and write some counts to a log file, just to make sure everything is working as intended. Starting on line 41 is where we generate the datasets. On lines 54 and 55, we use the CallBimlScript function to call the two files above. We end up generating datasets for the tables that are a full load each day and their counterpart datasets for the files we create in ADLS. The datasets for daily incremental loads are generated on lines 69 and 70. Then we do the hourly full loads and hourly incremental loads. I’ll discuss lines 100 – 119 in my next post.
We were able to write the BimlScript and generate the datasets and pipelines in about 35 hours. A previous ADF project without automation took about 3 hours per source table. If we had gone that route, we could have been looking at 350 – 500 hours to complete this part of the project. Visual Studio with Biml Express took about 5 minutes to generate everything. Deploying to Azure took about an hour. We are now looking into ARM templates for future deployments.
Stay tuned for part 2 where I show how we generated the ADF pipelines.