Copying data from On Prem SQL to ADLS with ADF and Biml

Note: This post is about Azure Data Factory V1
Apologies for the overly acronym-laden title as I was trying to keep it concise but descriptive. And we all know that adding technologies to your repertoire means adding more acronyms.

My coworker Levi and I are working on a project where we copy data from an on-premises SQL Server 2014 database and land it in Azure Data Lake Store. Then we use Polybase to get the data into Azure SQL Data Warehouse and build a dimensional model. I’ve done a couple of small projects before with Azure Data Factory, but nothing as large as this one. We had 173 tables that we needed to copy to ADLS. Then we needed to set up incremental loads for 95 of those tables going forward.

My Azure Data Factory is made up of the following components:

Gateway – Allows ADF to retrieve data from an on premises data source
Linked Services – define the connection string and other connection properties for each source and destination
Datasets – Define a pointer to the data you want to process, sometimes defining the schema of the input and output data
Pipelines – combine the data sets and activities and define an execution schedule

Each of these objects is defined in a JSON file. Defining data sets and copy activities in JSON gets very tedious, especially when you need to do this for 100+ tables. Tedium usually indicates a repeatable pattern. If there is a repeatable pattern you can probably automate it. The gateway and linked services are one-time setup activities that weren’t worth automating for this project, but the datasets and pipelines definitely were.

In order to automate the generation of datasets and pipelines, we need a little help with some metadata. We had the client help us fill out an Excel spreadsheet that listed each table in our source database and the following characteristics relevant to the load to Azure:

Frequency (daily or hourly)
Changes Only (incremental or full load)
Changed Time Column (datetime column used for incremental loads)

That list plus the metadata we retrieved from SQL server for each table (column names and data types) were all we needed to automate the creation of the ADF datasets and pipelines with BimlScript.

This post will show how we built the data sets. The following post will show the pipelines with the copy activities.

First we need to generate the input datasets coming from SQL Server. We added some properties at the top and embedded some code nuggets to handle the values that are specific to each table.

Next we need the output datasets for Azure Data Lake Store. We use the same three properties in generating each dataset- schema, table, frequency- and we add one more for scope.

Now we just need another BimlScript file that calls these two files. We broke our pipelines up into daily versus hourly and incremental versus full loads.

We used a helper code file and a separate environments file, which I’m glossing over so we can focus on the Biml for the ADF assets. You can see that we read in the inputs from Excel and write some counts to a log file, just to make sure everything is working as intended. Starting on line 41 is where we generate the datasets. On lines 54 and 55, we use the CallBimlScript function to call the two files above. We end up generating datasets for the tables that are a full load each day and their counterpart datasets for the files we create in ADLS. The datasets for daily incremental loads are generated on lines 69 and 70. Then we do the hourly full loads and hourly incremental loads. I’ll discuss lines 100 – 119 in my next post.

The Results

We were able to write the BimlScript and generate the datasets and pipelines in about 35 hours. A previous ADF project without automation took about 3 hours per source table. If we had gone that route, we could have been looking at 350 – 500 hours to complete this part of the project. Visual Studio with Biml Express took about 5 minutes to generate everything. Deploying to Azure took about an hour. We are now looking into ARM templates for future deployments.

Stay tuned for part 2 where I show how we generated the ADF pipelines.

4 Comments

Pingback: Copying data from On Prem SQL to ADLS with ADF and Biml – Part 2 – Data Savvy

Pingback: Azure Data Factory and SSIS better together with ADF V2 Preview – Jorg Klein's Blog

Hi I don’t get the ExcelReader to work ExcelReader.ReadExcelQuery. I have installed the AccessDatabaseEngine Also where do you get BGHelper from?

Meagan Longoria says:

November 22, 2017 at 10:07 am

BGHelper is a C# file that reads in the file contents from Excel. It contains the Excel Reader. That is not a built-in, but something we wrote to read in the file. I don’t have that file shared anywhere (i was just sharing the snippets related to ADF, but you can easily write your own in C#. All it does is read the contents of the Excel file into a data view.
Also, if you are just starting with Biml for ADF, you might want to know that ADF V2 has a different API and Biml will treat ADF objects and first class citizens so you won’t have to do what I did in this blog post.

Reply

Copying data from On Prem SQL to ADLS with ADF and Biml – Part 1

The Results

4 Comments

Leave a Reply Cancel reply

Subscribe to blog posts

Oh hi there 👋
It’s nice to meet you.

Categories

The Results

Related Posts

Invoking Another Pipeline in Microsoft Fabric

The trade-offs associated with low-code solutions

Join me at the Microsoft Fabric Community Conference with a discount code

4 Comments

Leave a Reply Cancel reply

Subscribe to blog posts

Oh hi there 👋It’s nice to meet you.

Categories

Oh hi there 👋
It’s nice to meet you.