Azure, Azure Data Factory, Biml, Data Warehousing, Microsoft Technologies

Copying data from On Prem SQL to ADLS with ADF and Biml – Part 1

Note: This post is about Azure Data Factory V1
Apologies for the overly acronym-laden title as I was trying to keep it concise but descriptive. And we all know that adding technologies to your repertoire means adding more acronyms.

My coworker Levi and I are working on a project where we copy data from an on-premises SQL Server 2014 database and land it in Azure Data Lake Store. Then we use PolyBase to get the data into Azure SQL Data Warehouse and build a dimensional model. I’ve done a couple of small projects before with Azure Data Factory, but nothing as large as this one. We had 173 tables that we needed to copy to ADLS. Then we needed to set up incremental loads for 95 of those tables going forward.

My Azure Data Factory is made up of the following components:

  • Gateway – allows ADF to retrieve data from an on-premises data source
  • Linked Services – define the connection string and other connection properties for each source and destination
  • Datasets – define a pointer to the data you want to process, sometimes defining the schema of the input and output data
  • Pipelines – combine the datasets and activities and define an execution schedule

Each of these objects is defined in a JSON file. Defining datasets and copy activities in JSON gets very tedious, especially when you need to do this for 100+ tables. Tedium usually indicates a repeatable pattern, and if there is a repeatable pattern, you can probably automate it. The gateway and linked services are one-time setup activities that weren’t worth automating for this project, but the datasets and pipelines definitely were.

In order to automate the generation of datasets and pipelines, we need a little help with some metadata. We had the client help us fill out an Excel spreadsheet that listed each table in our source database and the following characteristics relevant to the load to Azure:

  • Frequency (daily or hourly)
  • Changes Only (incremental or full load)
  • Changed Time Column (datetime column used for incremental loads)

That list, plus the metadata we retrieved from SQL Server for each table (column names and data types), was all we needed to automate the creation of the ADF datasets and pipelines with BimlScript.
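
The column names and data types can be pulled straight from the source database. The snippet below is only an illustrative sketch, not our actual code (the server, database, and connection string are placeholders); it shows one way to read that metadata from INFORMATION_SCHEMA.COLUMNS in C#:

// Illustrative sketch: list column names and data types for every table
// in the source database. The connection string is a placeholder.
using System;
using System.Data.SqlClient;

class ColumnMetadataReader
{
    static void Main()
    {
        string connectionString = "Server=MyOnPremServer;Database=MySourceDB;Integrated Security=true;";
        string query = @"SELECT TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME, DATA_TYPE
                         FROM INFORMATION_SCHEMA.COLUMNS
                         ORDER BY TABLE_SCHEMA, TABLE_NAME, ORDINAL_POSITION;";

        using (SqlConnection conn = new SqlConnection(connectionString))
        using (SqlCommand cmd = new SqlCommand(query, conn))
        {
            conn.Open();
            using (SqlDataReader reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                {
                    // Each row is one column definition we can use when generating datasets.
                    Console.WriteLine("{0}.{1} | {2} ({3})",
                        reader["TABLE_SCHEMA"], reader["TABLE_NAME"],
                        reader["COLUMN_NAME"], reader["DATA_TYPE"]);
                }
            }
        }
    }
}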

This post will show how we built the datasets. The next post will show the pipelines with the copy activities.

First we need to generate the input datasets coming from SQL Server. We added some properties at the top and embedded some code nuggets to handle the values that are specific to each table.


<#@ property name="schema" type="string" #>
<#@ property name="table" type="string" #>
<#@ property name="frequency" type = "string" #>
{
    "$schema": "http://datafactories.schema.management.azure.com/schemas/2015-09-01/Microsoft.DataFactory.Table.json",
    "name": "DS_OnPremSQL_MySourceDB_<#=schema#>_<#=table#>",
    "properties": {
        "type": "SqlServerTable",
        "linkedServiceName": "LS_OnPremSQL_MySourceDB",
        "structure": [],
        "external": true,
        "typeProperties": {
            "tableName": "<#=schema#>.<#=table#>"
        },
        "availability": {
<# if (frequency == "Daily") {#>
            "frequency": "Day",
<#} else if (frequency == "Hourly") {#>
            "frequency": "Hour",
            "style": "EndOfInterval",
<# } #>
            "interval": 1
        }
    }
}


<#@ property name="schema" type="string" #>
<#@ property name="table" type="string" #>
<#@ property name="frequency" type = "string" #>
<#@ property name="scope" type="string" #>
{
"$schema": "http://datafactories.schema.management.azure.com/schemas/2015-09-01/Microsoft.DataFactory.Table.json&quot;,
"name": "DS_DataLake_ClawfootSupply_<#=schema#>_<#=table#>",
"properties": {
"type": "AzureDataLakeStore",
"linkedServiceName": "LS_ADLS_SHDataLake",
"structure": [ ],
"typeProperties": {
"folderPath": "/raw/clawfootsupply/<#=schema#>_<#=table#>/<#=scope.ToLower()#>/{Year}/{Month}/{Day}/",
"fileName": "<#=schema#>_<#=table#>_{Hour}{Minute}{Second}.txt",
"format": {
"type": "TextFormat",
"columnDelimiter": "|",
"quoteChar": "\"",
"nullValue": ""
},
"partitionedBy": [
{
"name": "Year",
"value": {
"type": "DateTime",
"date": "SliceStart",
"format": "yyyy"
}
},
{
"name": "Month",
"value": {
"type": "DateTime",
"date": "SliceStart",
"format": "%M"
}
},
{
"name": "Day",
"value": {
"type": "DateTime",
"date": "SliceStart",
"format": "%d"
}
},
{
"name": "Hour",
"value": {
"type": "DateTime",
"date": "SliceStart",
"format": "hh"
}
},
{
"name": "Minute",
"value": {
"type": "DateTime",
"date": "SliceStart",
"format": "mm"
}
},
{
"name": "Second",
"value": {
"type": "DateTime",
"date": "SliceStart",
"format": "ss"
}
}
]
},
"availability": {
<# if (frequency == "Daily") {#>
"frequency": "Day",
<#} else if (frequency == "Hourly") {#>
"frequency": "Hour",
"style": "EndOfInterval",
<# } #>
"interval": 1
}
}
}


Now we just need another BimlScript file that calls these two files. We broke our pipelines up into daily versus hourly and incremental versus full loads.

We used a helper code file and a separate environments file, which I’m glossing over here so we can focus on the Biml for the ADF assets. You can see that we read the inputs from Excel and write some counts to a log file, just to make sure everything is working as intended. The dataset generation starts on line 41. On lines 54 and 55, we use the CallBimlScript function to call the two files above. We generate datasets for the tables that get a full load each day, along with their counterpart datasets for the files we create in ADLS. The datasets for the daily incremental loads are generated on lines 69 and 70, and then we do the same for the hourly full loads and hourly incremental loads. I’ll discuss lines 100 through 119 in my next post.


<#@ template tier="10" #>
<#@ import namespace="System.Data" #>
<#@ import namespace="System.Text" #>
<#@ code file="BGHelper.cs" #>
<#@ import namespace="BGHelper" #>
<Biml xmlns="http://schemas.varigence.com/biml.xsd">
</Biml>
<#
string mdFilePath = "C:\\Users\\admin\\Source\\Workspaces\\SH Data Warehouse\\metadata";
string mdFileName = "TargetTableMetadata.xlsx";
string mdWorkSheetName = "Metadata$";
bool mdHasHeader = true;
string logPath = "C:\\Users\\admin\\Source\\Workspaces\\Data Warehouse\\data_factory\\generate_data_factory_biml\\log.txt";
string adfProjPath = "C:\\Users\\admin\\Source\\Workspaces\\Data Warehouse\\data_factory\\data_factory\\";
DataSet ds = new DataSet();
ds = ExcelReader.ReadExcelQuery(mdFilePath, mdFileName, mdWorkSheetName, mdHasHeader);
System.IO.File.AppendAllText(@logPath, "MetaData File Path: " + System.IO.Path.Combine(mdFilePath, mdFileName).ToString() +Environment.NewLine );
System.IO.File.AppendAllText(@logPath, "MetaData File Path: " + System.IO.File.Exists(System.IO.Path.Combine(mdFilePath, mdFileName)).ToString() +Environment.NewLine );
System.IO.File.AppendAllText(@logPath, "Dataset table count: " + ds.Tables.Count.ToString() + Environment.NewLine);
DataView dailyFulls = new DataView(ds.Tables["Metadata"],"Frequency = 'Daily' and [Changes Only] = 'No'","", DataViewRowState.CurrentRows);
DataView dailyDeltas = new DataView(ds.Tables["Metadata"], "Frequency = 'Daily' and [Changes Only] = 'Yes'", "", DataViewRowState.CurrentRows);
DataView hourlyFulls = new DataView(ds.Tables["Metadata"], "Frequency = 'Hourly' and [Changes Only] = 'No'", "", DataViewRowState.CurrentRows);
DataView hourlyDeltas = new DataView(ds.Tables["Metadata"], "Frequency = 'Hourly' and [Changes Only] = 'Yes'", "", DataViewRowState.CurrentRows);
//log count of results for each filter
System.IO.File.AppendAllText(@logPath, "Daily Fulls Count: " + dailyFulls.Count.ToString() + Environment.NewLine);
System.IO.File.AppendAllText(@logPath, "Daily Deltas Count: " + dailyDeltas.Count.ToString() + Environment.NewLine);
System.IO.File.AppendAllText(@logPath, "Hourly Fulls Count: " + hourlyFulls.Count.ToString() + Environment.NewLine);
System.IO.File.AppendAllText(@logPath, "Hourly Deltas Count: " + hourlyDeltas.Count.ToString() + Environment.NewLine);
//Generate datasets
foreach (DataRowView rowView in dailyFulls)
{
DataRow row = rowView.Row;
string schemaName = row["SchemaName"].ToString();
string tableName = row["TableName"].ToString();
string frequency = row["Frequency"].ToString();
string scope = "full";
System.IO.File.AppendAllText(@logPath, "DailyFulls | " + row["SchemaName"].ToString() + "." + row["TableName"].ToString() + " | " + row["ColumnListForSelect"].ToString() + Environment.NewLine);
System.IO.File.WriteAllText(@adfProjPath + "DS_OnPremSQL_MySourceDB_" + schemaName + "_" + tableName + ".json", CallBimlScript("DS_OnPremSQL_MySourceDB_Schema_Table.biml", schemaName, tableName, frequency));
System.IO.File.WriteAllText(@adfProjPath + "DS_DataLake_MySourceDB_" + schemaName + "_" + tableName + ".json", CallBimlScript("DS_DataLake_MySourceDB_Schema_Table.biml", schemaName, tableName, frequency, scope));
}
foreach (DataRowView rowView in dailyDeltas)
{
DataRow row = rowView.Row;
string schemaName = row["SchemaName"].ToString();
string tableName = row["TableName"].ToString();
string frequency = row["Frequency"].ToString();
string scope = "deltas";
System.IO.File.AppendAllText(@logPath, "DailyFulls | " + row["SchemaName"].ToString() + "." + row["TableName"].ToString() + " | " + row["ColumnListForSelect"].ToString() + Environment.NewLine);
System.IO.File.WriteAllText(@adfProjPath + "DS_OnPremSQL_MySourceDB_" + schemaName + "_" + tableName + ".json", CallBimlScript("DS_OnPremSQL_MySourceDB_Schema_Table.biml", schemaName, tableName, frequency));
System.IO.File.WriteAllText(@adfProjPath + "DS_DataLake_MySourceDB_" + schemaName + "_" + tableName + ".json", CallBimlScript("DS_DataLake_MySourceDB_Schema_Table.biml", schemaName, tableName, frequency, scope));
}
foreach (DataRowView rowView in hourlyFulls)
{
DataRow row = rowView.Row;
string schemaName = row["SchemaName"].ToString();
string tableName = row["TableName"].ToString();
string frequency = row["Frequency"].ToString();
string scope = "full";
System.IO.File.AppendAllText(@logPath, "DailyFulls | " + row["SchemaName"].ToString() + "." + row["TableName"].ToString() + " | " + row["ColumnListForSelect"].ToString() + Environment.NewLine);
System.IO.File.WriteAllText(@adfProjPath + "DS_OnPremSQL_MySourceDB_" + schemaName + "_" + tableName + ".json", CallBimlScript("DS_OnPremSQL_MySourceDB_Schema_Table.biml", schemaName, tableName, frequency));
System.IO.File.WriteAllText(@adfProjPath + "DS_DataLake_MySourceDB_" + schemaName + "_" + tableName + ".json", CallBimlScript("DS_DataLake_MySourceDB_Schema_Table.biml", schemaName, tableName, frequency, scope));
}
foreach (DataRowView rowView in hourlyDeltas)
{
DataRow row = rowView.Row;
string schemaName = row["SchemaName"].ToString();
string tableName = row["TableName"].ToString();
string frequency = row["Frequency"].ToString();
string scope = "deltas";
System.IO.File.AppendAllText(@logPath, "DailyFulls | " + row["SchemaName"].ToString() + "." + row["TableName"].ToString() + " | " + row["ColumnListForSelect"].ToString() + Environment.NewLine);
System.IO.File.WriteAllText(@adfProjPath + "DS_OnPremSQL_MySourceDB_" + schemaName + "_" + tableName + ".json", CallBimlScript("DS_OnPremSQL_MySourceDB_Schema_Table.biml", schemaName, tableName, frequency));
System.IO.File.WriteAllText(@adfProjPath + "DS_DataLake_MySourceDB_" + schemaName + "_" + tableName + ".json", CallBimlScript("DS_DataLake_MySourceDB_Schema_Table.biml", schemaName, tableName, frequency, scope));
}
// Generate pipelines
System.IO.File.WriteAllText(@adfProjPath + "PL_Copy_MySourceDBToADLS_Daily_Full.json", CallBimlScript("PL_Copy_MySourceDBToADLS.biml", dailyFulls, "Daily", "Full"));
System.IO.File.WriteAllText(@adfProjPath + "PL_Copy_MySourceDBToADLS_Daily_Deltas.json", CallBimlScript("PL_Copy_MySourceDBToADLS.biml", dailyDeltas, "Daily", "Deltas"));
System.IO.File.WriteAllText(@adfProjPath + "PL_Copy_MySourceDBToADLS_Hourly_Full.json", CallBimlScript("PL_Copy_MySourceDBToADLS.biml", hourlyFulls, "Hourly", "Full"));
System.IO.File.WriteAllText(@adfProjPath + "PL_Copy_MySourceDBToADLS_Hourly_Deltas.json", CallBimlScript("PL_Copy_MySourceDBToADLS.biml", hourlyDeltas, "Hourly", "Deltas"));
//Generate One-Time Pipelines
System.IO.File.WriteAllText(@adfProjPath + "PL_Copy_MySourceDBToADLS_OneTime_DailyFulls.json", CallBimlScript("PL_Copy_MySourceDBToADLS_OneTime.biml", dailyFulls, "Daily", "Full"));
System.IO.File.WriteAllText(@adfProjPath + "PL_Copy_MySourceDBToADLS_OneTime_DailyDeltas.json", CallBimlScript("PL_Copy_MySourceDBToADLS_OneTime.biml", dailyDeltas, "Daily", "Deltas"));
System.IO.File.WriteAllText(@adfProjPath + "PL_Copy_MySourceDBToADLS_OneTime_HourlyFulls.json", CallBimlScript("PL_Copy_MySourceDBToADLS_OneTime.biml", hourlyFulls, "Hourly", "Full"));
System.IO.File.WriteAllText(@adfProjPath + "PL_Copy_MySourceDBToADLS_OneTime_HourlyDeltas.json", CallBimlScript("PL_Copy_MySourceDBToADLS_OneTime.biml", hourlyDeltas, "Hourly", "Deltas"));
#>
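
The ExcelReader.ReadExcelQuery call above comes from the BGHelper code file, which I’m not including here. If you want to roll your own, a minimal version might look something like the sketch below. This is illustrative only, not the actual helper, and it assumes the Microsoft ACE OLE DB provider (Access Database Engine) is installed:

// Illustrative sketch of an Excel reader, not the actual BGHelper code.
// Loads one worksheet into a DataSet via the ACE OLE DB provider.
using System.Data;
using System.Data.OleDb;

public static class ExcelReader
{
    public static DataSet ReadExcelQuery(string filePath, string fileName, string worksheetName, bool hasHeader)
    {
        string fullPath = System.IO.Path.Combine(filePath, fileName);
        string connectionString = "Provider=Microsoft.ACE.OLEDB.12.0;Data Source=" + fullPath +
            ";Extended Properties=\"Excel 12.0 Xml;HDR=" + (hasHeader ? "YES" : "NO") + "\"";

        DataSet ds = new DataSet();
        using (OleDbConnection conn = new OleDbConnection(connectionString))
        using (OleDbDataAdapter adapter = new OleDbDataAdapter("SELECT * FROM [" + worksheetName + "]", conn))
        {
            // Name the table after the worksheet (minus the trailing $) so the
            // calling script can reference ds.Tables["Metadata"].
            adapter.Fill(ds, worksheetName.TrimEnd('$'));
        }
        return ds;
    }
}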

The Results

We were able to write the BimlScript and generate the datasets and pipelines in about 35 hours. A previous ADF project without automation took about 3 hours per source table. If we had gone that route, we could have been looking at 350 – 500 hours to complete this part of the project. Visual Studio with Biml Express took about 5 minutes to generate everything. Deploying to Azure took about an hour. We are now looking into ARM templates for future deployments.

Stay tuned for part 2 where I show how we generated the ADF pipelines.

https://gist.github.com/mlongoria/b1ce4334a793cca3d1b8c76c51c25dbe#file-adf_adls_dataset-biml

4 thoughts on “Copying data from On Prem SQL to ADLS with ADF and Biml – Part 1”

  1. Hi, I can’t get the ExcelReader (ExcelReader.ReadExcelQuery) to work, even though I have installed the AccessDatabaseEngine. Also, where do you get BGHelper from?

    1. BGHelper is a C# code file that contains the ExcelReader. It isn’t built in; it’s something we wrote to read the contents of the Excel file into a data view. I don’t have that file shared anywhere (I was just sharing the snippets related to ADF), but you can easily write your own in C#.
      Also, if you are just starting with Biml for ADF, you might want to know that ADF V2 has a different API and Biml will treat ADF objects as first-class citizens, so you won’t have to do what I did in this blog post.
