Note: This post is about Azure Data Factory V1 I’ve spent the last couple of months working on a project that includes Azure Data Factory and Azure Data Warehouse. ADF has some nice capabilities for file management that never made it into SSIS such as zip/unzip files and copy from/to SFTP. But it also has some gaps I had to work around. My project involved copying data from on-premises SQL Server to an ORC file in a data lake staging area for ingestion into an Azure SQL Data Warehouse through Polybase. Then I had planned to move that file to the a raw area of the data lake for archiving.
In other words, as sung below by a great lemur, I like to move it.
But at this time ADF doesn’t support that. You can copy a file with a copy activity, but you cannot actually move (i.e., copy and delete).
Luckily, we had a workaround for our situation. If you tell ADF to copy data to a file that already exists in the specified location in the data lake, it will overwrite the existing file. We made sure the file name is always the same for each table in the staging area so there is always only one file per table.
What we ultimately ended up with was:
Retrieve time sliced data from on-premises SQL Server source via the Data Management Gateway.
Land data in the Raw area of the data lake as ORC file.
Copy file to staging.
Execute stored procedure to populate data warehouse through Polybase.
I landed the data in Raw first so that we would not have to pull from SQL again if we needed to re-run a slice. Data latency wasn’t a huge issue for this client – we had some pipelines that ran hourly and some that ran daily. The extra seconds it took to land the file in Raw was not a concern.
For now, if you do need to actually move or delete, you can use a custom C# activity to delete files. I chose not to do this because I didn’t want to add another technology for the client to learn/manage while adopting Azure. This may be the way to go for other projects.
If you think moving (copying and deleting) files should be a first class citizen in Azure Data Factory, please vote for the idea and spread the word for others to vote.
I support 'Move Activity' – An activity that copies and then deletes in Azure Data Factory. What do you think? https://t.co/fyBE7tyw8w
Note: This post is about Azure Data Factory V1 I showed in my previous post how we generated the datasets for our Azure Data Factory pipelines. In this post, I’ll show the BimlScript for our pipelines. Pipelines define the activities, identify the input and output datasets for those activities, and set an execution schedule. We were creating several pipelines with copy activities to copy data to Azure Data Lake Store.
We generated one pipeline per schedule and load type:
Hourly – Full
Hourly – Incremental
Daily – Full
Daily – Incremental
We also generated some one-time load pipelines for DR/new environment setup.
The first code file below is the template for the pipeline. You can see code nuggets for the data we receive from the generator file and for conditional logic we implemented. The result is one copy activity per source table within the appropriate pipeline.
In the second code file below, lines 104 to 119 are generating the pipelines. We read in the necessary data from the Excel file:
Sidenote: We wrote a quick T-SQL statement (not shown) to generate the columns list. This could have been done in our BimlScript, but it was something we changed after the fact to accommodate the limitations of Polybase (Dear Microsoft: Please fix). SQL was quicker and easier for us, but if I were to do this again I would add that into our BimlScript. We needed to replace new lines and double quotes in our data before we could read it in from the data lake. You can get around this issue by using .ORC files rather than text delimited files. But the ORC files aren’t human readable, and we felt that was important for adoption of the data lake with the client on this project. They were already jumping in with several new technologies and we didn’t want to add anything else to the stack. So our select statements list out fields and replace the unwanted characters in the string fields.
Our Excel file looks like this.
Columns B, C, L, and M are populated by Excel formulas. This is the file that is read in by the BimlScript in the code below.
In our generator file (which is the same file that was used to generate the datasets), we use the CallBimlScript function to call the pipeline template file and pass along the required properties (table, schema, frequency, scope, columns list, predicate).
The great thing about Biml is that I can use it as much or as little as I feel is helpful. That T-SQL statement to get column lists could have been Biml, but it didn’t have to be. The client can maintain and enhance these pipelines with or without Biml as they see fit. There is no vendor lock-in here. Just as with Biml-generated SSIS projects, there is no difference between a hand-written ADF solution and a Biml-generated ADF solution, other than the Biml-generated solution is probably more consistent.
And have I mentioned the time savings? There is a reason why Varigence gives out shirts that say “It’s Monday and I’m done for the week.”
Note: This post is about Azure Data Factory V1
Apologies for the overly acronym-laden title as I was trying to keep it concise but descriptive. And we all know that adding technologies to your repertoire means adding more acronyms.
My coworker Levi and I are working on a project where we copy data from an on-premises SQL Server 2014 database and land it in Azure Data Lake Store. Then we use Polybase to get the data into Azure SQL Data Warehouse and build a dimensional model. I’ve done a couple of small projects before with Azure Data Factory, but nothing as large as this one. We had 173 tables that we needed to copy to ADLS. Then we needed to set up incremental loads for 95 of those tables going forward.
My Azure Data Factory is made up of the following components:
Gateway – Allows ADF to retrieve data from an on premises data source
Linked Services – define the connection string and other connection properties for each source and destination
Datasets – Define a pointer to the data you want to process, sometimes defining the schema of the input and output data
Pipelines – combine the data sets and activities and define an execution schedule
Each of these objects is defined in a JSON file. Defining data sets and copy activities in JSON gets very tedious, especially when you need to do this for 100+ tables. Tedium usually indicates a repeatable pattern. If there is a repeatable pattern you can probably automate it. The gateway and linked services are one-time setup activities that weren’t worth automating for this project, but the datasets and pipelines definitely were.
In order to automate the generation of datasets and pipelines, we need a little help with some metadata. We had the client help us fill out an Excel spreadsheet that listed each table in our source database and the following characteristics relevant to the load to Azure:
Frequency (daily or hourly)
Changes Only (incremental or full load)
Changed Time Column (datetime column used for incremental loads)
That list plus the metadata we retrieved from SQL server for each table (column names and data types) were all we needed to automate the creation of the ADF datasets and pipelines with BimlScript.
This post will show how we built the data sets. The following post will show the pipelines with the copy activities.
First we need to generate the input datasets coming from SQL Server. We added some properties at the top and embedded some code nuggets to handle the values that are specific to each table.
Next we need the output datasets for Azure Data Lake Store. We use the same three properties in generating each dataset- schema, table, frequency- and we add one more for scope.
Now we just need another BimlScript file that calls these two files. We broke our pipelines up into daily versus hourly and incremental versus full loads.
We used a helper code file and a separate environments file, which I’m glossing over so we can focus on the Biml for the ADF assets. You can see that we read in the inputs from Excel and write some counts to a log file, just to make sure everything is working as intended. Starting on line 41 is where we generate the datasets. On lines 54 and 55, we use the CallBimlScript function to call the two files above. We end up generating datasets for the tables that are a full load each day and their counterpart datasets for the files we create in ADLS. The datasets for daily incremental loads are generated on lines 69 and 70. Then we do the hourly full loads and hourly incremental loads. I’ll discuss lines 100 – 119 in my next post.
We were able to write the BimlScript and generate the datasets and pipelines in about 35 hours. A previous ADF project without automation took about 3 hours per source table. If we had gone that route, we could have been looking at 350 – 500 hours to complete this part of the project. Visual Studio with Biml Express took about 5 minutes to generate everything. Deploying to Azure took about an hour. We are now looking into ARM templates for future deployments.
Stay tuned for part 2 where I show how we generated the ADF pipelines.
I spent a good bit of time looking for the definitions/descriptions of the TMSCHEMA DMVs that allow us to view metadata and monitor the health of SSAS 2016 tabular models. As far as I can tell there are no details about them on any Microsoft site. Many of the columns are obvious, but there are a few fields that show IDs rather than descriptions (e.g., ExplicitDataType in TMSCHEMA_COLUMNS, Type in TMSCHEMA_DATA_SOURCES). It would be great to get the DMVs documented similar to the MDSCHEMA DMVs as they are quite useful for tasks like documenting your tabular model. Since the TMSCHEMA DMVs work in Azure Analysis Services as well, I have logged this request on the Azure AS User Voice for that. Please lend me a vote so we can make this information more easily available.
I adapted Aaron’s script to work in Azure SQL Data Warehouse and am sharing it with you below, so you don’t have to do the same. I did leave out holidays because I didn’t need them in my calendar table. To add them back in, just add the column back to the table and use Aaron’s update statements. I also changed the MonthYear field to include a space between the month and year. Otherwise, my script should produce the same results as Aaron’s in a Azure SQL DW friendly way. Notice that I chose to use Round Robin distribution for my date table. Also, tables are now created with clustered columnstore indexes by default in Azure SQL DW.
Instead of the computed columns, I inserted the dates, then updated the other columns in the temporary table. Then I used the Create Table As Select syntax to create my final RPT.Calendar table.
Update: As Gerhard points out in the comments, switching to ORC files solves this issue nicely. It’s not human readable, but it is much less error-prone when reading in data.
I’ve spent the last few weeks working on a project that used PolyBase to load data from Azure Blob Storage into Azure SQL Data Warehouse. While it’s been a great experience, I must note that PolyBase is a picky eater.
But just because you have successfully created the external tables does not mean you are finished. That is when the “fun” begins. If you would like more information on why “fun” is in quotes, read Grant Fritchey’s blog post on Loading Data into Azure SQL Data Warehouse.
You should test after populating any table in SQL Server, but I think this is especially true with external tables. More than likely you will find that you must resolve several issues with source file contents and external table definitions.
First let me say that PolyBase is cool. I can query data in text files and join to tables in my database. Next let me say PolyBase is a fairly young technology and has some limitations that I imagine will be improved in later versions.
One of those limitations (as of July 30, 2016) is that while you can declare your field delimiter and a string delimiter in external file formats, the row delimiter is not user configurable and there is no way to escape or ignore the row delimiter characters (\r, \n, or \r\n) inside of a string. So if you have a string that contains the row delimiter, PolyBase will interpret it as the end of the row even if it is placed inside of the string delimiters.
To elaborate further, I had several fields that originally came from a SQL Server table and were of type text. Some of the values in these fields contained newlines (\n) as users had typed paragraphs and addresses into the fields in the source application. The data from the source tables was exported from SQL Server to Azure Blob Storage using Azure Data Factory with a simple copy pipeline with no modifications to the data. The problem is that Hive, PolyBase, and several other tools have issues reading strings with newlines the value. They immediately interpret it as the end of the row. There is no escape character or setting you can use to allow newlines in the values.
If you find yourself in a similar situation, trying to load data from delimited files into Azure SQL DW and realizing you have newlines inside of string fields, there are two things you can do.
Fix the data in the flat files so it doesn’t contain new lines in string fields.
Switch to a different tool to load data to Azure DW. Azure Data Factory can take the data from blob storage and import it into a normal table in Azure DW.
In most circumstances, I would go for option #1. Option #2 only fixes things in Azure DW, leaving other tools in the environment to deal with the issue separately, and it requires storing a copy of the data in the DW.
In my project, I changed the ADF pipelines to replace newlines with an acceptable character/set of characters that doesn’t often appear in my data set and doesn’t obscure the values. We chose to replace them with 4 spaces. It’s important to understand that this means that your data in your blob storage will no longer exactly match its source. This is something you will want to document because it will surely pop up somewhere in the future.
Updating the ADF pipelines is not much effort. If my table definition is
and my original pipeline sqlReaderQuery was SELECT * FROM TableA, I just need to enumerate my fields, convert the text fields to varchar, and replace the new lines.
[Column3] = REPLACE(
CAST([Column3] AS varchar(8000)),
CHAR(13), ' '),
CHAR(10), ' ')
If you are using time slices in ADF and you have your query inside of the Text.Format() function, you will find that ADF doesn’t allow the single quotes around the four spaces in your JSON. You can instead use CHAR(32) instead of a space. If you have a better way of accomplishing this, please leave me a note in the comments.
In addition to updating the ADF pipelines, I also had to replace the newlines in my existing files in blob storage. Since there weren’t many of them, I just opened them up in Notepad++ and did a find & replace. If there had been more files, I would have looked into a more automated solution.
If the ability to allow field/row terminators within string fields is something you would like to see in the PolyBase, please voice your opinion by casting a vote on the feedback site.