I adapted Aaron’s script to work in Azure SQL Data Warehouse and am sharing it with you below, so you don’t have to do the same. I did leave out holidays because I didn’t need them in my calendar table. To add them back in, just add the column back to the table and use Aaron’s update statements. I also changed the MonthYear field to include a space between the month and year. Otherwise, my script should produce the same results as Aaron’s in a Azure SQL DW friendly way. Notice that I chose to use Round Robin distribution for my date table. Also, tables are now created with clustered columnstore indexes by default in Azure SQL DW.
Instead of the computed columns, I inserted the dates, then updated the other columns in the temporary table. Then I used the Create Table As Select syntax to create my final RPT.Calendar table.
Update: As Gerhard points out in the comments, switching to ORC files solves this issue nicely. It’s not human readable, but it is much less error-prone when reading in data.
I’ve spent the last few weeks working on a project that used PolyBase to load data from Azure Blob Storage into Azure SQL Data Warehouse. While it’s been a great experience, I must note that PolyBase is a picky eater.
But just because you have successfully created the external tables does not mean you are finished. That is when the “fun” begins. If you would like more information on why “fun” is in quotes, read Grant Fritchey’s blog post on Loading Data into Azure SQL Data Warehouse.
You should test after populating any table in SQL Server, but I think this is especially true with external tables. More than likely you will find that you must resolve several issues with source file contents and external table definitions.
First let me say that PolyBase is cool. I can query data in text files and join to tables in my database. Next let me say PolyBase is a fairly young technology and has some limitations that I imagine will be improved in later versions.
One of those limitations (as of July 30, 2016) is that while you can declare your field delimiter and a string delimiter in external file formats, the row delimiter is not user configurable and there is no way to escape or ignore the row delimiter characters (\r, \n, or \r\n) inside of a string. So if you have a string that contains the row delimiter, PolyBase will interpret it as the end of the row even if it is placed inside of the string delimiters.
To elaborate further, I had several fields that originally came from a SQL Server table and were of type text. Some of the values in these fields contained newlines (\n) as users had typed paragraphs and addresses into the fields in the source application. The data from the source tables was exported from SQL Server to Azure Blob Storage using Azure Data Factory with a simple copy pipeline with no modifications to the data. The problem is that Hive, PolyBase, and several other tools have issues reading strings with newlines the value. They immediately interpret it as the end of the row. There is no escape character or setting you can use to allow newlines in the values.
If you find yourself in a similar situation, trying to load data from delimited files into Azure SQL DW and realizing you have newlines inside of string fields, there are two things you can do.
Fix the data in the flat files so it doesn’t contain new lines in string fields.
Switch to a different tool to load data to Azure DW. Azure Data Factory can take the data from blob storage and import it into a normal table in Azure DW.
In most circumstances, I would go for option #1. Option #2 only fixes things in Azure DW, leaving other tools in the environment to deal with the issue separately, and it requires storing a copy of the data in the DW.
In my project, I changed the ADF pipelines to replace newlines with an acceptable character/set of characters that doesn’t often appear in my data set and doesn’t obscure the values. We chose to replace them with 4 spaces. It’s important to understand that this means that your data in your blob storage will no longer exactly match its source. This is something you will want to document because it will surely pop up somewhere in the future.
Updating the ADF pipelines is not much effort. If my table definition is
and my original pipeline sqlReaderQuery was SELECT * FROM TableA, I just need to enumerate my fields, convert the text fields to varchar, and replace the new lines.
[Column3] = REPLACE(
CAST([Column3] AS varchar(8000)),
CHAR(13), ' '),
CHAR(10), ' ')
If you are using time slices in ADF and you have your query inside of the Text.Format() function, you will find that ADF doesn’t allow the single quotes around the four spaces in your JSON. You can instead use CHAR(32) instead of a space. If you have a better way of accomplishing this, please leave me a note in the comments.
In addition to updating the ADF pipelines, I also had to replace the newlines in my existing files in blob storage. Since there weren’t many of them, I just opened them up in Notepad++ and did a find & replace. If there had been more files, I would have looked into a more automated solution.
If the ability to allow field/row terminators within string fields is something you would like to see in the PolyBase, please voice your opinion by casting a vote on the feedback site.