Update: As Gerhard points out in the comments, switching to ORC files solves this issue nicely. It’s not human readable, but it is much less error-prone when reading in data.
I’ve spent the last few weeks working on a project that used PolyBase to load data from Azure Blob Storage into Azure SQL Data Warehouse. While it’s been a great experience, I must note that PolyBase is a picky eater.
Creating the external tables is fairly simple if you are already familiar with T-SQL. The syntax for creating external data sources, file formats, and tables is fairly well documented on MSDN. It’s important to be aware of features and data types that aren’t supported, but I found it to be pretty smooth sailing beyond that.
But just because you have successfully created the external tables does not mean you are finished. That is when the “fun” begins. If you would like more information on why “fun” is in quotes, read Grant Fritchey’s blog post on Loading Data into Azure SQL Data Warehouse.
You should test after populating any table in SQL Server, but I think this is especially true with external tables. More than likely you will find that you must resolve several issues with source file contents and external table definitions.
First let me say that PolyBase is cool. I can query data in text files and join to tables in my database. Next let me say PolyBase is a fairly young technology and has some limitations that I imagine will be improved in later versions.
One of those limitations (as of July 30, 2016) is that while you can declare your field delimiter and a string delimiter in external file formats, the row delimiter is not user configurable and there is no way to escape or ignore the row delimiter characters (\r, \n, or \r\n) inside of a string. So if you have a string that contains the row delimiter, PolyBase will interpret it as the end of the row even if it is placed inside of the string delimiters.
To elaborate further, I had several fields that originally came from a SQL Server table and were of type text. Some of the values in these fields contained newlines (\n) as users had typed paragraphs and addresses into the fields in the source application. The data from the source tables was exported from SQL Server to Azure Blob Storage using Azure Data Factory with a simple copy pipeline with no modifications to the data. The problem is that Hive, PolyBase, and several other tools have issues reading strings with newlines the value. They immediately interpret it as the end of the row. There is no escape character or setting you can use to allow newlines in the values.
If you find yourself in a similar situation, trying to load data from delimited files into Azure SQL DW and realizing you have newlines inside of string fields, there are two things you can do.
- Fix the data in the flat files so it doesn’t contain new lines in string fields.
- Switch to a different tool to load data to Azure DW. Azure Data Factory can take the data from blob storage and import it into a normal table in Azure DW.
In most circumstances, I would go for option #1. Option #2 only fixes things in Azure DW, leaving other tools in the environment to deal with the issue separately, and it requires storing a copy of the data in the DW.
In my project, I changed the ADF pipelines to replace newlines with an acceptable character/set of characters that doesn’t often appear in my data set and doesn’t obscure the values. We chose to replace them with 4 spaces. It’s important to understand that this means that your data in your blob storage will no longer exactly match its source. This is something you will want to document because it will surely pop up somewhere in the future.
Updating the ADF pipelines is not much effort. If my table definition is
CREATE TABLE TableA ( Column1 int, Column2 varchar(10), Column3 text )
and my original pipeline sqlReaderQuery was
SELECT * FROM TableA, I just need to enumerate my fields, convert the text fields to varchar, and replace the new lines.
SELECT [Column1], [Column2], [Column3] = REPLACE( REPLACE( CAST([Column3] AS varchar(8000)), CHAR(13), ' '), CHAR(10), ' ') FROM TableA
If you are using time slices in ADF and you have your query inside of the Text.Format() function, you will find that ADF doesn’t allow the single quotes around the four spaces in your JSON. You can instead use CHAR(32) instead of a space. If you have a better way of accomplishing this, please leave me a note in the comments.
SELECT [Column1], [Column2], [Column3] = REPLACE( REPLACE( CAST([Column3] AS varchar(8000)), CHAR(13), CHAR(32) + CHAR(32) + CHAR(32) + CHAR(32)), CHAR(10), CHAR(32) + CHAR(32) + CHAR(32) + CHAR(32)) FROM TableA
In addition to updating the ADF pipelines, I also had to replace the newlines in my existing files in blob storage. Since there weren’t many of them, I just opened them up in Notepad++ and did a find & replace. If there had been more files, I would have looked into a more automated solution.
If the ability to allow field/row terminators within string fields is something you would like to see in the PolyBase, please voice your opinion by casting a vote on the feedback site.