PolyBase Is A Picky Eater – Remove Carriage Returns Before Ingesting Text

Update: As Gerhard points out in the comments, switching to ORC files solves this issue nicely. It’s not human readable, but it is much less error-prone when reading in data. 

I’ve spent the last few weeks working on a project that used PolyBase to load data from Azure Blob Storage into Azure SQL Data Warehouse. While it’s been a great experience, I must note that PolyBase is a picky eater.

4184268 - finicky or picky bulldog pouting beside full bowl of dog food
PolyBase is a picky eater, just like this bulldog.

Creating the external tables is fairly simple if you are already familiar with T-SQL. The syntax for creating external data sources, file formats, and tables is fairly well documented on MSDN. It’s important to be aware of features and data types that aren’t supported, but I found it to be pretty smooth sailing beyond that.

But just because you have successfully created the external tables does not mean you are finished. That is when the “fun” begins. If you would like more information on why “fun” is in quotes, read Grant Fritchey’s blog post on Loading Data into Azure SQL Data Warehouse.

You should test after populating any table in SQL Server, but I think this is especially true with external tables. More than likely you will find that you must resolve several issues with source file contents and external table definitions.

First let me say that PolyBase is cool. I can query data in text files and join to tables in my database. Next let me say PolyBase is a fairly young technology and has some limitations that I imagine will be improved in later versions.

One of those limitations (as of July 30, 2016) is that while you can declare your field delimiter and a string delimiter in external file formats, the row delimiter is not user configurable and there is no way to escape or ignore the row delimiter characters (\r, \n, or \r\n) inside of a string. So if you have a string that contains the row delimiter, PolyBase will interpret it as the end of the row even if it is placed inside of the string delimiters.

To elaborate further, I had several fields that originally came from a SQL Server table and were of type text. Some of the values in these fields contained newlines (\n) as users had typed paragraphs and addresses into the fields in the source application. The data from the source tables was exported from SQL Server to Azure Blob Storage using Azure Data Factory with a simple copy pipeline with no modifications to the data. The problem is that Hive, PolyBase, and several other tools have issues reading strings with newlines the value. They immediately interpret it as the end of the row. There is no escape character or setting you can use to allow newlines in the values.

If you find yourself in a similar situation, trying to load data from delimited files into Azure SQL DW and realizing you have newlines inside of string fields, there are two things you can do.

  1. Fix the data in the flat files so it doesn’t contain new lines in string fields.
  2. Switch to a different tool to load data to Azure DW. Azure Data Factory can take the data from blob storage and import it into a normal table in Azure DW.

In most circumstances, I would go for option #1. Option #2 only fixes things in Azure DW, leaving other tools in the environment to deal with the issue separately, and it requires storing a copy of the data in the DW.

In my project, I changed the ADF pipelines to replace newlines with an acceptable character/set of characters that doesn’t often appear in my data set and doesn’t obscure the values. We chose to replace them with 4 spaces. It’s important to understand that this means that your data in your blob storage will no longer exactly match its source. This is something you will want to document because it will surely pop up somewhere in the future.

Updating the ADF pipelines is not much effort. If my table definition is

CREATE TABLE TableA (
    Column1 int,
    Column2 varchar(10),
    Column3 text
)

and my original pipeline sqlReaderQuery was SELECT * FROM TableA, I just need to enumerate my fields, convert the text fields to varchar, and replace the new lines.

SELECT
    [Column1],
    [Column2],
    [Column3] = REPLACE(
                  REPLACE(
                    CAST([Column3] AS varchar(8000)), 
                      CHAR(13), '    '), 
                      CHAR(10), '    ')
FROM TableA

If you are using time slices in ADF and you have your query inside of the Text.Format() function, you will find that ADF doesn’t allow the single quotes around the four spaces in your JSON. You can instead use CHAR(32) instead of a space. If you have a better way of accomplishing this, please leave me a note in the comments.

SELECT
    [Column1],
    [Column2],
    [Column3] = REPLACE(
                  REPLACE(
                    CAST([Column3] AS varchar(8000)), 
                      CHAR(13), CHAR(32) + CHAR(32) + CHAR(32) + CHAR(32)), 
                      CHAR(10), CHAR(32) + CHAR(32) + CHAR(32) + CHAR(32))
FROM TableA

In addition to updating the ADF pipelines, I also had to replace the newlines in my existing files in blob storage. Since there weren’t many of them, I just opened them up in Notepad++ and did a find & replace. If there had been more files, I would have looked into a more automated solution.

If the ability to allow field/row terminators within string fields is something you would like to see in the PolyBase, please voice your opinion by casting a vote on the feedback site.

5 thoughts on “PolyBase Is A Picky Eater – Remove Carriage Returns Before Ingesting Text

  1. Have you tried using the Unicode \u0027 instead of the single quotes? So it would look like \u0027 \u0027. Or perhaps try escaping the single quotes with backslashes? \’ \’

  2. Thanks, Chris. I’ll have to try the unicode approach. I hope I made clear that the choice of a character is arbitrary, although any string would need the single quotes around it (if I don’t use char()), and I’d love to have a less awkward tactic. It might be a while, but I’ll update this post if that works.

  3. Hi Meagan,
    I am currently in a very similar situation as you described it. I am loading data from an on-premises SQL DB to Azure Blob Store via Azure Data Factory and this was also on issue that I was hitting (besides several others). Some of them were related to ADF, some to CSV-fileformat and some to Polybase in general. I ended up using ORC files instead of CSV and all my previous issues went away:
    – NewLine/Carriage return is working
    – I do not need to specify the format for Date/Time datatypes explicitly
    – no need to specify all columns in ADF

    the only drawback is that ORC files are not as readable as CSV files but since Polybase is working just fine now this is OK. And also, all other major data platforms (foremost Hadoop) support ORC so this is also fine

    -gerhard

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s