Azure, Azure Data Factory, Azure Data Lake, Microsoft Technologies

I Like to Move It, Move It – But Azure Data Factory Doesn’t

I’ve spent the last couple of months working on a project that includes Azure Data Factory and Azure Data Warehouse. ADF has some nice capabilities for file management that never made it into SSIS such as zip/unzip files and copy from/to SFTP. But it also has some gaps I had to work around. My project involved copying data from on-premises SQL Server to an ORC file in a data lake staging area for ingestion into an Azure SQL Data Warehouse through Polybase. Then I had planned to move that file to the a raw area of the data lake for archiving.

In other words, as sung below by a great lemur, I like to move it.

But at this time ADF doesn’t support that. You can copy a file with a copy activity, but you cannot actually move (i.e., copy and delete).

Luckily, we had a workaround for our situation. If you tell ADF to copy data to a file that already exists in the specified location in the data lake, it will overwrite the existing file. We made sure the file name is always the same for each table in the staging area so there is always only one file per table.

What we ultimately ended up with was:

Azure Data Flow

  1. Retrieve time sliced data from on-premises SQL Server source via the Data Management Gateway.
  2. Land data in the Raw area of the data lake as ORC file.
  3. Copy file to staging.
  4. Execute stored procedure to populate data warehouse through Polybase.

I landed the data in Raw first so that we would not have to pull from SQL again if we needed to re-run a slice. Data latency wasn’t a huge issue for this client – we had some pipelines that ran hourly and some that ran daily. The extra seconds it took to land the file in Raw was not a concern.

For now, if you do need to actually move or delete, you can use a custom C# activity to delete files. I chose not to do this because I didn’t want to add another technology for the client to learn/manage while adopting Azure. This may be the way to go for other projects.

If you think moving (copying and deleting) files should be a first class citizen in Azure Data Factory, please vote for the idea and spread the word for others to vote.

You don’t have to thank me for getting that song stuck in your head for the rest of the day.