
Azure Data Factory and the Case of the Missing JRE That Wasn’t

On a recent project I used Azure Data Factory (ADF) to retrieve data from an on-premises SQL Server 2014 instance and land it in Azure Data Lake Store (ADLS) as ORC files. This required the use of the Data Management Gateway (DMG). Setup was quick and easy in our development environment. We installed the development DMG on a separate server in the client’s network, where we also installed SQL Server Management Studio (SSMS) for query development and data validation. We set up resource groups in Azure for development and production and made sure the settings in both environments were the same. Then we set up a separate server for the production DMG.

Deployment and execution went well in the dev environment. Once testing was complete, we deployed to prod. Deployment went fine, but the pipelines failed during execution and returned the following error on the output datasets.

Java Runtime Environment is not found.

The Java Runtime Environment (JRE) is not required for the DMG to run successfully, but it is needed for the creation of ORC files. The only problem with this error message was that we did indeed have the JRE installed on the server.

After reinstalling the JRE and the DMG and getting the same error, I consulted the troubleshooting guide. After finding nothing relevant there, I asked some colleagues for suggestions.

  • I double-checked that I had the same version of the DMG that worked in dev and that I had the most current version of the JRE.
  • I double-checked that the DMG and JRE matched bit-wise (32-bit vs. 64-bit). Both were 64-bit in my case.
  • I checked that JAVA_HOME was set correctly in the environment variables.

When none of those things worked, I logged a support ticket with Microsoft. They had me do the following:

  • Check the registry key – HKEY_LOCAL_MACHINE\Software\JavaSoft\Java Runtime Environment should have a CurrentVersion value that shows the current JRE version.
  • Check that the subkey named for that version has a JavaHome value with the correct path (something like C:\Program Files\Java\jre1.8.0_74).
  • Open that path and check that the bin folder exists.
  • Check that jvm.dll exists in the bin\server folder.

When none of those things worked, they gave me one last suggestion:

Install the Microsoft Visual C++ 2010 Redistributable Package.

And that turned out to solve the problem!

After review, we realized that we had installed SSMS on the dev DMG server but not on the prod DMG server. SSMS would have required the installation of the C++ redistributable package, which is why we didn’t encounter this error in dev.

I will confess that I don’t understand exactly why missing C++ libraries manifest themselves in an error claiming a missing Java Runtime Environment. If you have a good explanation, please leave it in the comments and I’ll update this and give you credit.

I hope that someone else who runs into this issue will find this blog post and avoid days of troubleshooting and confusion.

I Like to Move It, Move It – But Azure Data Factory Doesn’t

I’ve spent the last couple of months working on a project that includes Azure Data Factory and Azure SQL Data Warehouse. ADF has some nice file management capabilities that never made it into SSIS, such as zipping/unzipping files and copying from/to SFTP. But it also has some gaps I had to work around. My project involved copying data from on-premises SQL Server to an ORC file in a data lake staging area for ingestion into Azure SQL Data Warehouse through Polybase. Then I had planned to move that file to the Raw area of the data lake for archiving.

In other words, as sung by a great lemur, I like to move it.

But at this time ADF doesn’t support moving files. You can copy a file with a copy activity, but you cannot actually move it (i.e., copy and then delete the source).

Luckily, we had a workaround for our situation. If you tell ADF to copy data to a file that already exists in the specified location in the data lake, it will overwrite the existing file. We made sure the file name is always the same for each table in the staging area so there is always only one file per table.
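To make the overwrite workaround concrete, here is a rough sketch of what such a staging output dataset could look like in ADF (v1) JSON. The names, paths, and frequency below are placeholders for illustration, not our actual definitions; the point is that fileName is fixed, so every slice writes to the same file and replaces the previous one.

    {
      "name": "StagingCustomerOutput",
      "properties": {
        "type": "AzureDataLakeStore",
        "linkedServiceName": "AzureDataLakeStoreLinkedService",
        "typeProperties": {
          "folderPath": "staging/sales/customer",
          "fileName": "Customer.orc",
          "format": { "type": "OrcFormat" }
        },
        "availability": { "frequency": "Hour", "interval": 1 }
      }
    }

Because the copy activity always writes to the same folderPath and fileName, the staging area only ever holds the latest file for each table.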

What we ultimately ended up with was the following data flow:

  1. Retrieve time sliced data from on-premises SQL Server source via the Data Management Gateway.
  2. Land the data in the Raw area of the data lake as an ORC file.
  3. Copy file to staging.
  4. Execute stored procedure to populate data warehouse through Polybase.

I landed the data in Raw first so that we would not have to pull from SQL again if we needed to re-run a slice. Data latency wasn’t a huge issue for this client – we had some pipelines that ran hourly and some that ran daily. The extra seconds it took to land the file in Raw were not a concern.
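For anyone curious what the back half of that flow looks like in ADF (v1) JSON, here is a rough sketch: the data-lake-to-data-lake copy into staging (reusing the staging dataset sketched above) followed by the stored procedure activity. All names and the schedule are placeholders, the initial SQL-to-Raw copy activity is omitted, and the small warehouse output dataset assumed here exists only because an ADF v1 stored procedure activity picks up its target database from its output dataset.

    {
      "name": "CustomerHourlyPipeline",
      "properties": {
        "activities": [
          {
            "name": "CopyRawToStaging",
            "type": "Copy",
            "inputs": [ { "name": "RawCustomerOutput" } ],
            "outputs": [ { "name": "StagingCustomerOutput" } ],
            "typeProperties": {
              "source": { "type": "AzureDataLakeStoreSource" },
              "sink": { "type": "AzureDataLakeStoreSink" }
            },
            "scheduler": { "frequency": "Hour", "interval": 1 }
          },
          {
            "name": "LoadWarehouse",
            "type": "SqlServerStoredProcedure",
            "inputs": [ { "name": "StagingCustomerOutput" } ],
            "outputs": [ { "name": "WarehouseCustomerLoadLog" } ],
            "typeProperties": { "storedProcedureName": "stg.LoadCustomer" },
            "scheduler": { "frequency": "Hour", "interval": 1 }
          }
        ],
        "start": "2017-01-01T00:00:00Z",
        "end": "2099-12-31T00:00:00Z"
      }
    }

Chaining the activities through shared datasets (the staging dataset is the output of the copy and the input of the stored procedure activity) is what gives ADF v1 its dependency ordering within a slice.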

For now, if you do need to actually move or delete, you can use a custom C# activity to delete files. I chose not to do this because I didn’t want to add another technology for the client to learn/manage while adopting Azure. This may be the way to go for other projects.
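If you do go the custom activity route, the C# code itself lives in a zip package in blob storage and runs on Azure Batch or HDInsight; the ADF side is just another activity definition inside a pipeline's activities array. A hypothetical sketch, with every name, package, and linked service made up purely for illustration:

    {
      "name": "DeleteStagedFile",
      "type": "DotNetActivity",
      "inputs": [ { "name": "StagingCustomerOutput" } ],
      "outputs": [ { "name": "CleanupLogOutput" } ],
      "linkedServiceName": "AzureBatchLinkedService",
      "typeProperties": {
        "assemblyName": "AdlsCleanup.dll",
        "entryPoint": "AdlsCleanup.DeleteFileActivity",
        "packageLinkedService": "AzureStorageLinkedService",
        "packageFile": "customactivities/AdlsCleanup.zip"
      },
      "scheduler": { "frequency": "Hour", "interval": 1 }
    }

The package, the Batch pool, and the C# code are exactly the extra moving parts I didn’t want to hand to the client.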

If you think moving (copying and deleting) files should be a first class citizen in Azure Data Factory, please vote for the idea and spread the word for others to vote.

You don’t have to thank me for getting that song stuck in your head for the rest of the day.

Copying data from On Prem SQL to ADLS with ADF and Biml – Part 2

I showed in my previous post how we generated the datasets for our Azure Data Factory pipelines. In this post, I’ll show the BimlScript for our pipelines. Pipelines define the activities, identify the input and output datasets for those activities, and set an execution schedule. We were creating several pipelines with copy activities to copy data to Azure Data Lake Store.

We generated one pipeline per schedule and load type:

  • Hourly – Full
  • Hourly – Incremental
  • Daily – Full
  • Daily – Incremental

We also generated some one-time load pipelines for DR/new environment setup.

The first code file below is the template for the pipeline. You can see code nuggets for the data we receive from the generator file and for conditional logic we implemented. The result is one copy activity per source table within the appropriate pipeline.
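To give a sense of the output, one generated copy activity might look roughly like this (an illustrative sketch with placeholder names, not the actual generated code). The columns list and incremental predicate from the Excel metadata end up inside the sqlReaderQuery, and the conditional logic controls things like the schedule and whether the predicate is applied.

    {
      "name": "Copy_Sales_Customer",
      "type": "Copy",
      "inputs": [ { "name": "SqlInput_Sales_Customer" } ],
      "outputs": [ { "name": "AdlsOutput_Sales_Customer" } ],
      "typeProperties": {
        "source": {
          "type": "SqlSource",
          "sqlReaderQuery": "$$Text.Format('SELECT CustomerID, CustomerName, ModifiedDate FROM Sales.Customer WHERE ModifiedDate >= \\'{0:yyyy-MM-dd HH:mm}\\' AND ModifiedDate < \\'{1:yyyy-MM-dd HH:mm}\\'', WindowStart, WindowEnd)"
        },
        "sink": { "type": "AzureDataLakeStoreSink" }
      },
      "policy": { "retry": 3, "timeout": "01:00:00" },
      "scheduler": { "frequency": "Hour", "interval": 1 }
    }

A full-load table would get the same activity without the WHERE clause.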

In the second code file below, lines 104 to 119 are generating the pipelines. We read in the necessary data from the Excel file:

  • Schema name
  • Table name
  • Columns list
  • Incremental predicate

Sidenote: We wrote a quick T-SQL statement (not shown) to generate the columns list. This could have been done in our BimlScript, but it was something we changed after the fact to accommodate the limitations of Polybase (Dear Microsoft: Please fix). SQL was quicker and easier for us, but if I were to do this again I would add that into our BimlScript. We needed to replace new lines and double quotes in our data before we could read it in from the data lake.  You can get around this issue by using .ORC files rather than text delimited files. But the ORC files aren’t human readable, and we felt that was important for adoption of the data lake with the client on this project. They were already jumping in with several new technologies and we didn’t want to add anything else to the stack. So our select statements list out fields and replace the unwanted characters in the string fields.

Our Excel file (ADF Biml Metadata) has one row per source table.

Columns B, C, L, and M are populated by Excel formulas. This is the file that is read in by the BimlScript in the code below.

In our generator file (which is the same file that was used to generate the datasets), we use the CallBimlScript function to call the pipeline template file and pass along the required properties (table, schema, frequency, scope, columns list, predicate).

The great thing about Biml is that I can use it as much or as little as I feel is helpful. That T-SQL statement to get column lists could have been Biml, but it didn’t have to be. The client can maintain and enhance these pipelines with or without Biml as they see fit. There is no vendor lock-in here. Just as with Biml-generated SSIS projects, there is no difference between a hand-written ADF solution and a Biml-generated ADF solution, other than the Biml-generated solution is probably more consistent.

And have I mentioned the time savings? There is a reason why Varigence gives out shirts that say “It’s Monday and I’m done for the week.”

We made changes and regenerated our pipelines a few times, which would have taken hours without Biml. With Biml, it was no big deal.

Thanks to Levi for letting me share some of his code, and for working with me on this project!

 

Copying data from On Prem SQL to ADLS with ADF and Biml – Part 1

Apologies for the overly acronym-laden title as I was trying to keep it concise but descriptive. And we all know that adding technologies to your repertoire means adding more acronyms.

My coworker Levi and I are working on a project where we copy data from an on-premises SQL Server 2014 database and land it in Azure Data Lake Store. Then we use Polybase to get the data into Azure SQL Data Warehouse and build a dimensional model. I’ve done a couple of small projects before with Azure Data Factory, but nothing as large as this one. We had 173 tables that we needed to copy to ADLS. Then we needed to set up incremental loads for 95 of those tables going forward.

My Azure Data Factory is made up of the following components:

  • Gateway – allows ADF to retrieve data from an on-premises data source
  • Linked Services – define the connection string and other connection properties for each source and destination
  • Datasets – define a pointer to the data you want to process, sometimes including the schema of the input and output data
  • Pipelines – combine the datasets and activities and define an execution schedule

Each of these objects is defined in a JSON file. Defining datasets and copy activities in JSON gets very tedious, especially when you need to do it for 100+ tables. Tedium usually indicates a repeatable pattern, and if there is a repeatable pattern, you can probably automate it. The gateway and linked services are one-time setup activities that weren’t worth automating for this project, but the datasets and pipelines definitely were.
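For context, this is roughly the shape of one of those JSON files: an ADF (v1) linked service for an on-premises SQL Server reached through the gateway. It is a sketch with placeholder names and a fake connection string, not our actual definition.

    {
      "name": "OnPremSqlServerLinkedService",
      "properties": {
        "type": "OnPremisesSqlServer",
        "typeProperties": {
          "connectionString": "Data Source=SQLSERVER01;Initial Catalog=SourceDB;Integrated Security=False;User ID=adfload;Password=**********",
          "gatewayName": "ClientDataGateway"
        }
      }
    }

The gatewayName property is what routes the connection through the Data Management Gateway installed inside the client’s network.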

In order to automate the generation of datasets and pipelines, we need a little help with some metadata. We had the client help us fill out an Excel spreadsheet that listed each table in our source database and the following characteristics relevant to the load to Azure:

  • Frequency (daily or hourly)
  • Changes Only (incremental or full load)
  • Changed Time Column (datetime column used for incremental loads)

That list, plus the metadata we retrieved from SQL Server for each table (column names and data types), was all we needed to automate the creation of the ADF datasets and pipelines with BimlScript.

This post will show how we built the data sets. The following post will show the pipelines with the copy activities.

First we need to generate the input datasets coming from SQL Server. We added some properties at the top and embedded some code nuggets to handle the values that are specific to each table.
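For illustration only, a generated input dataset might come out looking something like this (placeholder names; the schema, table, and frequency are the values the code nuggets fill in). The external flag tells ADF that the data is produced outside the factory rather than by another pipeline.

    {
      "name": "SqlInput_Sales_Customer",
      "properties": {
        "type": "SqlServerTable",
        "linkedServiceName": "OnPremSqlServerLinkedService",
        "typeProperties": { "tableName": "Sales.Customer" },
        "external": true,
        "availability": { "frequency": "Hour", "interval": 1 }
      }
    }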

Next we need the output datasets for Azure Data Lake Store. We use the same three properties in generating each dataset- schema, table, frequency- and we add one more for scope.

Now we just need another BimlScript file that calls these two files. We broke our pipelines up into daily versus hourly and incremental versus full loads.

We used a helper code file and a separate environments file, which I’m glossing over so we can focus on the Biml for the ADF assets.  You can see that we read in the inputs from Excel and write some counts to a log file, just to make sure everything is working as intended. Starting on line 41 is where we generate the datasets. On lines 54 and 55, we use the CallBimlScript function to call the two files above. We end up generating datasets for the tables that are a full load each day and their counterpart datasets for the files we create in ADLS. The datasets for daily incremental loads are generated on lines 69 and 70. Then we do the hourly full loads and hourly incremental loads.  I’ll discuss lines 100 – 119 in my next post.

The Results

We were able to write the BimlScript and generate the datasets and pipelines in about 35 hours. A previous ADF project without automation took about 3 hours per source table. If we had gone that route, we could have been looking at 350 – 500 hours to complete this part of the project. Visual Studio with Biml Express took about 5 minutes to generate everything. Deploying to Azure took about an hour. We are now looking into ARM templates for future deployments.

Stay tuned for part 2 where I show how we generated the ADF pipelines.