Azure Data Factory and the Case of the Missing JRE That Wasn’t

On a recent project I used Azure Data Factory (ADF) to retrieve data from an on-premises SQL Server 2014 instance and land it in Azure Data Lake Store (ADLS) as ORC files. This required the use of the Data Management Gateway (DMG). Setup was quick and easy in our development environment. We installed the DMG for development on a separate server in the client’s network, where we also installed SQL Server Management Studio (SSMS) for query development and data validation. We set up resource groups in Azure for development and production, and made sure the settings for development and production were the same. Then we set up a separate server for the production DMG.

Deployment and execution went well in the dev environment. Testing was completed, so we deployed to prod. Deployment went fine, but the pipelines failed during execution and returned the following error on the output datasets.

[Screenshot: ADF output dataset error]
Java Runtime Environment is not found.

The Java Runtime Environment (JRE) is not required for the DMG to run successfully, but it is needed for the creation of ORC files. The only problem with this error message was that we did indeed have the JRE installed on the server.

After reinstalling the JRE and the DMG and getting the same error, I consulted the troubleshooting guide. After finding nothing relevant there, I asked some colleagues for suggestions.

  • I double-checked that I had the same version of the DMG that worked in dev and that I had the most current version of the JRE.
  • I double-checked that the DMG and JRE matched bit-wise (32-bit vs. 64-bit). Both were 64-bit in my case.
  • I checked that JAVA_HOME was set correctly in the environment variables.

When none of those things worked, I logged a support ticket with Microsoft. They had me do the following:

  • Check the registry key – HKEY_LOCAL_MACHINE\Software\JavaSoft\Java Runtime Environment should have a Current Version entry that shows the current JRE version.
  • Check that the subkey named for that version has a JAVAHOME entry with the correct path (something like C:\Program Files\Java\jre1.8.0_74).
  • Open the path and check that the bin folder exists.
  • Check that jvm.dll exists in the bin/server folder.

When none of those things worked, they gave me one last suggestion:

Install the Microsoft Visual C++ 2010 Redistributable Package.

And that turned out to solve the problem!

After review, we realized that we had installed SSMS on the dev DMG server but not on the prod DMG server. Installing SSMS would have installed the C++ redistributable package as a prerequisite, which is why we didn’t encounter this error in dev.

I will confess that I don’t understand exactly why missing C++ libraries manifest themselves in an error claiming a missing Java Runtime Environment. If you have a good explanation, please leave it in the comments and I’ll update this and give you credit.

I hope that someone else who runs into this issue will find this blog post and avoid days of troubleshooting and confusion.

I Like to Move It, Move It – But Azure Data Factory Doesn’t

I’ve spent the last couple of months working on a project that includes Azure Data Factory and Azure SQL Data Warehouse. ADF has some nice capabilities for file management that never made it into SSIS, such as zipping/unzipping files and copying from/to SFTP. But it also has some gaps I had to work around. My project involved copying data from on-premises SQL Server to an ORC file in a data lake staging area for ingestion into an Azure SQL Data Warehouse through Polybase. Then I had planned to move that file to the Raw area of the data lake for archiving.

In other words, as sung below by a great lemur, I like to move it.

But at this time ADF doesn’t support that. You can copy a file with a copy activity, but you cannot actually move it (i.e., copy the file and then delete the original).

Luckily, we had a workaround for our situation. If you tell ADF to copy data to a file that already exists in the specified location in the data lake, it will overwrite the existing file. We made sure the file name was always the same for each table in the staging area, so there was only ever one file per table.

What we ultimately ended up with was:

[Diagram: Azure data flow]

  1. Retrieve time-sliced data from the on-premises SQL Server source via the Data Management Gateway.
  2. Land the data in the Raw area of the data lake as an ORC file.
  3. Copy file to staging.
  4. Execute stored procedure to populate data warehouse through Polybase.

I landed the data in Raw first so that we would not have to pull from SQL again if we needed to re-run a slice. Data latency wasn’t a huge issue for this client – we had some pipelines that ran hourly and some that ran daily. The extra seconds it took to land the file in Raw were not a concern.

For now, if you do need to actually move or delete files, you can use a custom C# activity to handle the deletion. I chose not to do this because I didn’t want to add another technology for the client to learn/manage while adopting Azure. This may be the way to go for other projects.

If you think moving (copying and deleting) files should be a first class citizen in Azure Data Factory, please vote for the idea and spread the word for others to vote.

You don’t have to thank me for getting that song stuck in your head for the rest of the day.

Insufficient Disk Space (T-SQL Tuesday #88)

This month’s T-SQL Tuesday – hosted by Kennie T Pontoppidan (@KennieNP) – is called “The daily (database-related) WTF”. He asked us to be inspired by the IT horror stories from http://thedailywtf.com, and tell our own daily WTF story.

Years ago in a previous job, I worked at a company that had no DBAs. I am/was a BI developer, so I know my way around a database, but I wasn’t dedicated to keeping all databases in good health. There were several application developers at this company (mostly focused on .NET and JavaScript) who built applications with SQL Server databases as the back end. And there was a guy who acted as a system admin among his many other duties. The application developers had built a web app that was to be used by users around the world. The application had been launched and things were fine for several weeks. I wasn’t involved with the project, but I was aware of it.

One day, a manager asked me if I could help on an urgent matter: the application suddenly could no longer execute transactions on the production database, and the database connection was intermittently failing. The system admin was busy with other duties, so I was the closest thing they had to a DBA. All they could tell me was that the production database had crashed and they got an error message about insufficient disk space.

I logged on to the server that housed the database to see what was going on. The server itself had been set up appropriately and seemed to have sufficient memory and CPU to support the load of this application. I saw 3 volumes on the server: a C volume for application and system files, a large F volume for data, and a large G volume for logs.

I connected to the database with Management Studio to do some more digging. The first thing I noticed was that the dev, test, and prod databases for this application were all on the same SQL Server instance. The dev and test databases weren’t very large, so while that wasn’t what I would have recommended, it didn’t seem to be the main problem. As I looked at the prod database, I noticed that the MDF and LDF files were sitting on the C volume rather than the spacious F volume that was made for them! The person who configured the server hadn’t made the C volume very large since user databases weren’t supposed to be there.

Then I looked at the size of the log file. It was huge! A bit more digging revealed that the database had been left at its defaults – full recovery model with autogrowth on the log file – but no one had ever taken a transaction log backup. (Sidenote: You can check the Log_Reuse_Wait_Desc column in sys.databases to verify the database is waiting on a transaction log backup.) The developers had worked long and hard to get the application up and running and hadn’t quite finished up the maintenance and disaster recovery tasks.
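
A quick query confirms it – a minimal sketch, with a hypothetical database name; a Log_Reuse_Wait_Desc value of LOG_BACKUP means the log cannot be reused until a log backup is taken:

  -- Check why the transaction log cannot be truncated/reused.
  -- LOG_BACKUP means the database is waiting on a transaction log backup.
  SELECT name,
         recovery_model_desc,
         log_reuse_wait_desc
  FROM sys.databases
  WHERE name = N'MyAppDb'; -- hypothetical database name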

Once I knew what I was dealing with, I was able to fix the problem. A full backup and a log backup later, we were back in business. I went ahead and shrank the log file back to a reasonable size (please remember this is reserved for special occasions). I took the database offline (which was acceptable since the application was currently unusable anyway), moved the MDF and LDF files to their rightful home, and brought it back online. A lesson on recovery models, plus SQL Agent jobs to schedule those backups going forward, ensured this didn’t happen again anytime soon.
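
For anyone facing the same situation, the sequence looked roughly like this – a hedged sketch rather than my exact commands, with hypothetical database name, logical file names, and paths:

  -- 1. Full backup, then a transaction log backup, so the log space can be reused.
  BACKUP DATABASE MyAppDb TO DISK = N'F:\Backup\MyAppDb.bak';
  BACKUP LOG MyAppDb TO DISK = N'F:\Backup\MyAppDb.trn';

  -- 2. Shrink the bloated log file back to a reasonable size (target in MB).
  --    Remember this is reserved for special occasions like this one.
  USE MyAppDb;
  DBCC SHRINKFILE (MyAppDb_log, 2048);

  -- 3. Repoint the files at the volumes built for them, take the database offline,
  --    move the physical MDF/LDF files in the OS, then bring it back online.
  USE master;
  ALTER DATABASE MyAppDb MODIFY FILE (NAME = MyAppDb, FILENAME = N'F:\Data\MyAppDb.mdf');
  ALTER DATABASE MyAppDb MODIFY FILE (NAME = MyAppDb_log, FILENAME = N'G:\Logs\MyAppDb_log.ldf');
  ALTER DATABASE MyAppDb SET OFFLINE WITH ROLLBACK IMMEDIATE;
  -- (copy the MDF to F:\Data and the LDF to G:\Logs in the OS, then)
  ALTER DATABASE MyAppDb SET ONLINE;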

This should be a good reminder to have a healthy respect for and understanding of your database settings, and to make sure you have (and test) backups – both full and transaction log – for your production databases.

Copying data from On Prem SQL to ADLS with ADF and Biml – Part 2

I showed in my previous post how we generated the datasets for our Azure Data Factory pipelines. In this post, I’ll show the BimlScript for our pipelines. Pipelines define the activities, identify the input and output datasets for those activities, and set an execution schedule. We were creating several pipelines with copy activities to copy data to Azure Data Lake Store.

We generated one pipeline per schedule and load type:

  • Hourly – Full
  • Hourly – Incremental
  • Daily – Full
  • Daily – Incremental

We also generated some one-time load pipelines for DR/new environment setup.

The first code file below is the template for the pipeline. You can see code nuggets for the data we receive from the generator file and for conditional logic we implemented. The result is one copy activity per source table within the appropriate pipeline.

In the second code file below, lines 104 to 119 are generating the pipelines. We read in the necessary data from the Excel file:

  • Schema name
  • Table name
  • Columns list
  • Incremental predicate

Sidenote: We wrote a quick T-SQL statement (not shown) to generate the columns list. This could have been done in our BimlScript, but it was something we changed after the fact to accommodate the limitations of Polybase (Dear Microsoft: Please fix). SQL was quicker and easier for us, but if I were to do this again I would add it into our BimlScript. We needed to replace new lines and double quotes in our data before we could read it in from the data lake. You can get around this issue by using ORC files rather than text-delimited files, but ORC files aren’t human readable, and we felt that readability was important for adoption of the data lake with the client on this project. They were already jumping in with several new technologies and we didn’t want to add anything else to the stack. So our select statements list out the fields and replace the unwanted characters in the string fields.
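
For illustration, here is a hedged sketch of the kind of statement that builds such a column list from sys.columns (not our exact code – type handling is simplified): string columns get wrapped in REPLACE() calls to strip double quotes, carriage returns, and line feeds.

  -- Build a comma-separated column list for each table, wrapping string columns
  -- in REPLACE() to strip double quotes, carriage returns, and line feeds.
  SELECT s.name AS SchemaName,
         t.name AS TableName,
         STUFF((SELECT ', ' +
                       CASE WHEN ty.name IN ('char', 'nchar', 'varchar', 'nvarchar')
                            THEN 'REPLACE(REPLACE(REPLACE(' + QUOTENAME(c.name)
                                 + ', ''"'', ''''), CHAR(13), ''''), CHAR(10), '''') AS ' + QUOTENAME(c.name)
                            ELSE QUOTENAME(c.name)
                       END
                FROM sys.columns c
                JOIN sys.types ty ON ty.user_type_id = c.user_type_id
                WHERE c.object_id = t.object_id
                ORDER BY c.column_id
                FOR XML PATH(''), TYPE).value('.', 'nvarchar(max)'), 1, 2, '') AS ColumnList
  FROM sys.tables t
  JOIN sys.schemas s ON s.schema_id = t.schema_id
  ORDER BY s.name, t.name;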

Our Excel file looks like this.

[Screenshot: ADF Biml metadata spreadsheet]

Columns B, C, L, and M are populated by Excel formulas. This is the file that is read in by the BimlScript in the code below.

In our generator file (which is the same file that was used to generate the datasets), we use the CallBimlScript function to call the pipeline template file and pass along the required properties (table, schema, frequency, scope, columns list, predicate).

The great thing about Biml is that I can use it as much or as little as I feel is helpful. That T-SQL statement to get column lists could have been Biml, but it didn’t have to be. The client can maintain and enhance these pipelines with or without Biml as they see fit. There is no vendor lock-in here. Just as with Biml-generated SSIS projects, there is no difference between a hand-written ADF solution and a Biml-generated one, other than that the Biml-generated solution is probably more consistent.

And have I mentioned the time savings? There is a reason why Varigence gives out shirts that say “It’s Monday and I’m done for the week.”

We made changes and regenerated our pipelines a few times, which would have taken hours without Biml. With Biml, it was no big deal.

Thanks to Levi for letting me share some of his code, and for working with me on this project!

 

Copying data from On Prem SQL to ADLS with ADF and Biml – Part 1

Apologies for the overly acronym-laden title as I was trying to keep it concise but descriptive. And we all know that adding technologies to your repertoire means adding more acronyms.

My coworker Levi and I are working on a project where we copy data from an on-premises SQL Server 2014 database and land it in Azure Data Lake Store. Then we use Polybase to get the data into Azure SQL Data Warehouse and build a dimensional model. I’ve done a couple of small projects before with Azure Data Factory, but nothing as large as this one. We had 173 tables that we needed to copy to ADLS. Then we needed to set up incremental loads for 95 of those tables going forward.

My Azure Data Factory is made up of the following components:

  • Gateway – allows ADF to retrieve data from an on-premises data source
  • Linked Services – define the connection string and other connection properties for each source and destination
  • Datasets – define a pointer to the data you want to process, sometimes defining the schema of the input and output data
  • Pipelines – combine the datasets and activities and define an execution schedule

Each of these objects is defined in a JSON file. Defining data sets and copy activities in JSON gets very tedious, especially when you need to do this for 100+ tables. Tedium usually indicates a repeatable pattern. If there is a repeatable pattern you can probably automate it. The gateway and linked services are one-time setup activities that weren’t worth automating for this project, but the datasets and pipelines definitely were.

In order to automate the generation of datasets and pipelines, we need a little help with some metadata. We had the client help us fill out an Excel spreadsheet that listed each table in our source database and the following characteristics relevant to the load to Azure:

  • Frequency (daily or hourly)
  • Changes Only (incremental or full load)
  • Changed Time Column (datetime column used for incremental loads; see the sketch just after this list)
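
For context, the incremental predicate for each table is just a filter on that changed-time column bounded by the time slice. A hedged sketch of the shape of the query – table, column, and variable names are all hypothetical stand-ins for what ADF supplies per slice:

  -- Shape of an incremental-load query for one table. @SliceStart and @SliceEnd
  -- stand in for the window boundaries ADF passes to each time slice.
  DECLARE @SliceStart datetime2 = '2017-04-11T00:00:00',
          @SliceEnd   datetime2 = '2017-04-11T01:00:00';

  SELECT SalesOrderID, CustomerID, OrderDate, ModifiedDate
  FROM dbo.SalesOrder
  WHERE ModifiedDate >  @SliceStart
    AND ModifiedDate <= @SliceEnd;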

That list, plus the metadata we retrieved from SQL Server for each table (column names and data types), was all we needed to automate the creation of the ADF datasets and pipelines with BimlScript.
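
The SQL Server side of that metadata is easy to retrieve. A minimal sketch of the kind of query that returns column names and data types for every table (adjust as needed):

  -- Column names and data types for every user table in the source database.
  SELECT s.name  AS SchemaName,
         t.name  AS TableName,
         c.name  AS ColumnName,
         ty.name AS DataType,
         c.max_length,
         c.precision,
         c.scale,
         c.is_nullable
  FROM sys.tables t
  JOIN sys.schemas s ON s.schema_id = t.schema_id
  JOIN sys.columns c ON c.object_id = t.object_id
  JOIN sys.types ty ON ty.user_type_id = c.user_type_id
  ORDER BY s.name, t.name, c.column_id;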

This post will show how we built the data sets. The following post will show the pipelines with the copy activities.

First we need to generate the input datasets coming from SQL Server. We added some properties at the top and embedded some code nuggets to handle the values that are specific to each table.

Next we need the output datasets for Azure Data Lake Store. We use the same three properties in generating each dataset (schema, table, frequency) and add one more for scope.

Now we just need another BimlScript file that calls these two files. We broke our pipelines up into daily versus hourly and incremental versus full loads.

We used a helper code file and a separate environments file, which I’m glossing over so we can focus on the Biml for the ADF assets.  You can see that we read in the inputs from Excel and write some counts to a log file, just to make sure everything is working as intended. Starting on line 41 is where we generate the datasets. On lines 54 and 55, we use the CallBimlScript function to call the two files above. We end up generating datasets for the tables that are a full load each day and their counterpart datasets for the files we create in ADLS. The datasets for daily incremental loads are generated on lines 69 and 70. Then we do the hourly full loads and hourly incremental loads.  I’ll discuss lines 100 – 119 in my next post.

The Results

We were able to write the BimlScript and generate the datasets and pipelines in about 35 hours. A previous ADF project without automation took about 3 hours per source table. If we had gone that route, we could have been looking at 350 – 500 hours to complete this part of the project. Visual Studio with Biml Express took about 5 minutes to generate everything. Deploying to Azure took about an hour. We are now looking into ARM templates for future deployments.

Stay tuned for part 2 where I show how we generated the ADF pipelines.

My Thoughts on SQL Saturday #596 – Denver BI

I had the pleasure of attending SQL Saturday Denver – BI this past weekend. They even let me help out a bit with registration and other volunteer tasks. This SQL Saturday was an experiment of sorts to prove out Steve Jones’s idea of slimmer SQL Saturdays. We had two tracks and 80 – 100 attendees. Steve would like to see each city be able to do 4 SQL Saturdays a year (which is currently against the rules), but keep them slim.

I think it’s great that Steve and Carlos put together the event for about $650 (and I heard it would have been close to $300, but they decided to do a speaker dinner to use up some sponsor money). This should show other organizers that their event doesn’t have to be big and expensive to be considered successful. Everyone had a good time and learned new things, and the venue was nice. They worked with a local university to get the space for free, which is much easier to do when you only need three rooms and a hallway. The quality of speakers was still quite high (Peter Meyers, Melissa Coates, Steve Wake, and others).

Part of the slimmer SQL Saturday is that they didn’t provide lunch. But our venue was within walking distance of several places, and it was nice to take a walk and get whatever food I wanted.

My Concerns and Things I’m Still Pondering

Here’s what I didn’t love or what I need more time to consider compared to other SQL Saturdays:

  • A lot of SQL family didn’t attend because they weren’t speaking and didn’t want to take up a spot for someone else who might be attending for the first time or needs the learning opportunities. For me, SQL Saturdays are about learning and community. I missed some of my SQL people. Having slimmer SQL Saturdays also means that the range of topics isn’t as broad, and there may be less incentive for more experienced people to attend (outside of the community aspect) if most/all of your topics are beginner level.
  • The little things matter to me. I ended up printing session evaluations so that speakers could get feedback and making sure people knew they could submit feedback online. Would the event have been fine without evals? Yes. But do some speakers very much want feedback from the audience, especially when trying out new sessions? Yes. If they had warned the speakers ahead of time, the speakers could have grabbed a few trusted people and asked them to attend their talk and provide feedback, making this a non-issue. I think whatever you can do to make things run smoothly and give people a good experience is usually worth it. Evals fall into that for me, but I fully acknowledge that they do not make or break the event.
  • There is still some overhead associated with planning even a small event. You still have to secure a venue, choose speakers and set the schedule, market the event, and spend your Saturday running the event. This is fine, sometimes even fun. I have organized 5 SQL Saturdays, and enjoyed it. But it is still time-consuming, and doing 4 of them a year makes me feel tired just thinking about it. If you can assemble a team of volunteers where 2 – 3 people plan and execute each event and you rotate duties, that sounds reasonable to me. Not every city has such a good team, though. We are more than just SQL people and lives get busy with personal or even other professional stuff. This needs to be something that isn’t overly burdensome for any one person in order to make it work.
  • Someone else made the comment “If we do these quarterly, what’s the difference between this and user group meetings? You would spend about 8 hours a quarter during the week attending meetings or 8 hours in one day attending a slimmed down SQL Saturday.” I can understand that thought process. I think of SQL Saturdays as a special once/twice a year thing. I don’t know that smaller/more frequent SQL Saturdays are better or worse than the norm, just different. I imagine that each city would find their own way to differentiate the value of SQL Saturday vs the user group.

My Takeaways

I hope this helps prove that a small event can be a great event. Do not feel like a failure just because your event doesn’t have 350 attendees or because you couldn’t get shirts and expensive gifts for the speakers and volunteers. I will admit that there was a bit of pressure to be bigger and better each year that I organized SQL Saturday KC, but that was almost entirely self-imposed. This was a good experience to help me really understand what is essential versus what is nice to have. When it comes down to it, having a slim but well planned SQL Saturday is better than not having one at all. That being said, if your SQL Saturday is large and well-funded and making people happy, don’t change a thing. Slimmer SQL Saturdays provide alternatives for events with smaller markets and/or smaller sponsorship availability.

I am now a fan of letting people get their own lunches, if your venue is in a location that can support it. Lunch at SQL Saturday KC was always expensive and took several volunteers to set up (taking money, placing orders, having food delivered and set up, accommodating dietary restrictions). And there were always people who felt like it wasn’t worth the $10 or $15 and wanted to complain to us afterward. As an organizer, I like the idea of skipping the headache of lunch and giving people the freedom to go get what they want. Plus it’s nice to take a walk after sitting in sessions all morning.

I love the partnership with the University of Denver, not just the use of their space. Part of the agreement made in getting the space at no cost for SQL Saturday Denver – BI was that there would be sessions that were relevant and accessible for some of the students. Although there are more and more higher education programs focused on BI and data science, I still think the opportunity to get applied learning from the “real world” is valuable for them. I hope to see more SQL Saturdays partner with colleges and universities in the future.

I give this slimmer SQL Saturday two thumbs up and think others should consider it an option. Each event organizer should decide what’s important to them and make it happen. But know that you can have a good event for less than $1000 and minimal time spent planning if that is all you have.

Please Lend Me Your Vote for Documentation of TMSCHEMA DMVs

I spent a good bit of time looking for the definitions/descriptions of the TMSCHEMA DMVs that allow us to view metadata and monitor the health of SSAS 2016 tabular models. As far as I can tell, there are no details about them on any Microsoft site. Many of the columns are obvious, but there are a few fields that show IDs rather than descriptions (e.g., ExplicitDataType in TMSCHEMA_COLUMNS, Type in TMSCHEMA_DATA_SOURCES). It would be great to get these DMVs documented similarly to the MDSCHEMA DMVs, as they are quite useful for tasks like documenting your tabular model. Since the TMSCHEMA DMVs work in Azure Analysis Services as well, I have logged this request on the Azure AS User Voice. Please lend me a vote so we can make this information more easily available.
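
If you haven’t used them, the TMSCHEMA DMVs are queried with a simple SQL-like syntax against the tabular instance (for example, in an MDX query window in SSMS connected to the model). A minimal example:

  -- Returns one row per column in the deployed tabular model.
  -- Fields such as ExplicitDataType come back as numeric IDs with no published
  -- descriptions, which is exactly what I would like to see documented.
  SELECT * FROM $SYSTEM.TMSCHEMA_COLUMNS;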

How can we improve Microsoft Azure Analysis Services?

Document TMSCHEMA DMVs

The DMVs for SSAS Tabular (Azure and SQL 2016) are not documented anywhere. While the meaning of many of the fields is obvious, there are a few that are just IDs for which it would be nice to see all possible values and descriptions. It would make sense to add the definitions here: https://msdn.m…