Azure, Databricks, Microsoft Technologies, Unity Catalog

Creating a Unity Catalog in Azure Databricks

Unity Catalog in Databricks provides a single place to create and manage data access policies that apply across all workspaces and users in an organization. It also provides a simple data catalog for users to explore. So when a client wanted to create a place for statisticians and data scientists to explore the data in their data lake using a web interface, I suggested we use Databricks with Unity Catalog.

New account management and roles

There are some Databricks concepts that may be new to you in Azure when you use Unity Catalog. While Databricks workspaces are mostly the same, you now have account (organization)-level roles. This is necessary because Unity Catalog crosses workspaces.

Databricks account covers all workspaces in your Azure tenant. There are 3 account-level roles: account admin, metastore admin, account user. Underneath the account are one or more workspaces. Workspace roles include workspace admins and workspace users.
There are two levels of accounts and admins in Azure Databricks

Users and service principals created in a workspace are synced to the account as account-level users and service principals. Workspace-local groups are not synced to the account. There are now account-level groups that can be used in workspaces.

To manage your account, you can go to https://accounts.azuredatabricks.net/. If you log in with an AAD B2B user, you’ll need to open the account portal from within a workspace. To do this, go to your workspace and select your username in the top right of the page to open the menu. Then choose the Manage Account option in the menu. It will open a new browser window.

The user menu in a Databricks workspace with the fourth option, Manage Account emphasized.
To launch the account console, choose Manage Account from the menu under your username in a workspace

Requirements

To create a Unity Catalog metastore, you'll need a Databricks workspace on the Premium pricing tier, an ADLS Gen2 storage account with a container for the metastore, and the account admin role (more on that below).

The pricing tier is set on the Basics page when creating the Databricks workspace.

The Basics page of the Create an Azure Databricks workspace workflow in the Azure Portal has a pricing tier option near the bottom. Make sure this is set to Premium (+ Role-based access controls).
If you plan to use Unity Catalog, be sure to select the Premium pricing tier for your Databricks workspace

High-level steps

  1. Create a storage container in your ADLS account.
  2. Create an access connector for Azure Databricks.
  3. Assign the Storage Blob Data Contributor role on the storage account to the managed identity of the access connector for Azure Databricks.
  4. Assign yourself account administrator in the Databricks account console.
  5. Create a Unity Catalog metastore.
  6. Assign a workspace to the Unity Catalog metastore.

The storage container holds the metadata and any managed data for your metastore. You’ll likely want to use this Unity Catalog metastore rather than the default hive metastore that comes with your Databricks workspace.
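
If you prefer to script the container creation (step 1), here's a minimal sketch using the Az.Storage module. The resource group, storage account, and container names are placeholders for illustration.

# Resource group, storage account, and container names are placeholders
$ctx = (Get-AzStorageAccount -ResourceGroupName 'rg-unitycatalog' -Name 'myadlsaccount').Context
New-AzStorageContainer -Name 'unity-catalog-metastore' -Context $ctx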

The access connector will show up as a separate resource in the Azure Portal.

The Azure Portal showing two resources in a list. The type of the first resource is Access Connector for Azure Databricks. The type of the second resource is Azure Databricks Service.

You don’t grant storage account access to end users – you’ll grant user access to data via Unity Catalog. The access connector allows the workspace to access the data in the storage account on behalf of Unity Catalog users. This is why you must assign the Storage Blob Data Contributor role to the access connector's managed identity.

The Add role assignment page for Access Control on the storage account. The selected role is Storage Blob Data Contributor. Assign access to is set to Managed Identity. The member listed is DBxAccessConnector, which is the access connector for Azure Databricks.
Assigning the Storage Blob Data Contributor role to the managed identity for the Access Connector for Azure Databricks
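
If you'd rather script the role assignment (step 3 above), here's a minimal sketch using New-AzRoleAssignment. The object ID and scope values are placeholders; copy the managed identity's object ID from the access connector resource in the portal.

# Object ID of the access connector's system-assigned managed identity (placeholder value)
$connectorPrincipalId = '00000000-0000-0000-0000-000000000000'
# Resource ID of the storage account (placeholder value)
$storageAccountId = '/subscriptions/<sub-id>/resourceGroups/<rg-name>/providers/Microsoft.Storage/storageAccounts/<storage-account>'

# Grant the access connector's identity the Storage Blob Data Contributor role on the storage account
New-AzRoleAssignment -ObjectId $connectorPrincipalId `
    -RoleDefinitionName 'Storage Blob Data Contributor' `
    -Scope $storageAccountId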

The confusing part of setup

If you are not a Databricks account administrator, you won’t see the option to create a metastore in the account console. If you aren’t an AAD Global Admin, you need an AAD Global Admin to log into the Databricks account console and assign your user to the account admin role. It’s quite possible that the AAD Global Admin(s) in your tenant don’t know and don’t care what Databricks or Unity Catalog is. And most data engineers are not global admin in their company’s tenant. If you think this requirement should be changed, feel free to vote for my idea here.

Once you have been assigned the account admin role in Databricks, you will see the button to create a metastore in the account console.

The Databricks account console has a large green button labeled Create a metastore, which is only visible to account admins.
The Create a metastore button is only available for Databricks account admins

One or multiple metastores?

The Azure documentation recommends creating only one metastore per region and assigning that metastore to multiple workspaces. The current recommended best practice is to have one metastore that spans environments, business units, and teams. Currently, there is no technical limitation keeping you from creating multiple metastores. The recommendation points you toward a single, centralized place to manage data and permissions.

Within your metastore, you can organize your data into catalogs and schemas. So it could be feasible to use only one metastore if you have all Databricks resources in one region.

In my first metastore, I’m using catalogs to distinguish environments (dev/test/prod) and schemas to distinguish business units. In my scenario, each business unit owns their own data. Within those schemas are external tables and views. Because it is most likely that someone would need to see all data for a specific business unit, this makes it easy to grant user access at the schema level.

I’ll save table creation and user access for another post. This post stops at getting you through all the setup steps to create your Unity Catalog metastore.

If you prefer learning from videos, I’ve found the Unity Catalog videos on the Advancing Spark YouTube channel to be very helpful.

Azure, Azure Data Factory, Microsoft Technologies

The Reason We Use Only One Git Repo For All Environments of an Azure Data Factory Solution

I’ve seen a few people start Azure Data Factory (ADF) projects assuming that we would have one source control repo per environment, meaning that you would attach a Git repo to Dev, and another Git repo to Test and another to Prod.

Microsoft recommends against this, saying:

“Only the development factory is associated with a git repository. The test and production factories shouldn’t have a git repository associated with them and should only be updated via an Azure DevOps pipeline or via a Resource Management template.”

Microsoft Learn content on continuous integration and delivery in Azure Data Factory

You’ll find the recommendation of only one repo connected to the development data factory echoed by other ADF practitioners, including myself.

We use source control largely to control code changes and for disaster recovery. I think the desire to use multiple repos is about disaster recovery more than anything. If something bad happens, you want to be able to access and re-deploy your code as quickly as possible. And since we start building in our repo-connected dev environment, some people feel “unprotected” in higher environments.

But Why Not Have All the Repos?

For me, there are two main reasons to have only one repository per solution, tied to the development data factory.

First, having multiple repos adds more complexity with very few benefits.

Deployment Complexity

Having a repo per environment adds extra work in deployment. I can see no additional benefits for deployment from having a repo per environment for most ADF solutions. I won’t say never, but I have yet to encounter a situation where this would help.

When you deploy your data factory, whether you use ARM templates or individual JSON files, you are publishing to the live ADF service. This is what happens when you publish from your collaboration branch in source control to the live version in your development data factory. If you follow normal deployment patterns, you deploy from the main (if you use JSON files) or adf_publish (if you use the ARM template) branch in source control to the live service in Test (or whatever your next environment is). If your Test data factory is connected to a repo, you need to figure out how to get your code into that repo.

Would you copy your code to the other repo before you deploy to the service? What if something fails in your deployment process before deployment to the live service is complete? Would you want to revert your changes in the Git repo?

You could deploy to the live service first and skip that issue. But you still need to decide how to merge your code into the Test repo. You’ll need to handle any merge conflicts. And you’ll likely need to allow unrelated histories for the merge to work, so when you look back in your commit history, it probably won’t make sense.

A comparison of the normal process vs the process with one repo per environment. In the normal process, you work in ADF Studio and save to your Git repo. You publish to higher environments from that Git repo. With multiple repos, you have the extra step per environment of getting the code you just deployed to the live service into the associated environment's repo.
Comparison of having one repo vs one repo per environment shows the need for extra steps if you add repos

At best, this Test repo becomes an additional place to store the code that is currently in Test. But if you are working through a healthy development process, you already had this code in your Dev repo. You might have even tagged it for release, so it’s easy to find. Your Git repo is likely already highly available, if it is cloud-based. In my mind, this just creates one more copy of your code that can get out of date, and one more deployment step. If you just want a copy of what is in Test or Prod for safe keeping, you can always export the resource’s ARM template. But if I were to do that, I would be inclined to keep it in blob storage or somewhere outside of a repo, since I already have the code in a repo. This would allow me to redeploy if my repo weren’t available.

Then, once you have sufficiently tested your data factory in Test, would you deploy code to Prod from the Test repo or from the Dev repo?

Even if you have the discipline and DevOps/automation capabilities to support multiple repos, you likely don't want to do this unless you have requirements that mandate it. That brings me to my second reason.

Deviation from Common DevOps Practice

Having a repo per environment is a deviation from common software engineering practices. Most software engineering projects do not have separate repos per environment. They might have separate repos for different projects within a solution, but that is a different discussion.

If you have a separate repo for dev and test, what do you do about history? I think there is also a danger that people would want to make changes in higher environments instead of working through the normal development process because it seems more expedient at the time.

When you hire new data engineers or DevOps engineers (whoever maintains and deploys your data factories), you would have to explain this multiple-repo process, as it won't be what they are expecting.

Unless you have some special requirements that dictate this need, I don’t see a good reason to deviate from common practice.

Common Development Process

For a data factory project, we must define a collaboration branch, usually Main. This branch is the only branch that can publish to the live service in your Dev data factory. When you need to update your data factory, you make a (hopefully short-lived) feature branch based off of your collaboration branch. My preference for a medium to large project is to have the Main branch, an Integration branch, and one or more feature branches. The Integration branch brings multiple features together for testing before the final push to Main. On smaller projects with one or two experienced developers, you may not need the integration branch. I find that I like the integration branch when I am working with people who are new to ADF, as it gives me a chance to tweak and execute new pipelines before they get to Main.

Developers work in the feature branches and then merge into the integration branch as they see fit. They resolve any errors and make any final changes in integration and then create a pull request to get their code into Main. Once the code is merged into Main and published to the live service (either manually or programmatically), the feature branches and Integration branch are deleted, preparing you to start the next round of development. Triggering the pipelines in the live service after publishing gives you a more realistic idea of execution times as ForEach activities may not run in parallel when executed in debug mode.

We start with the Main branch. Integration and feature branches start as a copy of Main. Changes are made to Feature branches. When feature development is complete, the feature branch is merged into Integration. Once all features are merged into Integration, final review and updates are made, and then code is deployed to Main and published to the service.
The development starts by creating feature branches and an integration branch from Main. Code is merged into Integration by the developer. Code in integration is moved to Main via pull request. Code from Main is published to the live service.

The code in Main should represent a version of your data factory that is ready to be deployed to Test. Code is deployed from Dev to Test according to your preference—I won’t get into all the options of JSON files vs ARM templates and DevOps pipelines vs PowerShell/custom code in this post.

You perform unit testing, integration testing, and performance testing (and any other type of testing as well, but most people aren’t really doing these three in any sufficient manner) in your Test data factory. If everything looks good, you deploy to Production. If there are issues, you go back to your development data factory, make a new feature branch from Main, and fix the issue.

If you find a bug in production, and you can’t use the current version of code in Main, you might want to create a hotfix/QFE. Your hotfix is still created in your development data factory. The difference is that instead of creating a feature branch from Main, you create the branch from the last commit deployed to production. If you are deploying via ARM templates, you can export the ARM template from that hotfix branch and manually check it in to the adf_publish branch. If you deploy from JSON files, selective deployment is a bit easier. I like to use ADF Tools for deployment, which allows me to specify which files should be deployed, so I can do a special hotfix deployment that doesn’t change any other objects that may have already been updated in Main in anticipation of the next deployment.
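
For reference, a deployment with the ADF Tools PowerShell module has roughly the following shape. This is only a sketch: the module name is azure.datafactory.tools, the folder and resource names are placeholders, and you should check the module's documentation for the options that control selective deployment of specific files.

# One-time setup: install the community module that deploys ADF from JSON files
Install-Module -Name azure.datafactory.tools -Scope CurrentUser

# Deploy the JSON files in the working folder (e.g., from a hotfix branch) to the target factory.
# Folder, resource group, factory, and location values are placeholders.
Publish-AdfV2FromJson -RootFolder 'C:\repos\adf\DataFactory' `
    -ResourceGroupName 'rg-adf-prod' `
    -DataFactoryName 'adf-mysolution-prod' `
    -Location 'East US'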

In Summary

Having a repo per environment doesn’t technically break anything, but it adds complexity without significant benefits. It adds steps to your deployment process and deviates from industry standards. I won’t go so far as saying “never”, as I haven’t seen every project scenario. If you were considering going this route, I would encourage you to examine the reasons behind it and see if doing so would actually meet your needs and if your team can handle the added complexity.

Azure, Microsoft Technologies, SQL Server

Check if File Exists Before Deploying SQL Script to Azure SQL Managed Instance in Azure Release Pipelines

I have been in Azure DevOps pipelines a lot recently, helping clients set up automated releases. Many of my clients are not in a place where automated build and deploy of their SQL databases makes sense, so they deploy using change scripts that are reviewed before deployment.

We chose release pipelines over YAML pipelines because they were easier for the client to manage and pretty quick to set up. While I had done this before, I had a couple of new challenges:

  • I was deploying to an Azure SQL managed instance that had no public endpoint.
  • There were multiple databases for which there may or may not be a change script to execute in each release.

This took a bit longer than I expected, and I enlisted my coworker Bill to help me work through the final details.

Setup

There are a few prerequisites to make this work.

Self-hosted Agent

First, you need to run the job on a self-hosted agent that is located on a virtual machine that has access to the managed instance. This requires downloading and installing the agent on the VM. We ended up installing this on the same VM where we have our self-hosted integration runtime for Azure Data Factory, at least for deployment from dev to test. This was ok because the VM had enough resources to support both, and deployments are not long or frequent.

When you create a new self-hosted agent in Azure DevOps, a modal dialog appears, which provides you with a link to download the agent and a command line script to install and configure the agent.

Dialog from Azure DevOps titled "Get the agent". There are options for Windows, macOS, and Linux. There is a button to copy a link to download the agent. Then there is a set of command line scripts to create the agent, and configure the agent.
Dialog showing the instructions to download and install the Azure Pipelines Self-hosted agent

We configured the agent to run as a service, allowing it to log on as Network Service. We then could validate it was running by opening the Services window and looking for Azure Pipelines Agent.
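
You can also check for the service from PowerShell on the VM. The display name filter below is an assumption based on the agent's default naming, which typically starts with "Azure Pipelines Agent".

# Look for the agent service (display name typically starts with "Azure Pipelines Agent")
Get-Service | Where-Object { $_.DisplayName -like 'Azure Pipelines Agent*' }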

You must add an inbound firewall rule for the VM to allow the traffic from the IP addresses related to dev.azure.com. This can be done in the Azure portal on the Networking settings for the VM.

Then you must install the SqlServer module for PowerShell on the VM. And you must also install sqlpackage on the VM. We also updated the System Path variable to include the path to sqlpackage (which currently gets installed at C:\Program Files\Microsoft SQL Server\160\DAC\bin).
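
Here's a sketch of that module install and Path update, run in an elevated PowerShell session on the VM. The sqlpackage folder is the one noted above; adjust it if your install location differs.

# Install the SqlServer PowerShell module for all users on the VM
Install-Module -Name SqlServer -Scope AllUsers

# Append the sqlpackage folder to the machine-level Path environment variable
$sqlPackagePath = 'C:\Program Files\Microsoft SQL Server\160\DAC\bin'
$machinePath = [Environment]::GetEnvironmentVariable('Path', 'Machine')
[Environment]::SetEnvironmentVariable('Path', "$machinePath;$sqlPackagePath", 'Machine')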

DevOps Permissions

Azure DevOps has a build agent with a service principal for each project. That service principal must have Contribute permissions on the repo in order to follow my steps.

Database Permissions

If you want the Azure DevOps build agent to execute a SQL script, the service principal must have appropriate permissions on the database(s). The service principal will likely need to create objects, execute stored procedures, write data, and read data. But this depends on the content of the change scripts you want to execute.
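
The exact grants depend on your scripts, but a starting sketch might look like the following, run against each target database while connected as an AAD admin. The principal, server, and database names are placeholders; this uses the SqlServer module's Invoke-Sqlcmd, so add whatever authentication parameters fit your environment.

# Display name of the Azure DevOps service connection's service principal (placeholder)
$principal = 'my-devops-service-connection'

$query = @"
CREATE USER [$principal] FROM EXTERNAL PROVIDER;
ALTER ROLE db_ddladmin ADD MEMBER [$principal];
ALTER ROLE db_datareader ADD MEMBER [$principal];
ALTER ROLE db_datawriter ADD MEMBER [$principal];
GRANT EXECUTE TO [$principal];
"@

# Server and database names are placeholders; run while connected as an AAD admin
Invoke-Sqlcmd -ServerInstance 'your-managed-instance.abc123.database.windows.net' -Database 'Database1' -Query $query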

The Pipeline

In this project we had one pipeline to release some analytics databases and Azure Data Factory assets from dev to test.

Therefore, we included the latest version of the repos containing our database projects and data factory resources as artifacts in our pipeline. While the database objects and data factory resources were related, we decided they weren’t dependent upon each other. It was ok that one might fail and one succeed.

Azure DevOps release pipeline with 2 artifacts (database code and ADF code) and 2 stages (Database deployment and ADF Deployment).

Side note: I love the ADF Tools extension for Azure DevOps from SQL Player. It is currently my preferred way to deploy Azure Data Factory. I find it is a much better experience than ARM template deployment. So that is what I used here in this ADF Deploy job.

Repo Setup

Getting back to databases, we had 3 databases to which we might want to deploy changes. We decided that we would set a standard of having one change script per database. We have an Azure DevOps repo that contains one solution with 3 database projects. These databases are highly related to each other, so the decision was made to keep them all in one repo. I added a folder to the repo called ChangeScripts. ChangeScripts contains two subfolders: Current and Previous. When we are ready to deploy database changes, we add a .sql file with the name of the database into the Current folder. For any deployment, there may be 1, 2, or 3 change scripts (1 for each database that has changes). When the release pipeline runs, the change scripts in the Current folder are executed and the files are renamed and moved into the Previous folder. This is why my DevOps service principal needed Contribute permissions on the repo.

Here are the tasks in my Database Deployment job:

  1. Check if SQL Scripts exist in the expected location.
  2. If a script exists for Database1, execute the script.
  3. If a script exists for Database2, execute the script.
  4. If a script exists for Database3, execute the script.
  5. Rename the change scripts to fit the pattern [DatabaseName][yyyyMMddHHmm].sql and move them to the Previous folder.

Configuring the Tasks

The first thing we must do is set the agent job to run using the agent pool containing the self-hosted agent. There is another important setting under Additional options: Allow scripts to access the OAuth token. The scripts in this job use that token to authenticate back to Azure DevOps when the final task commits the moved change scripts to the repo, so make sure that is checked.

My first task, called Check for SQL Scripts, is a PowerShell task. I used an inline script, but you can also reference a file path that is accessible by the agent. My script looks for the three files and then sets a pipeline variable to true or false for each file to indicate whether it is present. These variables are used later in the process.

My task is configured with the working directory set to $(System.DefaultWorkingDirectory)/_Databases/ChangeScripts/Current. There are no environment variables or output variables.

Here’s my PowerShell script:

# File names of the expected change scripts in the Current folder
$DB1Script = "Database1.sql"
$DB2Script = "Database2.sql"
$DB3Script = "Database3.sql"

# Check whether each change script exists
$IsDB1Script = Test-Path -Path $DB1Script -PathType Leaf
$IsDB2Script = Test-Path -Path $DB2Script -PathType Leaf
$IsDB3Script = Test-Path -Path $DB3Script -PathType Leaf

# Echo the results to the release log for troubleshooting
Write-Host $IsDB1Script $IsDB2Script $IsDB3Script

# Set pipeline variables used by the custom conditions on the later tasks
Write-Output ("##vso[task.setvariable variable=DB1Script;]$IsDB1Script")
Write-Output ("##vso[task.setvariable variable=DB2Script;]$IsDB2Script")
Write-Output ("##vso[task.setvariable variable=DB3Script;]$IsDB3Script")

We got stuck on the syntax for setting variables for a while. Most of the things we found in Microsoft Docs and on blogs were close, but didn’t work. Finally, we found this blog, which showed us the syntax that worked – be careful with the quotes, semi-colon, and parentheses.

My second task executes the script for Database1. It uses the Azure SQL Database deployment task. When you use this task, you can specify whether it deploys a dacpac or executes a SQL script. We chose to execute a SQL script that we stored in the same repo as the database projects. For this task, you must specify the Azure Subscription, the target server, and the target database. I set my authentication type to Service Principal, since I had granted access to the Azure DevOps service principal during setup. My Deploy type was set to SQL Script File. And my SQL Script location was set to $(System.DefaultWorkingDirectory)/_Databases/ChangeScripts/Current/Database1.sql. An important configuration on this task is to set Run this task to Custom Conditions. I specified my custom condition as eq(variables['DB1Script'], 'True'). This tells the task to execute only if the variable we set in the previous task is equal to True. It seems data types don’t really matter here – I set the variable in the PowerShell script as a boolean, but it was converted to a string in Azure DevOps.

My third and fourth tasks do the same activities for Database2 and Database3.

My last task, called Move Change Scripts to Previous Folder, is another inline PowerShell script. The script uses the variables set in the first task to determine if the file exists and needs to be moved. Because the script is working on a local copy on the VM, it must commit the changes back to the git repo after renaming and moving the file(s). The Working Directory for this task is set to $(System.DefaultWorkingDirectory)/_Databases/ChangeScripts.

Here’s the code:

# Determine which change scripts exist, rename them with a timestamp, and move them to the Previous folder
Write-Host "Getting Files"
$path = Get-Location
Write-Host $path

$File1 = "Current/Database1.sql"
$File2 = "Current/Database2.sql"
$File3 = "Current/Database3.sql"

# Check whether each change script exists in the Current folder
$IsDB1Script = Test-Path -Path $File1 -PathType Leaf
$IsDB2Script = Test-Path -Path $File2 -PathType Leaf
$IsDB3Script = Test-Path -Path $File3 -PathType Leaf

# Timestamp appended to the moved file names
$DateString = Get-Date -Format "yyyyMMddHHmm"

$NewFile1 = "Previous/Database1$($DateString).sql"
$NewFile2 = "Previous/Database2$($DateString).sql"
$NewFile3 = "Previous/Database3$($DateString).sql"

Write-Host "Changing file locations"

# Identify the committer for the commit made by the release pipeline
git config --global user.email "myemail@domain.com"
git config --global user.name "My Release Pipeline"

# Rename and move each change script that exists
if ($IsDB1Script)
{
    git mv $File1 $NewFile1
    Write-Host "Moved Database1 script"
}

if ($IsDB2Script)
{
    git mv $File2 $NewFile2
    Write-Host "Moved Database2 script"
}

if ($IsDB3Script)
{
    git mv $File3 $NewFile3
    Write-Host "Moved Database3 script"
}

# Commit and push the moves back to the repo
git commit -a -m 'Move executed change scripts to Previous Folder'
git push origin HEAD:master

Anything you output from your PowerShell script using Write-Host is saved in the release logs. This is helpful for troubleshooting if something goes wrong.

As you can see above, you can execute git commands from PowerShell. This Stack Overflow question helped me understand the syntax.

And with those five steps, I’m able to execute change scripts for any or all of my three databases. We don’t anticipate adding many more databases, so this configuration is easily manageable for the situation. I wouldn’t do this (one step per database) for a situation in which I have dozens or hundreds of databases that may be created or deleted at any time.

Azure, Azure Data Factory, KQL, Microsoft Technologies

Looking at Activity Queue Times from Azure Data Factory with Log Analytics

I’ve been working on a project to populate an Operational Data Store using Azure Data Factory (ADF). We have been seeking to tune our pipelines so we can import data every 15 minutes. After tuning the queries and adding useful indexes to target databases, we turned our attention to the ADF activity durations and queue times.

Data Factory places the pipeline activities into a queue, where they wait until they can be executed. If your queue time is long, it can mean that the Integration Runtime on which the activity is executing is waiting on resources (CPU, memory, networking, or otherwise), or that you need to increase the concurrent job limit.

You can see queue time in the ADF Monitor by looking at the output of an activity.

Azure Data Factory pipeline execution details in the ADF monitor. The activity output is open, showing the resulting JSON. There is a property called durationInQueue, which contains the queue time in seconds. The result shown is 2 seconds.
Output from a stored procedure activity showing 2 seconds of queue time

But what if you want to see activity queue times across activities, across pipelines, or even across data factories? Then you need to output your logs to somewhere that makes this easier.

Send ADF Logs to Log Analytics

You can output your ADF logs to a storage account, to Log Analytics, to an event hub, or to a partner solution. I prefer Log Analytics because it’s easy to query and look for trends using KQL.

To configure the output to Log Analytics, you must create a Log Analytics workspace (if you don’t have an existing one) and add a diagnostic setting to the data factory resource. Once you have data feeding into Log Analytics, you can query it.
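
If you want to script that diagnostic setting, here is a rough sketch using the Set-AzDiagnosticSetting cmdlet from the Az.Monitor module (newer versions of the module replace it with New-AzDiagnosticSetting, so check the docs for your version). The resource IDs and setting name are placeholders, and the categories shown are the ADF log categories this post relies on.

# Resource IDs are placeholders
$adfId = '/subscriptions/<sub-id>/resourceGroups/<rg-name>/providers/Microsoft.DataFactory/factories/<factory-name>'
$workspaceId = '/subscriptions/<sub-id>/resourceGroups/<rg-name>/providers/Microsoft.OperationalInsights/workspaces/<workspace-name>'

# Send activity, pipeline, and trigger run logs from the data factory to Log Analytics
Set-AzDiagnosticSetting -Name 'adf-to-loganalytics' `
    -ResourceId $adfId `
    -WorkspaceId $workspaceId `
    -Category 'ActivityRuns','PipelineRuns','TriggerRuns' `
    -Enabled $true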

If you choose resource-specific destination tables in the diagnostic setting, you will find a table in Log Analytics called ADFActivityRun. This table contains a column called Output. The Output column contains the JSON we see in the ADF Studio Monitor app.

KQL has functions for parsing JSON and retrieving only the JSON objects I want to include. This means that I could write a query like the following.

ADFActivityRun
| extend queuetime = extractjson('$.durationInQueue.integrationRuntimeQueue',Output, typeof(int))
| where Status == 'Succeeded'
| where queuetime > 0
| project Start, End, PipelineName, ActivityType, ActivityName, dur = datetime_diff('second', End, Start), queuetime, PipelineRunId, ActivityRunId
| sort by queuetime desc

This query gives me a list of activities with successful executions that have queue times greater than zero. I can choose from any columns in the ADFActivityRun table, including the pipeline and activity names, start and end times, activity types, and run IDs. Duration is not an available column, so I calculated it as the difference between the start and end times. The queue time is buried in the JSON in the Output column, so I used the extractjson function to get the durationInQueue value.

Now that I know how to get the queue duration, I can look for trends across various slices of data. A query to get average queue time by activity type might look like the one below.

ADFActivityRun
| where Status == 'Succeeded'
| where startofday(Start) == datetime(2022-01-04)
| extend queuetime = extractjson('$.durationInQueue.integrationRuntimeQueue', Output, typeof(int))
| summarize avg_queuetime = avg(queuetime) by ActivityType
| sort by avg_queuetime desc

In this query, I am retrieving activities with successful executions that started on January 4, 2022. I added my calculation to retrieve queue time. Then I calculated average queue time by activity type using the summarize operator and sorted the result descending by queue time.

I could have filtered on any other available column: pipeline name, activity name, integration runtime, resource ID, and more.

This information is available via the API as well, but a Log Analytics workspace can be spun up and running in minutes without having to write code to connect.

If you are used to writing SQL, the transition to KQL isn’t too bad. Check out the SQL to Kusto query translation page in Microsoft Docs.

Azure, Azure Data Factory, Microsoft Technologies

When You Can’t Change the Connected Git Repo on ADF

I was working on an Azure Data Factory project for a client who is new to ADF, and there was a miscommunication about the new Git Repo to be used for source control. Someone had created a new project and repo instead of using the existing one created for this purpose. This isn’t a big deal, as it’s easy enough to change in ADF Studio.

The Git Configuration page in Azure Data Factory Studio shows the connected repository and has a button to disconnect from the repository.
The Management Hub in Azure Data Factory Studio contains the Git configuration settings

In the Management Hub, you can change the Git configuration for the data factory, and there is a button near the top to disconnect the repo. You may need to do this if you run into conflicts in the publish branch or when you need to change repos.

It should be as easy as that, but I ran into a situation where it wasn’t.

Disconnect Button Unavailable

When I arrived at the Git configuration page, I found the Disconnect button to be disabled. This was confusing as I am an Owner and Data Factory Contributor on this resource.

I asked my client to disconnect the repo and moved on with the project, but I also logged some feedback for the Data Factory team. You can do this by selecting the Feedback button near the top right of the page in Azure Data Factory Studio. I have done this a few times and always received a prompt response.

The feedback button in ADF Studio is selected. It opens a new dialog that allows you to choose from two options: "I have feedback" and "I have a feature suggestion". From there, you can enter your email address and your feedback.
You can send feedback from within Azure Data Factory Studio

I sent my feedback and received an email response within a couple of days. It contained a few helpful bits of information.

To disconnect git from the ADF management tab, you must be in git mode and be able to access the repository. If you are not able to access the repository but have permissions to update the factory, you can remove the git configuration via the REST API. Here is documentation on this API call: Factories – Configure Factory Repo – REST API (Azure Data Factory) | Microsoft Docs. In this case, repoConfiguration should be set to null in the request body, and the rest of the PUT body should be the same as the existing settings.

Alternatively, the git connection can be removed from the ADF management tab by either gaining access to the repository or having another ADF user with access remove the configuration. I also understand that this is not an ideal experience, so I have filed a work item to remove the requirement to have access to the repository.

This identified my situation exactly. I did not have access to the new repo that had been connected to the data factory. That caused me to be unable to disconnect it.

Hopefully the work item that removes the requirement for repo access before you can disconnect the data factory will be completed soon, but if you run into this issue, you can resolve it on your own using the API or a colleague with repo access.
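
If you end up using the REST API, a rough sketch with Invoke-AzRestMethod might look like the following. Treat the path, API version, and body shape as assumptions to verify against the Configure Factory Repo documentation mentioned in the response above; the subscription, location, and factory values are placeholders.

# All IDs and names below are placeholders; verify the path and api-version against the linked docs
$body = @{
    factoryResourceId = '/subscriptions/<sub-id>/resourceGroups/<rg-name>/providers/Microsoft.DataFactory/factories/<factory-name>'
    repoConfiguration = $null   # null removes the git configuration
} | ConvertTo-Json

Invoke-AzRestMethod -Method POST `
    -Path '/subscriptions/<sub-id>/providers/Microsoft.DataFactory/locations/<region>/configureFactoryRepo?api-version=2018-06-01' `
    -Payload $body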

Azure, Azure Data Factory, DCAC, Microsoft Technologies

Slides and Video from Building a Regret-free Foundation for your Data Factory Now Available

Last week, Kerry and I delivered a webinar with tips on how to set up your Data Factory. We discussed version control, deployment, naming conventions, parameterization, documentation, and more.

Here’s our agenda from the presentation.

Slide listing the top regrets of Data Factory users:

  • Poor resource organization in Azure
  • Lack of naming conventions
  • Inappropriate use of version control
  • Tedious, manual deployments
  • No/inconsistent key vault usage
  • Misunderstanding integration runtimes
  • Underutilizing parameterization
  • Lack of comments and documentation
  • No established pipeline design patterns

List of top regrets from Data Factory users that they wish they had understood from the beginning

If you missed the webinar, you can watch it online now. Just go to the DCAC website, fill in the required fields with your info, and the video will be shown.

If you’d like a copy of the slides, you can download the PDF here. There is a list of helpful links at the end that you may want to check out.

I hope you enjoyed our webinar. Leave me a comment if you have other experiences with ADF where a design or configuration choice you made in the beginning was difficult or tedious to fix later. Help other ADF developers avoid those mistakes.

Azure, Azure Data Factory, Azure SQL DB, Microsoft Technologies, PowerShell

Thoughts on Unique Resource Names in Azure

Each resource type in Azure has a naming scope within which the resource name must be unique. For PaaS resources such as Azure SQL Server (the server for Azure SQL DB) and Azure Data Factory, the name must be globally unique within the resource type. This means that you can’t have two data factories with the same name, but you can have a data factory and a SQL server with the same name. Virtual machine names must be unique within the resource group. Azure Storage account names must be globally unique. Azure SQL Database names must be unique within the server.

Since Azure allows you to create a data factory and a SQL server with the same resource name, you may think this is fine. But you may want to avoid this, especially if you plan on using system-assigned managed identities or using Azure PowerShell/CLI. And if you aren’t planning on using these things, you might want to reconsider.

I ran into this issue of resources with the same name in a client environment and then recreated it in my Azure subscription to better understand it.

I already had a data factory named adf-deploydemo-dev so I made an Azure SQL server named adf-deploydemo-dev and added a database with the same name.

A data factory named adf-deploymentdemo-dev, a SQL Server named adf-deploymentdemo-dev, and a database named adf-deploymentdemo-dev
A data factory, a SQL Database, and a SQL Server all with the same name in the same region and same resource group

Azure Data Factory should automatically create its system-assigned managed identity. It will use the resource name for the name of the service principal. When you go to create a linked service in Azure Data Factory Studio and choose to use Managed Identity as the authentication method, you will see the name and object ID of the managed identity.

Managed identity name: adf-deploymentdemo-dev. Managed identity object ID: 575e8c6e-dfe6-4b5f-91be-40b0f0b9643b
Information shown in my data factory when creating a linked service for a storage account.

For the Azure SQL Server, we can create a managed identity using PowerShell. The Set-AzSqlServer cmdlet has an -AssignIdentity parameter, which creates the system-assigned managed identity.

Executing PowerShell command: Set-AzSqlServer -AssignIdentity -ResourceGroupName 'ADFDemployDemoDev' -ServerName 'adf-deploydemo-dev'
Executing the PowerShell command to create a managed identity

If you use Get-AzSqlServer to retrieve the information and assign the Identity property to a variable, you can then see the system-assigned managed identity and its application ID.

Executing PowerShell command: $S = Get-AzSqlServer -ResourceGroupName 'ADFDemployDemoDev' -ServerName 'adf-deploydemo-dev'
$S.Identity
The results show principalID, Type, and TenantID
Verifying the managed identity is in place for an Azure SQL server.
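
For reference, here are those two commands together as a script you can adapt (the resource group and server names are the ones from my example):

# Create the system-assigned managed identity for the Azure SQL server
Set-AzSqlServer -AssignIdentity -ResourceGroupName 'ADFDemployDemoDev' -ServerName 'adf-deploydemo-dev'

# Retrieve the server and inspect the identity that was created
$S = Get-AzSqlServer -ResourceGroupName 'ADFDemployDemoDev' -ServerName 'adf-deploydemo-dev'
$S.Identity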

Now when I look in Active Directory, I can see both managed identities have the same name but different application IDs and object IDs.

Two managed identities in AAD, both called adf-deploymentdemo-dev.
Two managed service principals used for managed identities that have the same name but different IDs

Everything is technically working right now, but I have introduced some needless ambiguity that can cause misunderstandings and issues.

Let’s say that I want to grant the Storage Blob Data Reader role to my data factory. I go to the storage account, choose to add a role assignment, select the role, and then go to add members. This is what I see:

The user interface to select members to add to a role assignment shows users and service principals by name, so it contains two objects named adf-deploydemo-dev
Which managed identity belongs to the data factory?

Or let’s say that I use PowerShell to get lists of resources by name. I may be locating resources to add tags, add a resource lock, or move the resource to another region or resource group.

Executing PowerShell command Get-AzResource -Name 'adf-deploydemo-dev' | ft
Getting resources by name returns all three resources

If I don’t specify the resource type, I will get my data factory, my database, and my server in the results. You may be saying “Well, I would always specify the type.” Even if that is true, are you sure all coworkers and consultants touching your Azure resources would do the same?
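
For comparison, specifying the resource type removes the ambiguity and returns only the data factory:

# Filter by resource type so only the data factory is returned
Get-AzResource -Name 'adf-deploydemo-dev' -ResourceType 'Microsoft.DataFactory/factories' | Format-Table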

Why introduce this ambiguity when there is no need to do so?

There are some good tips in the Cloud Adoption Framework in Microsoft Docs about naming conventions. Your organization probably wants to decide up front what names are acceptable and then use Azure Policy as well as good processes to ensure adherence to your defined conventions. And if I were the consultant advising you, I would suggest that resources within your tenant be unique across resource types. The suggestion in Docs is to use a resource type abbreviation at the beginning of your resource name. That would avoid the issue I have demonstrated above. Naming conventions should be adjusted to your organization’s needs, but the ones suggested in Docs are a good place to start if you need some help. It is beneficial to have some kind of resource naming convention beyond just whatever is allowed by Azure.

Azure, Azure SQL DB, Microsoft Technologies, T-SQL

Altering a Computed Column in a Temporal Table in Azure SQL

System-versioned temporal tables were introduced in SQL Server 2016. They provide information about data stored in the table at any point in time by storing an effective dated version of each row rather than only the data that is correct at the current time.

You can alter a temporal table to add or change columns, but you must first turn off system versioning. Let’s look at an example.

CREATE TABLE [dbo].[DatabaseSize](
	 [DatabaseID] [varchar](200) NOT NULL 
	,[ServerName] [varchar](100) NOT NULL
	,[DatabaseName] [varchar](100) NOT NULL
	,[SizeBytes] [bigint] NULL
	,[SizeMB]  AS ([SizeBytes]/(1048576))
	,[ValidFrom] [datetime2](7) GENERATED ALWAYS AS ROW START NOT NULL
	,[ValidTo] [datetime2](7) GENERATED ALWAYS AS ROW END NOT NULL
	,PERIOD FOR SYSTEM_TIME (ValidFrom, ValidTo)
	,CONSTRAINT PK_DatabaseSize_DatabaseID PRIMARY KEY CLUSTERED (DatabaseID)
) WITH (SYSTEM_VERSIONING = ON (HISTORY_TABLE = [dbo].[DatabaseSizeHistory]));

Temporal tables must have a primary key defined. They also must contain two datetime2 columns, declared as GENERATED ALWAYS AS ROW START / END. The statement above creates both the current table and a history table.

The history table has the same schema as the current table, with one difference: the SizeMB column in the history table is not a computed column.

The dbo.DatabaseSize table is a system-versioned table. The DatabaseSizeHistory table is the related history table. DatabaseSizeHistory contains the same columns as DatabaseSize, except the SizeMB column is not a computed column in the history table.


When I initially created the table, I typoed the formula in the computed column. You can't alter a computed column; you must drop and recreate the column. This is no problem: just turn off system versioning, alter your table, and turn system versioning back on.

But if you try this without specifying your history table, you will find that it stops using the history table created earlier and makes a new history table.

dbo.DatabaseSize is a system-versioned table. The history table now shows as dbo.MSSQL_TemporalHistoryFor_1909581841

If you specify your history table while turning system versioning back on, you will encounter an error:

Setting SYSTEM_VERSIONING to ON failed because column 'SizeMB' at ordinal 5 in history table 'Test.dbo.DatabaseSizeHistory' has a different name than the column 'ValidFrom' at the same ordinal in table 'Test.dbo.databasesize'.

Temporal tables match the columns between the current table and history table not only by name and data type but also by ordinal position. Dropping and adding the computed column changed its order because it was added to the end of the table.

You can change the column order of a table in the SQL Server Management Studio UI by right-clicking on the table, selecting Design, and then dragging the column to the correct position. Note that you cannot do this on the system-versioned table while system versioning is on. You can either change the column order on the history table, or turn system versioning off and then change the current table.

dragging the SizeMB column to the bottom of the columns list in the table.

Once the column orders match, you can turn system versioning back on and specify the history table.

ALTER TABLE [dbo].[DatabaseSize]
SET (SYSTEM_VERSIONING = ON (HISTORY_TABLE = [dbo].[DatabaseSizeHistory]));

This time, the command should complete successfully. You’ll want to drop any unused auto-created history tables before you’re finished.

Azure, Azure Data Factory, Microsoft Technologies

Control Flow Limitations in Data Factory

Control Flow activities in Data Factory involve orchestration of pipeline activities including chaining activities in a sequence, branching, defining parameters at the pipeline level, and passing arguments while invoking the pipeline. They also include custom-state passing and looping containers.

The activities list in the ADF Author & Manage app, showing Lookup, Set variable, Filter, For Each, Switch, and more.
Control Flow activities in the Data Factory user interface

If you’ve been using Azure Data Factory for a while, you might have hit some limitations that don’t exist in tools like SSIS or Databricks. Knowing these limitations up front can help you design better pipelines, so I’m listing a few here of which you’ll want to be aware.

  1. You cannot nest For Each activities.
    Within a pipeline, you cannot place a For Each activity inside of another For Each activity. If you need to iterate through two datasets you have two main options. You can combine the two datasets before you iterate over them. Or you can use a parent/child pipeline design where you move the inner For Each activity into the child pipeline. Fun fact: currently the Data Factory UI won’t stop you from nesting For Each activities. You won’t find out until you try to execute the pipeline.
  2. You cannot put a For Each activity or Switch activity inside of an If activity.
    The Data Factory UI will prevent you from doing this by removing the For Each and Switch from the activity list. You can redesign the pipeline to put the inner activity inside a child pipeline. Also note that you can put an If activity inside of a For Each activity.
  3. You cannot use a Set Variable activity inside a For Each activity that runs in parallel.
    The Data Factory UI won’t stop you, but you’ll quickly learn that the scope of the variable is the pipeline and not the For Each or any other activity. So you’ll just overwrite the value in no particular order as the activities execute in parallel. The workaround for this is specific to your use case. You might try using an existing attribute of the item you are iterating on instead of setting a variable. Append Variable works fine, since each loop could add a value. But again, don’t count on the order being meaningful.
  4. You cannot nest If activities.
    The Data Factory UI will prevent you from nesting the If activities. If you need to have two sets of conditions, you can either combine conditions or move the inner condition to a child pipeline.
  5. You cannot nest Switch activities.
    Similar to the If activity, the Data Factory UI will prevent you from nesting Switch activities. And again, you can either combine conditions or move the inner condition to a child pipeline.
  6. You cannot put a For Each or If activity inside a Switch activity.
    The Data Factory UI will prevent you from doing this. You can move the inner activity to a child pipeline if needed.
  7. You cannot use an expression to populate the pipeline in an Execute Pipeline activity.
    It would be great to design a truly dynamic pipeline where you could have a dataset that defines which pipelines to execute, but you can't do that natively in the Data Factory UI. The Invoked Pipeline property doesn't allow dynamic expressions. If you need to dynamically execute pipelines, you can use Logic Apps or Azure Functions to execute the pipeline (see the sketch after this list for one option).
  8. You cannot dynamically populate the variable name in Set Variable and Append Variable activities.
    The Data Factory UI only allows you to choose from a list of existing variables. As a workaround, you could use an If activity to determine which variable you will populate.
  9. The Lookup activity has a maximum of 5,000 rows and a maximum size of 4 MB.
    If you need to iterate over more than 5000 rows, you’ll need to split your list between a child and parent pipeline.
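
Regarding item 7: if you do need to drive pipeline execution from a dataset or configuration list, one option is to call the ADF PowerShell cmdlets (or the REST API) from outside the pipeline, for example from an Azure Function or Automation runbook. A minimal sketch using Invoke-AzDataFactoryV2Pipeline follows; the resource group, factory, and pipeline names are placeholders.

# A configuration list of pipelines to run; in practice this might come from a database or file
$pipelinesToRun = @('PL_LoadCustomers', 'PL_LoadOrders')   # placeholder pipeline names

foreach ($pipelineName in $pipelinesToRun) {
    # Start each pipeline by name; the cmdlet returns the pipeline run ID
    Invoke-AzDataFactoryV2Pipeline -ResourceGroupName 'rg-adf-dev' `
        -DataFactoryName 'adf-mysolution-dev' `
        -PipelineName $pipelineName
}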

In addition to the items mentioned above, also note these resource limits listed in Microsoft Docs. Limits like 40 activities per pipeline (including inner activities for containers) can bite you if you aren’t careful about implementing a modular design. And if you do have a modular design with lots of pipelines calling other pipelines, be aware that you are limited to 100 queued runs per pipeline and 1,000 concurrent pipeline activity runs per subscription per Azure Integration Runtime region. I don’t hit these limits too often, but I have hit them.

This is not to say you can’t create good solutions in Azure Data Factory—you absolutely can. But Data Factory has some limitations that you might not expect if you have experience working with other data integration/orchestration tools.

Have you hit any other limits that caused you to design your pipelines differently and that you would like to share with others? If so, leave me a comment.

Azure, Azure Data Lake, Microsoft Technologies

Initial Thoughts on Dremio

I’ve been working on a project for the last few months with a client who has chosen to implement Dremio in Azure. Dremio is a data lake engine that creates a semantic layer and supports interactive queries.

Dremio logo
The Dremio logo

It uses Apache Arrow, Gandiva, and Parquet files under the hood. It runs on either Linux VMs or Kubernetes containers. Like most big data systems, there is at least one coordinator node and one or more executor nodes. These nodes communicate and are managed using Apache Zookeeper. Client applications connect to Dremio via ODBC, JDBC, REST APIs, or Arrow Flight. Dremio can read from storage accounts, external databases, and a few other sources.

Dremio stores data in the following places:

  • Metadata is stored in a RocksDB database on the coordinator node.
  • Frequently read data is cached on the executor node.
  • Memory-intensive query operations may cause an executor node to spill Arrow buffers from RAM to disk.
  • Reflections, user uploads, and query results are stored in the data lake.

Dremio is organized into spaces, which can contain folders and datasets. The key objects in Dremio are:

  • Data source – connection strings to data that should be accessed via Dremio
  • Physical Dataset – an HDFS directory or a database table
  • Virtual Dataset – a view of sorts, created using the Dremio UI or by writing SQL, that references one or more physical or virtual datasets and also provides lineage to its sources
  • Reflection – a materialized view that is transparent to users and is used to improve query performance, which seems to be implemented as Dremio querying data from the source and storing it as a parquet file for quicker access.
  • Space – a shared location for virtual datasets, a way to group related datasets and provide user access

Once you have your spaces and virtual datasets set up, it feels kind of like a database. If you connect with Power BI, virtual datasets appear as views and physical datasets appear as tables. Dremio metadata (catalogs, schemas, physical datasets, virtual datasets and columns) can be accessed using INFORMATION_SCHEMA queries, which is conveniently familiar if you are used to working with SQL Server.

Some nice features found in Dremio on Azure

  • Dremio allows Single Sign-On with AAD credentials. Permissions can be granted to individual users or AAD groups.
  • Dremio can be implemented in a virtual network in Azure. The executor nodes can use Private Link to access ADLS (Azure Data Lake Storage Gen 2) over a private endpoint.
  • Changes to virtual datasets are tracked in Dremio. It’s easy to revert to a previous version at any time.
  • Dremio gives you visibility to the jobs running queries, both for ad hoc queries from client tools and for refreshing reflections.
  • Administrators can create rules to assign queries to different queues in order to provide workload isolation and predictability for users.
  • When reviewing jobs, you can see a sort of query plan as well as which jobs were able to use a reflection to accelerate a query.
  • The lineage view for a virtual dataset is nice for understanding dependencies.
  • You can trigger refreshes of metadata or reflections via the REST API, which is handy if you have ETL processes adding new data to your data lake and you want the refreshes to occur at the end of the ETL process.

Some rough edges on Dremio in Azure

  • Dremio was initially built for AWS, not Azure. This is evident in the training materials, the product roadmap, and the knowledge of the Dremio implementation specialists. This is not to say it doesn’t work on Azure, just that the implementation is a bit rougher (e.g., no Azure templates made for you), and a couple of features are unavailable.
  • Dremio doesn’t integrate with Azure Key Vault. You store the service principal secret or storage account access key in a configuration file on the Linux VM. I’ve been told this is on the roadmap, but I didn’t hear a date when it would be available.
  • You can enable integration points on the Dremio website where you can click a button to open a connection to a virtual dataset in a BI tool such as Power BI or Tableau. For Power BI, this downloads a PBIDS file with a connection to that specific virtual dataset. This would be fine if everything you need is in this one dataset, but if you need to reference multiple virtual datasets, this is a bit annoying. Think of it like connecting to a specific database table instead of to the database in general. You might want to use that table, but you might also want to find other useful tables to combine in your Power BI model. You can open Power BI and connect to Dremio in general and navigate from there with no problems. I’m just pointing out that the buttons in the UI don’t seem that useful.
  • Dremio doesn’t support passthrough authentication on ADLS. All queries to the data lake are made in the context of the Dremio application, not the individual user. This means that you may need to set permissions twice for your data lake if you have other tools directly accessing the data lake instead of using Dremio. The idea is that most tools will connect through Dremio to take advantage of the semantic layer. But it would be nice to have, just to simplify security.

Advice we received in training

  • Unlike with nesting views in SQL Server, it’s ok to create multiple layers of virtual datasets. You want to design the semantic layer (the virtual datasets) to reuse common logic instead of repeating it across multiple views.
  • The standard design pattern for the semantic layer is to have a layer of “staging views” that have a 1-to-1 mapping to physical datasets and very little transformation outside of fixing data types and light cleansing. On top of the Staging layer is the Business layer, which includes virtual datasets containing business logic. The Business layer should handle most of the query workload. On top of the Business layer is the Application Layer. This includes virtual datasets that are purpose-built to support specific applications or reports.
  • Star schemas are not optimal in Dremio. You likely want to denormalize even more than that. This is because it is more expensive to perform a join than to search through a large number of values in a column.
  • When creating a reflection, setting the sort column is somewhat like creating an index in a SQL database. It helps prune data when applying a query filter or performing a join.
  • Reflections can be used to partition data. If you find you have a single large file, you can use a Reflection to split it by a low cardinality value to improve query performance. When you do this, it creates a parquet file per partition.
  • Reflections can be set to use an incremental refresh, but only if the data is additive and existing data is not updated.
  • You don’t need a reflection for everything. Make them as small and reusable as possible.
  • Try to avoid thousands of tiny files, and aim for a few medium to large files (MBs to GBs). This is common for most data lake engines as there is an overhead cost for file enumeration.

Some other thoughts

  • Dremio advertises that you don’t need data integration processes like you would for a data warehouse. I find this to be somewhat inaccurate for two main reasons. First, if you need to acquire data from APIs or other applications to which Dremio can’t connect, you will still need to copy data to your data lake. Second, when you use a Reflection to speed up a query, you are creating a copy of the data in your data lake stored as one or more Parquet files. Data virtualization technology hasn’t actually matured to the point of not needing ETL at all. I can see how Dremio would lessen the need for ETL, but let’s recognize that you’ll probably still need some and that Dremio is doing a bit of data loading of its own. So the question becomes where — and with what tools — you would like to do this. You can have Dremio do your transforming and loading in the form of reflections, or you can load your own data already transformed to the data lake. You will likely end up with a bit of both over time.
  • Consider the skillsets of the people who will manage the system, as well as those who will build and query the datasets. If you have a team of admins who only know Windows, they are going to need to skill up on Linux. If your BI team or analysts don’t know SQL, they will probably struggle to build the virtual datasets.
  • This system can get pretty expensive pretty fast (which is true of most big data systems). You’ll want to be sure to automate the shutdown of the nodes in dev and test environments when they are not in use, so you can save a bit of money. And remember that you can size up your nodes later if you find you don’t have adequate performance. Oversizing at the outset will waste money.
  • Dremio is a (well-funded) startup with a product that is built on several open source technologies, and they don’t seem to have a public roadmap. In my experience, they have been good about taking feedback to add to the roadmap and with sharing what is soon to be released. But if you are building your company’s BI strategy with Dremio as a key tool, you probably want more than that. It sounds like they share more with paying customers. I would want that information before making a purchasing decision.
  • Overall, I can see why Dremio has been adopted by several large companies. And I have enjoyed setting up the Azure architecture around it and building virtual datasets. I wish they would add some Azure-specific features to optimize things and make security easy, but it’s a promising platform.

More Information about Dremio

If Dremio sounds interesting to you, here are a few helpful links

This was my first project using Dremio. If you’ve used Dremio, please share your experience in the comments.