Azure, Azure Data Factory, Microsoft Technologies, Power BI

Refreshing a Power BI Dataset in Azure Data Factory

I recently needed to ensure that a Power BI imported dataset would be refreshed after populating data in my data mart. I was already using Azure Data Factory to populate the data mart, so the most efficient thing to do was to call a pipeline at the end of my data load process to refresh the Power BI dataset.

Power BI offers REST APIs to programmatically refresh your data. For Data Factory to use them, you need to register an app (service principal) in AAD and give it the appropriate permissions in Power BI and to an Azure key vault.

I’m not the first to tackle this subject. Dave Ruijter has a great blog post with code and a step-by-step explanation of how to use Data Factory to refresh a Power BI dataset. I started with his code and added onto it. Before I jump into explaining my additions, let’s walk through the initial activities in the pipeline.

ADF pipeline that uses web activities to get secrets from AKV, get an AAD auth token, and call the Power BI API to refresh a dataset. Then an Until activity and an If activity are executed.
Refresh Power BI Dataset Pipeline in Data Factory

Before you can use this pipeline, you must have:

  • an app registration in Azure AD with a secret
  • a key vault that contains the Tenant ID, Client ID of your app registration, and the secret from your app registration as separate secrets.
  • granted the data factory managed identity access to the secrets in the key vault
  • allowed service principals to use the Power BI REST APIs in the Power BI tenant settings
  • granted the service principal admin access to the workspace containing your dataset

For more information on these setup steps, read Dave’s post.

The pipeline contains several parameters that need to be populated for execution.

ADF pipeline parameters

The first seven parameters are related to the key vault. The last two are related to Power BI. You need to provide the name and version of each of the three secrets in the key vault. The KeyVaultDNSName should be https://mykeyvaultname.vault.azure.net/ (replace mykeyvaultname with the actual name of your key vault). You can get your Power BI workspace ID and dataset ID from the url when you navigate to your dataset settings.

The “Get TenantId from AKV” activity retrieves the tenant ID from the key vault. The “Get ClientId from AKV” retrieves the Client ID from the key vault. The “Get Secret from AKV” activity retrieves the app registration secret from the key vault. Once all three of these activities have completed, Data Factory executes the “Get AAD Token” activity, which retrieves an auth token so we can make a call to the Power BI API.

One thing to note is that this pipeline relies on a specified version of each key vault secret. If you always want to use the current version, you can delete the SecretVersion_TenantID, SecretVersion_SPClientID, and SecretVersion_SPSecret parameters. Then change the expression used in the URL property in each of the three web activities.

For example, the URL to get the tenant ID is currently:

@concat(pipeline().parameters.KeyVaultDNSName,'secrets/',pipeline().parameters.SecretName_TenantId,'/',pipeline().parameters.SecretVersion_TenantId,'?api-version=7.0')

To always refer to the current version, remove the slash and the reference to the SecretVersion_TenantID parameter so it looks like this:

@concat(pipeline().parameters.KeyVaultDNSName,'secrets/',pipeline().parameters.SecretName_TenantId,'?api-version=7.0')

The “Call Dataset Refresh” activity is where we make the call to the Power BI API. It is doing a POST to https://api.powerbi.com/v1.0/myorg/groups/{groupId}/datasets/{datasetId}/refreshes and passes the previously obtained auth token in the header.
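The URL and Authorization header are built with expressions along these lines (WorkspaceId and DatasetId are placeholder parameter names; check the actual parameter names in your pipeline):

URL: @concat('https://api.powerbi.com/v1.0/myorg/groups/',pipeline().parameters.WorkspaceId,'/datasets/',pipeline().parameters.DatasetId,'/refreshes')
Method: POST
Header: Authorization = @concat('Bearer ',activity('Get AAD Token').output.access_token)
Body: {} (the web activity requires a body for a POST, but the refresh API doesn’t need anything in it)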

This is where the original pipeline ends and my additions begin.

Getting the Refresh Status

When you call the Power BI API to execute the data refresh, it is an asynchronous call. This means that the ADF activity will show success if the call is made successfully rather than waiting for the refresh to complete successfully.

We have to add a polling pattern to periodically check on the status of the refresh until it is complete.

We start with an until activity. In the settings of the until loop, we set the expression so that the loop executes until the RefreshStatus variable is not equal to “Unknown”. (I added the RefreshStatus variable in my version of the pipeline with a default value of “Unknown”.) When a dataset is refreshing, “Unknown” is the status returned until it completes or fails.
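The expression in the Until activity settings looks something like this:

@not(equals(variables('RefreshStatus'),'Unknown'))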

ADF Until activity settings

Inside of the “Until Refresh Complete” activity are three inner activities.

ADF Until activity contents

The “Wait1” activity gives the dataset refresh a chance to execute before we check the status. I have it configured to 30 seconds, but you can change that to suit your needs. Next we get the status of the refresh.

This web activity does a GET to the same url we used to start the dataset refresh, but it adds a parameter on the end.

GET https://api.powerbi.com/v1.0/myorg/groups/{groupId}/datasets/{datasetId}/refreshes?$top={$top}

The API doesn’t accept a request ID for the newly initiated refresh, so we get the last initiated refresh by setting top equal to 1 and assume that is the refresh for which we want the status.

The API provides a JSON response containing an array called value with a property called status.
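Here is an abbreviated example of the response for a refresh that has finished (the values are made up; while the refresh is still running, status is “Unknown” and there is no endTime):

{
  "value": [
    {
      "refreshType": "ViaApi",
      "startTime": "2020-04-01T14:30:00.000Z",
      "endTime": "2020-04-01T14:34:10.000Z",
      "status": "Completed"
    }
  ]
}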

In the “Set RefreshStatus” activity, we retrieve the status value from the previous activity and set the value of the RefreshStatus variable to that value.

Setting the value of the RefreshStatus variable in the ADF pipeline

We want the status value in the first object in the value array.
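Assuming the web activity that retrieves the status is named “Get Refresh Status”, the expression looks like this (adjust the activity name to match your pipeline):

@activity('Get Refresh Status').output.value[0].status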

The until activity then checks the value of the RefreshStatus variable. If your dataset refresh is complete, it will have a status of “Completed”. If it failed, the status returned will be “Failed”.

The If activity checks the refresh status.

If activity expression in the ADF pipeline

If the refresh status is “Completed”, the pipeline execution is finished. If the refresh status isn’t “Completed”, then we can assume the refresh has failed. If the dataset refresh fails, we want the pipeline to fail.
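The If condition boils down to an expression like this:

@equals(variables('RefreshStatus'),'Completed')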

There isn’t a built-in way to cause the pipeline to fail, so we use a web activity to throw a bad request.

We do a POST to an invalid URL. This causes the activity to fail, which then causes the pipeline to fail.

Since this pipeline has no dependencies on datasets or linked services, you can just grab my code from GitHub and use it in your data factory.

Azure, Azure Data Factory, Logic Apps, Microsoft Technologies

Using Logic Apps in a Data Factory Execution Framework – Part 1

Data Factory allows parameterization in many parts of our solutions. We can parameterize things such as connection information in linked services as well as blob storage containers and files in datasets. We can also parameterize certain properties in activities. For instance, we can write an expression to determine the stored procedure to be executed in a Stored Procedure Activity or the filename in the sink (destination) of a Copy Activity.

But we cannot parameterize the invoked pipeline in an Execute Pipeline Activity. This means we need to find workarounds in order to have a metadata-driven execution framework. What I mean by metadata-driven execution framework is that data is stored in a datastore (in my case, a SQL Database) and used to determine what pipelines and activities get executed. With this type of framework, if I don’t want a specific pipeline to execute, I would just update my data in the datastore rather than delete the pipeline execution from the parent pipeline. We’ve been doing this type of development in SSIS for years, and Biml has played a big part in that. But SSIS allows us to parameterize the Execute Package Task.

Since we can’t implement this parameterized execution of pipelines natively, we need to look for something that Data Factory can call to accomplish the task. Paul Andrew has a nice framework that uses Azure Functions. I was working on a Data Factory solution for a client who doesn’t have C# or PowerShell developers on hand to help with the ELT process, so we needed to explore a low-code solution.

While there is no Logic App activity in Data Factory, we can use a Web Activity to call the Logic App. I might have a pipeline that looks something like what is pictured below.

Data Factory pipeline that uses a Stored Procedure to capture the start of the pipeline, a Lookup to get the list of files to be copied, a ForEach loop to copy each of the files, and a Stored Procedure to mark the end of the pipeline.
Staging pipeline that copies files from Azure Data Lake Storage to Azure SQL Database

Within the ForEach loop is a single Web Activity.

Data Factory Pipeline Web Activity calling a Logic App. An expression populates the URL, and a GET method is used.
Web Activity that calls a Logic App

I used some variables and parameters in an expression to populate the URL so it would be dynamic. I used a GET method in the call.
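The URL expression looked roughly like the sketch below. The region, workflow ID, and signature placeholders come from the Logic App’s HTTP trigger URL, and the path segments correspond to the trigger’s relative path parameters described a bit further down; your parameter and column names will differ.

@concat('https://prod-00.eastus2.logic.azure.com/workflows/<workflow id>/triggers/manual/paths/invoke/pipeline/',pipeline().parameters.PipelineName,'/country/',item().Country,'/datasourceid/',item().DataSourceId,'?api-version=2016-10-01&sp=%2Ftriggers%2Fmanual%2Frun&sv=1.0&sig=<signature>')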

My initial version of my Logic App is shown below.

Logic App workflow with an HTTP request trigger. 1) Create a pipeline run. 2) Initialize Variable. 3) Until loop. 4) HTTP Response.
Logic App that executes a Data Factory pipeline and waits for it to complete before returning a response

I added path parameters in my HTTP request trigger to allow me to capture the information I need to execute the appropriate pipeline. For me this included the pipeline name, a data source ID, and a country. Your parameters would vary according to your requirements.

HTTP Request trigger in a logic app with 3 path parameters: pipeline, country, Data Source ID
HTTP Request trigger in my Logic App

Logic Apps has an action called “Create a pipeline run”. You tell it which data factory, which pipeline, and any parameter values needed for the pipeline execution.

Create a pipeline run action in a logic app. Data Factory Pipeline Name is populated by a parameter. The pipeline parameters are populated by a mix of static JSON and parameters.
Create a pipeline run action in my Logic App

At this point in the workflow, our pipeline would be executing. But now we need to know when it has finished. That’s what the Initialize Variable and Until Loop actions are handling. I created a string variable called PipelineStatus and set the default value to “InProgress”. My Until loop action checks my pipeline execution status. If it’s still running, it waits 5 seconds, gets the new status, and assigns that status to the variable. This repeats until the pipeline execution is no longer in progress.

Here’s the expression from my Until loop that determines when to stop looping (it evaluates to true once the pipeline run is no longer in progress or queued):

@and(not(equals(variables('PipelineStatus'), 'InProgress')),
not(equals(variables('PipelineStatus'), 'Queued')))
Until loop in a logic app. Checks status of pipeline run. 1) Delay action. 2) Get a pipeline run. 3) Set variable.
Until loop in my Logic App to dynamically execute a Data Factory pipeline

Once the pipeline execution is complete, an HTTP response with the pipeline status is sent back to the caller.

HTTP Response action with status code 200 and pipeline status value in the body.
HTTP Response action in my Logic App

This is all great until you find out that Logic Apps will experience an HTTP timeout if the request takes more than 2 minutes.

Do you have any pipelines that take longer than two minutes to execute? If so, you need to change your solution to handle this. Note that you would have the same issue with Azure Functions, although it would give you 230 seconds instead of 120 seconds before it timed out. We need to switch to an asynchronous call to support long running pipelines. Paul has already done this in his framework using Azure Functions. In Logic Apps, we can change our response to an asynchronous response and then implement a polling pattern to check the status. We could alternatively implement a webhook action. I’ll write about updating the solution to handle long running pipelines in a future post.

Accessibility, Azure, Conferences, Microsoft Technologies, SQL Saturday

I Presented with Live Captioning and Sign Language Interpreters

I had the pleasure of presenting a full-day pre-conference session on the Friday before SQLSaturday Austin-BI last weekend. I could spend paragraphs telling you how enjoyable and friendly and inclusive the event was. But I’d like to focus on one really cool aspect of my speaking experience: I had both live captioning and sign language interpreters in my pre-con session.

First, let’s talk about the captions. While PowerPoint does have live captions/subtitles, that only works when you are using PowerPoint. When you show a demo or go to a web page, taking PowerPoint off the screen, you lose that ability. So we had a special setup provided by Shawn Weisfeld (Twitter|GitHub).

How the Live Captions Worked

A presenter uses a lavalier mic that sends audio to Epiphan Pearl. The presenter's computer sends video to Epiphan Pearl. Epiphan Pearl sends audio to a computer that sends audio to Azure and receives captions. The computer overlays the captions above the images from the presenter's laptop. That is all sent to the projector.
Technology setup at SQLSaturday Austin-BI Edition 2020 that provided live captions

The presenter connects their laptop to the Epiphan Pearl with an HDMI cable so they can send the video (picture) from the laptop. The speaker wears a lavalier microphone, which sends audio to the Pearl. The transcription green screen computer takes audio from the Pearl, sends it to Azure to be transcribed using Cognitive Services, and overlays the returned transcription text on a green screen input that is sent back to the Pearl. The projector gets the combined output of the transcription text and the presenter’s computer video output.

You can see an example of what it looked like from my presentation on Saturday in the tweet below. There are lots more pictures of it on Twitter with the #SqlSatAustinBI hashtag.

While this setup requires a bit more hardware, it worked so well! It took about 10 minutes to get it set up in the morning. As the speaker, I didn’t have to do anything but wear a mic. It transcribed everything I said regardless of what program my laptop was showing. There was very little lag. It seemed to be less than one second between when I would say something and when we would see it on the screen. While I try to speak clearly and slowly, sometimes I slip and fall back into speaking quickly. But the transcription kept up well. Some attendees said it was great to have the captions up on the screen to help them understand what I said when I occasionally spoke too quickly. The captions are placed at the top of the screen, above the image coming from my laptop, so I didn’t have to adjust my slides or anything to allow space for the captions.

The live captions were a big success. They helped not only people who had trouble hearing, but also those who spoke English as a second language and those who weren’t familiar with some of the terms I used and needed to see them spelled out.

Presenting With Sign Language Interpreters

This was my first time presenting with sign language interpreters to help communicate with my audience. Since the pre-con session lasted multiple hours, there were two interpreters in my room. They would switch places about every hour. They were kind enough to answer a few questions for me during breaks.

I asked them if it was difficult to sign all the technical terminology used and if they tried to study up on terms ahead of time. One of them told me that they don’t study the subject and they fingerspell all the technical terms. Most of my terms were spelled on my slides, and I saw the interpreter look at the slide to get the spelling. When someone asked a question about the font I was using, the interpreter asked me to spell it out, since it wasn’t written anywhere. I asked if having printed slides helped (I provided PDFs of the slides to the attendees at the beginning of the session). One of the interpreters told me no, because they were already watching the signer for questions and watching my slides and listening to me.

What I loved most about having the interpreters there was that the person using the service got to fully participate in the session. They asked questions and made comments like anyone else. And they participated in hands-on small group activities.

Check out this great photo of one of the interpreters in action during a small group activity.

5 people sit in a group at a table while a sign language interpreter sits across the table and helps the group communicate
Photo of small group activities during my Power BI pre-con with a sign language interpreter in the group. Photo by Angela Tidwell

Having ASL interpreters didn’t require any extra effort on my part. I didn’t have to practice with them beforehand or provide them with any of my conference materials. They were great professionals and were able to keep up with me through lecture, demos, small group exercises, and Q&A.

Sign language interpreters cost money. And they should – they provide a valuable service. In this case, the interpreters were provided by the State of Texas because the person using the service worked for the state government. Because this was training for their job, the person’s employer was obligated to provide this service. So we were lucky that it didn’t cost us anything.

While the SQLSaturday organizers were coordinating the ASL interpreters, they found out that there is a fund in Texas that can help with accessibility services when a person’s employer doesn’t/can’t provide them. It may not be the same in every state, but it’s definitely something to look into if you need to pay for interpreters for an event like this.

Make Your Next Event More Accessible

I have organized events, and I understand the effort that it requires. I’m so happy that Angela and Mike made the effort to make SQLSaturday Austin-BI a more inclusive event. I would like to challenge you to do the same for the next event you organize or the next presentation you give at a tech conference.

Your conference may not be able to afford the Epiphan Pearl (note: the original model we used is discontinued, but there is a new model) and the Azure costs. I’d like to see SQLSaturdays join together and purchase equipment and share across events – it would be great if PASS would help with this. Or maybe a company involved in the community could sponsor them? If we can’t do that, we could always start small with the built-in capabilities in PowerPoint and work our way up from there.

It was a great experience as a speaker and as an audience member to have the live captions. And I was so happy that someone wanted to attend my session and was making the effort to sign up and request the ASL interpreters. I hope we see more of that in the future. But we need to do our part to let people know that we welcome that and we will work to make it happen.

Azure, Azure Data Factory, Microsoft Technologies

Parameterizing a REST API Linked Service in Data Factory

We can now pass dynamic values to linked services at run time in Data Factory. This enables us to do things like connecting to different databases on the same server using one linked service. Some linked services in Azure Data Factory can be parameterized through the UI. Others require that you modify the JSON to achieve your goal.

Recently, I needed to parameterize a Data Factory linked service pointing to a REST API. At this time, REST APIs require you to modify the JSON yourself.

In order to pass dynamic values to a linked service, we need to parameterize the linked service, the dataset, and the activity.

I have a pipeline where I log the pipeline start to a database with a stored procedure, look up a username in Key Vault, copy data from a REST API to data lake storage, and log the end of the pipeline with a stored procedure. My username and password are stored in separate secrets in Key Vault, so I had to do a lookup with a web activity to get the username. The password is retrieved using Key Vault inside the linked service. Data Factory doesn’t currently support retrieving the username from Key Vault, so I had to roll my own Key Vault lookup there.

Data Factory pipeline containing a stored procedure, web activity, copy activity, and stored procedure
Pipeline with a parameterized copy activity

I have parameterized my linked service that points to the source of the data I am copying. My linked service has 3 parameters: BaseUrl, Username, and SecretName. The JSON for my linked service is below. You can see that I need to reference the parameter as the value for the appropriate property and also define the parameter at the bottom.

{
    "name": "LS_RESTSourceParam",
    "properties": {
        "annotations": [],
        "type": "RestService",
        "typeProperties": {
            "url": "@{linkedService().BaseUrl}",
            "enableServerCertificateValidation": true,
            "authenticationType": "Basic",
            "userName": "@{linkedService().Username}",
            "password": {
                "type": "AzureKeyVaultSecret",
                "store": {
                    "referenceName": "MyKeyVault",
                    "type": "LinkedServiceReference"
                },
            "secretName": "@{linkedService().SecretName}"
            }
        },
        "parameters": {
            "Username": {
                "type": "String"
            },
            "SecretName": {
                "type": "String"
            },
            "BaseUrl": {
                "type": "String"
            }
        }
    }
}

I have defined these three parameters in my dataset, along with one more parameter that is specific to the dataset (that doesn’t get passed to the linked service). I don’t need to set the default value on the Parameters tab of the dataset.

4 parameters defined in a data factory dataset: relativeURL, username, secret, and baseURL.
Parameters defined in the dataset

On the Connection tab of the dataset, I set the value as shown below. We can see that Data Factory recognizes that I have 3 parameters on the linked service being used. The relativeURL is only used in the dataset and is not used in the linked service. The value of each of these properties must match the parameter name on the Parameters tab of the dataset.

Connection tab of the dataset in data factory, showing 3 linked service properties and one additional dataset property.
Setting the properties on the Connection tab of the dataset
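For reference, the resulting dataset JSON looks something like the sketch below (the dataset name is made up, and your parameter names may differ):

{
    "name": "DS_RESTSourceParam",
    "properties": {
        "linkedServiceName": {
            "referenceName": "LS_RESTSourceParam",
            "type": "LinkedServiceReference",
            "parameters": {
                "BaseUrl": { "value": "@dataset().baseURL", "type": "Expression" },
                "Username": { "value": "@dataset().username", "type": "Expression" },
                "SecretName": { "value": "@dataset().secret", "type": "Expression" }
            }
        },
        "parameters": {
            "relativeURL": { "type": "String" },
            "username": { "type": "String" },
            "secret": { "type": "String" },
            "baseURL": { "type": "String" }
        },
        "annotations": [],
        "type": "RestResource",
        "typeProperties": {
            "relativeUrl": {
                "value": "@dataset().relativeURL",
                "type": "Expression"
            }
        }
    }
}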

In my copy activity, I can see my 4 dataset parameters on the Source tab. There, I can write expressions to provide the values that should be passed through to the dataset, 3 of which are passed through to the linked service. In my case, this is a child pipeline that is called from a parent pipeline that passes in some values through pipeline parameters which are used in the expressions in the copy activity source.

The Source tab of the copy activity. It uses the parameterized dataset and contains expressions to set the values of the parameters.
Defining the expressions for the dataset properties on the copy activity source

And that’s it. I can run my pipeline and have it call different REST APIs using one linked service and one dataset.

Azure, Azure SQL DB, Microsoft Technologies, SQL Server

New Centralized View of SQL Resources in Azure

Yesterday some new views were made available in the Azure portal that will be helpful to those of us who create or manage Azure SQL resources.

First, a new guided approach to creating resources has been added to the Azure portal. We now have a unified experience to create Azure SQL resources that offers guidance as to the type of Azure SQL resource you need for your use case: SQL database, managed instance, or SQL Server virtual machine. This new Azure SQL blade under Marketplace offers a high-level description of each offering and the scenario that it best serves. If you already know what you want but are having trouble remembering exactly what the resource is called in the marketplace, this can also alleviate that issue.

New guided experience for creating SQL Azure resources in the Azure Portal
New guided experience for creating SQL Azure resources in the Azure Portal

Notice that SQL virtual machine images are a listed offering in the new experience. As Microsoft phrased it, “SQL Server on Azure VMs is now a first-class member of the Azure SQL family.” This blade gives you an easily accessible place to see all the SQL Server VM images without having to search through lots of other unrelated VM images.

A drop-down box listing all of the available SQL Server VM images
A drop-down box listing all of the available SQL Server VM images

Once your Azure SQL resources are created, you can use the new centralized management hub to administer them. Locate the Azure SQL resources blade to see a list of all of your single databases, database servers, elastic pools, managed instances, and virtual machines running SQL.

Centralized management hub for Azure SQL resources
Centralized management hub for Azure SQL resources

This is the foundation for a unified database platform in Azure with more consistency across offerings and more manageability features to come in the future. For more information, read the announcement from Microsoft or watch the new video they posted on Channel 9.

Azure, Microsoft Technologies, PowerShell, SQL Server

Using Azure Automation to Shut Down a VM only if a SQL Agent Job is Not Running

I have a client who uses MDS (Master Data Services) and SSIS (Integration Services) in an Azure VM. Since we only need to execute the SQL Agent job that runs the SSIS packages infrequently, we shut down the VM when it is not in use in order to save costs. We wanted to make sure that the Azure VM did not shut down when a specific SQL Agent job was still running, so I tackled this with some PowerShell runbooks in Azure Automation.

I split the job into two parts. The first runbook simply checks if a specified SQL Agent job is running and returns a text value that indicates whether it is running. A parent runbook checks if the VM is started. If the VM is started, it calls the child runbook to check if the job is running, and then shuts down the VM if the job is not running.

It’s fairly easy and convenient to have nested PowerShell runbooks in Azure Automation. There are two main ways to call a child runbook.

  1. Inline execution
  2. Using the Start-AzureRmAutomationRunbook cmdlet

It was less obvious to me how to call a child runbook when the parent runs in Azure and the child runs on a hybrid worker, especially when you need to use the output from the child runbook in the parent. A hybrid runbook worker allows us to access resources that are behind a VNET or on premises.

Travis Roberts has a nice video on just this topic that gave me the answers I needed.

Below is my parent runbook.

# Ensures you do not inherit an AzureRMContext in your runbook
Disable-AzureRmContextAutosave -Scope Process

$connection = Get-AutomationConnection -Name AzureRunAsConnection
Connect-AzureRmAccount -ServicePrincipal -Tenant $connection.TenantID `
-ApplicationID $connection.ApplicationID -CertificateThumbprint $connection.CertificateThumbprint

$rgName='MyResourceGroup'
$vmName='MyVM'
$SubID = 'XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX'

$AzureContext = Select-AzureRmSubscription -SubscriptionId $SubID
'Check if VM is on'
$vm=((Get-AzureRmVM -ResourceGroupName $rgName -AzureRmContext $AzureContext -Name $vmName -Status).Statuses[1]).Code
 $vm 
 if ($vm -eq 'PowerState/running')
 {
    Do 
    {
        #if VM running call other runbook
        start-sleep -Seconds 60;
        'Check if job is running'
        $JobRunning = start-azureRMautomationrunbook -AutomationAccount 'ProgramsAutomation' -Name 'CheckRunningSQLJob' -ResourceGroupName $rgName -AzureRMContext $AzureContext -Runon 'Backups' -Wait;
        Write-Output $JobRunning
        
        
    } Until ($JobRunning -eq 'run0')
    
    'Stopping VM'
    stop-azurermvm -Name $VMname -ResourceGroupName $RgName -force
}

The runbook sets the Azure context to the appropriate subscription (especially important when you are a guest user in someone else’s tenant). Then it checks if the VM is started. If it is, it goes into a do-while loop. This task isn’t super time sensitive (it’s just to save money when the VM isn’t in use), so it’s waiting 60 seconds and then calling the child runbook to find out if my SQL Agent job is running. This makes sure that the child runbook is called at least once. If the result is that the job is not running, it stops the VM. If the job is running, the loop starts over, waiting 60 seconds before checking again. This loop is essentially polling the job status until it sees that the job is completed. One thing to note is the -Wait parameter on the end of that Start-AzureRmAutomationRunbook command. If you don’t specify the -Wait parameter, the command will immediately return a job object. If you specify the -Wait parameter, it waits for that child job to complete and returns the results of that job.

And here is my child runbook.

[OutputType([string])]
param()

$SQLJobName = 'MySQLAgentJobName'
$SQLInstanceName = 'MySQLServer'

$cred=Get-AutomationPSCredential -Name 'mycredential'
 
$server = Connect-DbaInstance -SqlInstance $SQLInstanceName -SqlCredential $cred
 

$JobStatus = (Get-DbaRunningJob -SqlInstance $server).Name -match $SQLJobName

If ($JobStatus -ne $false) 
{
#job is running. Passing back a string because bits and ints were causing issues.
    $JobRunning = 'run1'
    Write-Output $JobRunning 
}
else 
{ 
#job is idle
    $JobRunning = 'run0'
    Write-Output $JobRunning 
}

I’m using dbatools to check if the job is running on the server. That is the Get-DbaRunningJob command. The important part to note is that you have to use the Write-Output command for this output to be available to the parent runbook. I got some weird results when I tried to return an int or a boolean (it was returning an object rather than a single value), so I just went with a string. The string, while not the most efficient, works just fine. If you understand why this is, feel free to leave me a comment.

These runbooks have been in place for a couple of months now, and they are working great to shut down the VM to save money while making sure not to disturb an important SQL Agent job that might occasionally run late. I didn’t find much documentation nor many examples of using output from a child job that runs on a hybrid worker, so I wanted to get this published to help others that go searching.

Azure, Azure SQL DW, Data Warehousing, Microsoft Technologies, T-SQL

What You Need to Know About Data Classifications in Azure SQL Data Warehouse

Data classifications in Azure SQL DW entered public preview in March 2019. They allow you to label columns in your data warehouse with their information type and sensitivity level. There are built-in classifications, but you can also add custom classifications. This could be an important feature for auditing your storage and use of sensitive data as well as compliance with data regulations such as GDPR. You can export a report of all labeled columns, and you can see who is querying sensitive columns in your audit logs. The Azure Portal will even recommend classifications based upon your column names and data types. You can add the recommended classifications with a simple click of a button.

You can add data classifications in the Azure Portal or via T-SQL or PowerShell. Data classifications are database objects.

ADD SENSITIVITY CLASSIFICATION TO
    dbo.DimCustomer.Phone
    WITH (LABEL='Confidential', INFORMATION_TYPE='Contact Info')

To view existing data classifications, you can query the sys.sensitivity_classifications view or look in the Azure Portal.

SELECT
sys.all_objects.name as [TableName], 
sys.all_columns.name as [ColumnName],
[Label], 
[Information_Type]
FROM sys.sensitivity_classifications
left join sys.all_objects on sys.sensitivity_classifications.major_id = sys.all_objects.object_id
left join sys.all_columns on sys.sensitivity_classifications.major_id = sys.all_columns.object_id
    and sys.sensitivity_classifications.minor_id = sys.all_columns.column_id

Be Careful When Loading With CTAS and Rename

One issue that is specific to using data classifications in Azure SQL DW is that it is possible to inadvertently drop your classifications when you are loading your tables using the recommended T-SQL load pattern. Typically, when using T-SQL to load a dimensional model in Azure SQL DW, we perform the following steps:

  1. Create an upsert table via CTAS with the results of a union of new data from a staging table with existing data from the dimension table
  2. Rename the dimension table to something like Dimension_OLD
  3. Rename the upsert table to Dimension
  4. Drop the Dimension_OLD table
Animation of a table load process in Azure SQL DW


In the animation above, you’ll first see the load process as described, and then it will replay with sensitivity labels added to the dimension table. You’ll see that they are dropped when we drop the old dimension table. This makes sense because sensitivity classifications are objects related to that table. We would expect an index to be dropped when we drop the related table. This works the same way.

Check out my SQL notebook for a demonstration of the issue as well as my workaround that I describe below. If you spin up an Azure SQL Data Warehouse with the sample database, you can run this notebook from Azure Data Studio and see the results for yourself.

There are a few complicating factors:

  • There are currently no visual indicators of sensitivity classifications in SSMS or Azure Data Studio.
  • ETL developers may not have access to the data warehouse in the Azure Portal to see the sensitivity classifications there.
  • The entire process of adding and managing sensitivity classifications may be invisible to an ETL developer. A data modeler or business analyst might be the person adding and managing the sensitivity classifications. If the ETL developer isn’t aware classifications have been added, they won’t know to go and look for them in the sys.sensitivity_classifications view.
  • SSDT does not yet support sensitivity classifications. The only way I have found to add them into the database project is as a post-deployment script with the build property set to none.

The good news is that you can add the sensitivity classifications back to your dimension table using T-SQL. The bad news is still that the ETL developer must remember to do it. My workaround for now is a stored procedure that will do the rename and drop of the tables plus copy the sensitivity classifications over. My hope is that it’s easier to remember to use it since it will do the rename and drop for you as well.

Update: Someone asked about the name SwapWithMetadata and why it doesn’t specifically mention sensitivity classifications. I didn’t mention classifications because there are other things that need this same treatment. Dynamic data masking will also need to be reapplied. With dynamic data masking, it will be even more important to add it back immediately after swapping the tables rather than waiting for a full data load of all selected tables to finish and adding all classifications back. If your load takes a long time or the process fails on another table, you don’t want your data exposed without a mask to users who shouldn’t see the full information.

CREATE PROC SwapWithMetadata
@SrcSchema NVARCHAR(128),
@SrcTable NVARCHAR(128),
@DestSchema NVARCHAR(128),
@DestTable NVARCHAR(128),
@TransferMetadata BIT,
@DropOldTable BIT
AS
BEGIN
SET NOCOUNT ON
BEGIN TRY
--Check if destination table exists
DECLARE @DestSchemaQualifiedTableName NVARCHAR(257)
SET @DestSchemaQualifiedTableName = @DestSchema + '.' + @DestTable
IF OBJECT_ID(@DestSchemaQualifiedTableName) IS NULL
BEGIN
DECLARE @DestErr NVARCHAR(MAX)
SET @DestErr = 'Table ' + @DestSchemaQualifiedTableName + ' not found'
RAISERROR(@DestErr, 15, 1)
END
--Check if source table exists
DECLARE @SrcSchemaQualifiedTableName NVARCHAR(257)
SET @SrcSchemaQualifiedTableName = @SrcSchema + '.' + @SrcTable
IF OBJECT_ID(@SrcSchemaQualifiedTableName) IS NULL
BEGIN
DECLARE @SrcErr NVARCHAR(MAX)
SET @SrcErr = 'Table ' + @SrcSchemaQualifiedTableName + ' not found'
RAISERROR(@SrcErr, 15, 1)
END
--Move destination table to destination_old. Move source table to destination
DECLARE @RenameSql NVARCHAR(MAX)
SET @RenameSql = 'RENAME OBJECT ' + @DestSchemaQualifiedTableName + ' TO ' + @DestTable + '_old; '
Set @RenameSql = @RenameSql + ' RENAME OBJECT ' + @SrcSchemaQualifiedTableName + ' TO ' + @DestTable
PRINT 'Executing ' + @RenameSql + ' …'
EXEC sp_executesql @RenameSql;
--drop temp table if it exists
IF OBJECT_ID('tempDB..#tempApplySensitivityClassificationsToTable') IS NOT NULL
DROP TABLE #tempApplySensitivityClassificationsToTable;
--check if we should transfer data classifications from old to new table
IF ISNULL(@TransferMetadata,0) = 1
BEGIN
--put current classifications in a temp table
DECLARE @OldTable NVARCHAR(128) = @DestTable + '_old';
WITH CurrentClassifications as (
SELECT
CAST(SCHEMA_NAME(sys.all_objects.schema_id) as NVARCHAR(128)) [Schema],
CAST(sys.all_objects.name as NVARCHAR(128)) [Table],
CAST(sys.all_columns.name as NVARCHAR(128)) [Column],
CAST([Information_Type] as NVARCHAR(128)) [Informationtype],
CAST([Label] as NVARCHAR(128)) [Label]
FROM
sys.sensitivity_classifications
LEFT OUTER JOIN sys.all_objects on sys.sensitivity_classifications.major_id = sys.all_objects.object_id
LEFT OUTER JOIN sys.all_columns on sys.sensitivity_classifications.major_id = sys.all_columns.object_id
and sys.sensitivity_classifications.minor_id = sys.all_columns.column_id
)
SELECT ROW_NUMBER() OVER (ORDER BY [Schema], [Table], [Column]) [ID],
[Schema], [Table], [Column], [Informationtype], [Label]
INTO #tempApplySensitivityClassificationsToTable
FROM CurrentClassifications
WHERE [Schema] = @DestSchema AND [Table] = @OldTable;
DECLARE @i INT
SET @i = 1
DECLARE @Max INT
SELECT @Max = COUNT(*)
FROM #tempApplySensitivityClassificationsToTable;
PRINT 'Transferring ' + CAST(@Max as VARCHAR(4)) + ' classifications'
--drop and recreate sensitivity classifications
DECLARE @Sql NVARCHAR(MAX)
DECLARE @Col NVARCHAR(128)
DECLARE @InfoType NVARCHAR(128)
DECLARE @Label NVARCHAR(128)
WHILE @i <= @Max
BEGIN
SELECT @Col = [Column], @InfoType = [InformationType], @Label = [Label]
FROM #tempApplySensitivityClassificationsToTable
WHERE Id = @i
SET @Sql = 'DROP SENSITIVITY CLASSIFICATION FROM ' + @DestSchemaQualifiedTableName + '.' + @Col
PRINT 'Executing ' + @Sql + '…'
EXEC sp_executesql @Sql
SET @Sql = 'ADD SENSITIVITY CLASSIFICATION TO ' + @DestSchemaQualifiedTableName + '.' + @Col
IF (@InfoType IS NOT NULL AND @Label IS NOT NULL)
BEGIN
SET @Sql = @Sql + ' WITH (LABEL=''' + @Label + ''', INFORMATION_TYPE=''' + @InfoType + ''')'
END
ELSE IF (@InfoType IS NOT NULL)
BEGIN
SET @Sql = @Sql + ' WITH (INFORMATION_TYPE=''' + @InfoType + ''')'
END
ELSE IF (@Label IS NOT NULL)
BEGIN
SET @Sql = @Sql + ' WITH (LABEL=''' + @Label + ''')'
END
ELSE
BEGIN
SET @Sql = NULL
END
IF (@Sql IS NOT NULL)
BEGIN
PRINT 'Executing ' + @Sql + '…'
EXEC sp_executesql @Sql
END
SET @i = @i + 1
END
END
IF ISNULL(@DropOldTable,0) = 1
BEGIN
DECLARE @DropSql NVARCHAR(MAX)
SET @DropSql = 'DROP TABLE ' + @DestSchemaQualifiedTableName + '_old;'
PRINT 'Executing ' + @DropSql + '…'
EXEC sp_executesql @DropSql;
END
END TRY
BEGIN CATCH
Print 'ERROR… Procedure: ' + ERROR_PROCEDURE() + ' Message: ' + ERROR_MESSAGE()
END CATCH
END
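Calling the procedure at the end of a table load looks like this (the table names here are just examples):

EXEC SwapWithMetadata
    @SrcSchema = 'dbo',
    @SrcTable = 'DimCustomer_Upsert',
    @DestSchema = 'dbo',
    @DestTable = 'DimCustomer',
    @TransferMetadata = 1,
    @DropOldTable = 1;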

Eventually, the tools will be updated to provide more visibility to data sensitivity classifications, but we still need to make sure they don’t get dropped.

For now, my recommendation is if you are going to go in and add a lot of sensitivity classifications, that you create a user defined restore point immediately after so that you know you have them in a backup somewhere. Azure SQL DW doesn’t do point-in-time restores the way Azure SQL DB does. It takes automatic restore points every 8 hours or so. So if someone went through the trouble of adding the sensitivity classifications and they were dropped through the data load process, there is no guarantee that you could use a backup to get them back.

Vote for My Enhancement Idea

If you would like Microsoft to add something to the product to keep sensitivity classifications from being dropped, or at least make it easier to add them back, please vote for my idea.

Not an Issue with Other Data Load Methods

Please note that if you are using other tools or methods to load your tables where you don’t swap them out, you won’t have the issue of dropping your sensitivity classifications. But I wanted to bring up this issue because I can see people spending a lot of time adding them and then suddenly losing them, and I want everyone to avoid that frustration.

Give Data Classifications a Try

I think data classifications are a good addition to SQL DW. Anything that helps us efficiently catalog and manage our sensitive data is good. I have added them in my demo environment and hope to use them in a client environment soon.

Have you tried out data classifications in SQL DW or DB? What do you think so far? If not, what is keeping you from using them?

Azure, Azure Data Factory, Microsoft Technologies

There is Now A Delete Activity in Data Factory V2!

Data Factory can be a great tool for cloud and hybrid data integration. But since its inception, it was less than straightforward to move data (copy it to another location and then delete the original copy).

It is a common practice to load data to blob storage or data lake storage before loading to a database, especially if your data is coming from outside of Azure. We often create a staging area in our data lakes to hold data until it has been loaded to its next destination. Then we delete the data in the staging area once our subsequent load is successful. But before February 2019, there was no Delete activity. We had to write an Azure Function or use a Logic App called by a Web Activity in order to delete a file. I imagine every person who started working with Data Factory had to go and look this up.

But now Data Factory V2 has a Delete activity.

Data Factory V2 Copy activity followed by a Delete activity

How the Delete Activity Works

The Delete activity can delete from the following data stores:

  • Azure blob storage
  • ADLS Gen 1
  • ADLS Gen 2
  • File systems
  • FTP
  • SFTP
  • Amazon S3

You can delete files or folders. You can also specify whether you want to delete recursively (delete including all subfolders of the specified folder). If you want to delete files/folders from a file system on a private network (e.g., on premises), you will need to use a self-hosted integration runtime running version 3.14 or higher. Data Factory will need write access to your data store in order to perform the delete.

You can log the deleted file names as part of the Delete activity. It requires you to provide a blob storage or ADLS Gen 1 or 2 account as a place to write the logs.

You can parameterize the following properties in the Delete activity itself:

  • Timeout
  • Retry
  • Delete file recursively
  • Max concurrent connections
  • Enable Logging
  • Logging folder path

You can also parameterize your dataset as usual.

All that’s required in the Delete activity is an activity name and dataset. The other properties are optional. Just be sure you have specified the appropriate file path. Maybe try this out in dev before you accidentally delete your way through prod.

To delete all contents of a folder (including subfolders), specify the folder path in your dataset and leave the file name blank, then check the box for “Delete file recursively”.

You can use a wildcard (*) to specify files, but it cannot be used for folders.
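If you’re curious what this looks like in the pipeline JSON, here is a rough sketch of a Delete activity (the activity, dataset, and linked service names are made up, and I’ve only included a few of the optional properties):

{
    "name": "Delete Staged Files",
    "type": "Delete",
    "typeProperties": {
        "dataset": {
            "referenceName": "StagingFolder",
            "type": "DatasetReference"
        },
        "recursive": true,
        "enableLogging": true,
        "logStorageSettings": {
            "linkedServiceName": {
                "referenceName": "LogStorageAccount",
                "type": "LinkedServiceReference"
            },
            "path": "deleteactivitylogs"
        }
    }
}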

Here’s to much more efficient development of data movement pipelines in Azure Data Factory in V2.

Azure, Azure Data Factory, Microsoft Technologies, Power BI

How Many Data Gateways Does My Azure BI Architecture Need?

Computer with lock protecting data

It’s not always obvious when you need a data gateway in Azure, and not all gateways are labeled as such. So I thought I would walk through various applications that act as a data gateway and discuss when, where, and how many are needed.

Note: I’m ignoring VPN gateways and application gateways for the rest of this post. I’m assuming your networking/VPN situation is fixed at this point and working from there.

Let’s start with what services may require you to use a data gateway.

You will need a data gateway when you are using Power BI, Azure Analysis Services, PowerApps, Microsoft Flow, Azure Logic Apps, Azure Data Factory, or Azure ML with a data source/destination that is in a private network. Note that a private network includes on-premises data sources and Azure Virtual Machines as well as Azure SQL Databases and Azure SQL Data Warehouses that require use of VNet service endpoints rather than public endpoints.  

Luckily, many of these services can use the same data gateway. Power BI, Azure Analysis Services, PowerApps, Microsoft Flow, and Logic Apps all use the On Premises Data Gateway. Azure Data Factory (V1 and V2) and Azure Machine Learning Studio use the Data Factory Self-Hosted Integration Runtime.

On Premises Data Gateway (Power BI et al.)

If you are using one or more of the following:

  • Power BI
  • Azure Analysis Services
  • PowerApps
  • Microsoft Flow
  • Logic Apps

and you have a data source in a private network, you need at least one gateway. But there are a few considerations that might cause you to set up more gateways.

  1. Your services must be in the same region to use the same gateway. This means that your Power BI/Office 365 region and Azure region for your Azure Analysis Services resource must match for them to all use one gateway.  If you have resources in different regions, you will need one gateway per region.
  2. You may want high availability for your gateway. You can create high availability clusters so when one gateway is down, traffic is rerouted to another available gateway in the cluster.
  3. You may want to segment traffic to ensure the necessary resources for certain ad hoc live/direct queries or scheduled refreshes. If your usage and refresh patterns warrant it, you may want to set up one gateway for scheduled refreshes and one gateway for live/direct queries back to any on-premises data sources. Or you might make sure live/direct queries for two different high-traffic models go through different gateways so as not to block each other. This isn’t always warranted, but it can be a good strategy.

Data Factory Self-hosted Integration Runtime

If you are using Azure Data Factory (V1 or V2) or Azure ML with a data source in a private network, you will need at least one gateway. But that gateway is called a Self-hosted Integration Runtime (IR).

Self-hosted IRs can be shared across data factories in the same Azure Active Directory tenant. They can be associated with up to four machines to scale out or provide higher availability. So while you may only need one node, you might want a second so that your IR is not the single point of failure.

Or you may want multiple IRs to boost throughput of copy activities. For instance, copying from an on-premises file server with one IR node is about 195 Megabytes per second (MB/s). But with 4 IR nodes, it can be as fast as 505 MB/s.

Factors that Affect the Number of Data Gateways Needed

The main factors determining the number of gateways you need are:

  1. Number of data sources in private networks (including Azure VNets)
  2. Location of services in Azure and O365 (number of regions and tenants)
  3. Desire for high availability
  4. Desire for increased throughput or segmented traffic

If you are importing your data to Azure and using an Azure SQL DB with no VNet as the source for your Power BI model, you won’t need an On Premises Data Gateway. If you used Data Factory to copy your data from an on-premises SQL Server to Azure Data Lake and then Azure SQL DB, you need a Self-Hosted Integration Runtime.

If all your source data is already in Azure, and your source for Power BI or Azure Analysis Services is Azure SQL DW on a VNet, you will need at least one On-Premises Data Gateway.

If you import a lot of data to Azure every day using Data Factory, and you land that data to Azure SQL DW on a VNet, then use Azure Analysis Services as the data source for Power BI reports, you might want a self-hosted integration runtime with a few nodes and a couple of on-premises gateways clustered for high availability.

Have a Plan For Your Gateways

The gateways/integration runtimes are not hard to install. They are just often not considered, and projects get stalled waiting until a machine is provisioned to install them on. And many people forget to plan for high availability in their gateways. Make sure you have the right number of gateways and IR nodes to get your desired features and connectivity. You can add gateways/nodes later, but you don’t want to get caught with no high availability when it really matters.

Azure, Microsoft Technologies

The Necessary Extras That Aren’t Shown in Your Azure BI Architecture Diagram

When we talk about Azure architectures for data warehousing or analytics, we usually show a diagram that looks like the below.

Modern DW Architecture
Modern DW Architecture https://azure.microsoft.com/en-us/solutions/architecture/modern-data-warehouse/

This diagram is a great start to explain what services will be used in Azure to build out a solution or platform. But many times, we add the specific resource names and stop there. If you have built several projects in Azure, you will know there are some other things for which you will need to plan. So what’s missing?

Azure Active Directory

Let’s start with Azure Active Directory (AAD). In order to provision the resources in the diagram, your Azure subscription must already be associated with an Active Directory. AAD is Microsoft’s cloud-based identity and access management service. Members of an organization have a user account that can sign in to various services. AAD is used to access Office 365, Power BI, and Dynamics 365, as well as the Azure portal. It can also be used to grant access and permissions to specific Azure resources.

For instance, users who will participate in Data Factory pipeline development must be assigned to the Data Factory Contributor role in Azure. Users can authenticate via AAD to log in to Azure SQL DW. You can use AD Groups to grant permissions to users interacting with various Azure resources such as Azure SQL DW or SQL DB as well as Azure Analysis Services and Power BI. It’s more efficient to manage permissions for groups than to do it for each user. You’ll need a plan to manage group membership.

You may also need to register applications in Azure Active Directory to allow Azure services to authenticate and interact with each other. While the guidance for many services is now to use managed identities, this may not be available for every situation just yet.  

If your organization has some infrastructure on premises, it is likely that they have Active Directory on premises as well. So you will want to make sure you have a solution in place to sync your on-premises and Azure Active Directory.

Networking

Virtual Networks (or VNets) allow many types of Azure resources to securely communicate with each other, the internet, and on-premises networks. You can have multiple virtual networks in an Azure subscription. Each virtual network is isolated from other VNets, unless you set up VNet peering.

Some PaaS services such as Azure Storage Accounts, Azure SQL DW, and Azure Analysis Services support Virtual Network Service Endpoints. A common usage scenario is to set up VNets and VNet Service Endpoints to connect resources to on-premises networks. Some organizations prefer to use VNet Service Endpoints instead of public service endpoints, making it so that traffic can only access the resource from within the organization’s local network.

In order to connect a VNet to an on-premises network or another VNet (outside of peering), you’ll need a VPN Gateway. You’ll need to identify the most appropriate type of VPN Gateway: Point-to-Site, Site-to-Site, or Express Route. These offerings differ based on bandwidth, protocols supported, routing, connection resiliency, and SLAs. Pricing can vary greatly based upon your gateway type.

While VNets and VPN Gateways are probably the most common networking resources in Azure, there are many other networking services and related design decisions to consider as you plan an Azure deployment.

Data Gateways

Your BI solution may be entirely in Azure, but if you need to retrieve data from data sources in a private network (on premises or on a VNet), you’ll need a gateway. If you are using Azure Data Factory, you’ll need a Self-hosted Integration Runtime (IR). If the source for your Power BI or Azure Analysis Services model is on a private network, you’ll need an On-Premises Data Gateway. You can use the same gateway to connect to Analysis Services, Power BI, Logic Apps, Power Apps, and Flow. If you will have a mix of Analysis Services and Power BI models sharing the same gateway, you’ll need to make sure that your Power BI region and your AAS region match.

These gateways require that you have a machine inside the private network on which to install them. And if you want to scale out or have failover capabilities, you may need multiple gateways and multiple VMs. So while you may be building a solution in Azure, you might end up with a few on-premises VMs to be able to securely move source data.

Dev and Test Environments

Our nice and tidy diagram above is only showing production. We also need at least a development environment and maybe one or more test environments. You’ll need to decide how to design your dev/test environments. You may want duplicate resources in a separate resource group; e.g. a dev resource group that contains a dev Data Factory, a dev Azure SQL DW, a dev Azure Analysis Services, etc. While separating environments by resource group is common, it’s not your only option. You will need to decide if you prefer to separate environments by resource group, subscription, directory, or some combination of the three.

ARM templates and PowerShell are very useful for deploying updates and creating new environments. Also, take a look at Azure Blueprints.

You’ll also want to investigate ways to keep the costs of non-prod environments down via smaller-sized resources or pausing or deleting resources where applicable.

Plan For, Don’t Worry About, the Extras

There are several other ancillary Azure services that could/should be part of your solution.

  • For source control, GitHub and Azure Repos have the easiest integration, especially with Azure Data Factory. You’ll not only want source control for things like database projects and Data Factory pipelines, but also possibly for ARM templates and PowerShell scripts used to deploy resources (think: infrastructure as code).
  • You’ll want to set up Azure Monitoring, including alerts to let you know when processes and services are not running as intended.
  • If you want more cost management support than just setting a spending limit on a subscription (if it is available for your subscription type), it may be helpful to set budgets in Azure so you can be notified when you hit certain percentages or amounts.

Be sure to think through the entire data/solution management lifecycle. You may want to include extra services for cataloging, governing, and archiving your data.

This may sound like a complex list, but these resources and services are fairly easy to work with. Azure Active Directory has a user-friendly GUI in the portal, and commands can be automated with PowerShell, requiring relatively little effort to manage users and groups to support your BI environment. VNets and VPN Gateways are a little more complex, but there are step-by-step instructions available for many tasks in the Microsoft Docs. The Power BI Gateway and ADF IR have quick and easy GUI installers that just take a few minutes. You can automate Azure deployments with Azure Pipelines or PowerShell scripts.

None of these things are really that awful to implement or manage in Azure, unless you weren’t aware of them and your project is paused until you can get them set up.

Is there anything else you find is commonly left out when planning for BI solutions in Azure? Leave me a comment if you would like to add to the list.

Update (1/26/19): Helpful readers have commented on other aspects of your Azure BI architecture they felt were often overlooked:

  • Make sure you have a plan for how to process your Azure Analysis Services model. Will you use Azure Automation? Call an Azure Function from Data Factory?
  • Be sure to organize and tag your resources appropriately to help you understand and control costs.
  • Don’t forget Azure Key Vault. This will help you keep keys and passwords secure.

(Thanks to Chad Toney, Shannon Holck, and Santiago Cepas for these suggestions.)