Accessibility, Data Visualization, Power BI

Viridis color palettes in Power BI theme files

I am a fan of the viridis color palettes available in Python and R, so I decided to make Power BI theme files for each of the 4 color maps (viridis, inferno, magma, plasma). These color palettes are not only lovely to look at, but also colorblind/CVD friendly and perceptually uniform (or close to it).

The screenshots below show the colors you’ll get when you use my theme files.

Viridis

The Power BI color picker for data colors in a column chart. It shows white, black, and then the 8 colors from the viridis color palette, which range from dark purple to blue to green.
Viridis theme colors in Power BI

Plasma

The Power BI color picker for data colors in a column chart. It shows white, black, and then the 8 colors from the plasma color palette, which range from dark purple to pink to yellow.
Plasma theme colors in Power BI

Magma

The Power BI color picker for data colors in a column chart. It shows white, black, and then the 8 colors from the magma color palette, which range from dark purple to pink to orange.
Magma theme colors in Power BI

Inferno

The Power BI color picker for data colors in a column chart. It shows white, black, and then the 8 colors from the inferno color palette, which range from dark purple to red to yellow.
Inferno theme colors in Power BI

I generated a palette of 10 colors and then dropped the darkest and lightest colors to help you get good color contrast without inadvertently highlighting a data point. I chose the second darkest of the 8 remaining colors as the first/main color, which should work well on light backgrounds.

You’ll also notice that the theme sets the minimum, center, and maximum colors for use in a diverging color palette. The diverging palette includes the darkest and lightest colors to give you a wider scale.
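
If you’re curious what’s inside one of these files, here is a rough sketch of the structure: the dataColors array holds the 8 theme colors, and the minimum, center, and maximum properties set the diverging scale. The hex values below are approximate viridis colors for illustration only; the actual values are in the JSON files in the GitHub project.

{
  "name": "Viridis",
  "dataColors": [
    "#482878", "#3E4A89", "#31688E", "#26828E",
    "#1F9E89", "#35B779", "#6DCD59", "#B4DE2C"
  ],
  "minimum": "#440154",
  "center": "#21918C",
  "maximum": "#FDE725"
}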

Give the themes a try

If you don’t enjoy choosing colors and just want something that looks good, feel free to hop over to the GitHub project and download the JSON files. You can learn more about the method I used to choose the colors and my suggestions for usage in the project documentation.

If you do use the themes, feel free to let me know how they worked and if you have suggestions for improvements.

Microsoft Technologies, Power BI, Power Query

Calling the Intercom API with Power Query and Refreshing in the Power BI Service

I needed to pull some user data for an app that uses Intercom. While I will probably import the data using Data Factory or a function in the long term, I needed a quick way to pull the data in a refreshable manner to combine with other data already available in Power BI.

I faced two challenges in getting this code to work:

  1. Intercom’s API uses cursor-based pagination when retrieving contacts
  2. I needed this query to be refreshable in PowerBI.com so I could schedule a daily refresh.

If you have worked with the Web.Contents function in Power Query, you may be familiar with all the various ways you can use it that aren’t supported in a refresh on PowerBI.com. Chris Webb’s blog is a great source for this info, if you find yourself stuck with a query that works in Power BI Desktop but not in the service.

The Intercom API

In version 2.4 of the Intercom API, users are a type of contact. You must make an HTTP GET call to https://api.intercom.io/contacts and pass a bearer access token (in the Authorization header) and Accept: application/json in the headers.

In the query string, you can specify the number of contacts per page by specifying per_page=x where x is a number less than or equal to 150. If you have more than 150 contacts, you will need to handle the cursor-based pagination.

When you make the initial call, the API will return a JSON object that contains a total count of contacts and a record for pages. Expanding the pages record shows the current page, the number of contacts per page, and the total number of pages.

type: pages
next: Record
page: 1
per_page: 150
total_pages: 3
Data returned in the Power Query Editor when retrieving contacts from the Intercom API

Expanding the next record gives you page 2 with a starting_after ID.

The cursor used for pagination in the Intercom API as retrieved using Power Query

To get the next page of contacts, the API call would be https://api.intercom.io/contacts?per_page=150&starting_after=[the starting_after ID listed above].

The Power Query Queries

This blog post from Gil Raviv gave me some ideas about where to start, but the code in that blog will not refresh in PowerBI.com. You cannot put the iterations, the Web.Contents call, and the generated list all in one query if you want to use scheduled refresh.

I ended up creating one query and two functions to accomplish my goal. The second function is optional, but you may find it useful as the time values in the API response are listed as Unix timestamps and you probably want to convert them to datetime values.

The first function contains the API call with an input parameter for the starting_after ID.

//Web API call function
(startafter) as record =>
   let
    //The URL is a static string; the query string parameters and headers go in the options record
    Source = Json.Document(Web.Contents("https://api.intercom.io/contacts",[Query=[per_page="150",starting_after=startafter],Headers=[Accept="application/json", Authorization="Bearer <your access token goes here>"]])),
    data = try Source[data] otherwise null,
    pages = Source[pages],
    ttlpages = pages[total_pages],
    //The next record is missing on the last page, so the try returns null when there are no more pages
    nextkey = pages[next][starting_after],
    next = try nextkey otherwise null,
    res = [Data=data, Next=next, TotalPages=ttlpages]
   in
    res

It’s important that the URL in the Web.Contents function is a static string. If it is concatenated or dynamic in any way, the query will not be able to refresh in the Power BI service. All of the query string parameters can go in the Query argument of the Web.Contents options record. If you have multiple query string parameters, you can put them in brackets with commas separating them. You can do the same with multiple headers.

This function attempts the API call and returns null if it encounters an error. If the call succeeds, it retrieves the specific JSON objects needed from the response. The next two lines retrieve the starting_after value. The function returns the contact data, the starting_after value, and the total_pages value.

I used the following query to call that function.

let
GeneratedList = List.Generate(
   ()=>[i=0, res = fnGetOnePage("")],
   each [res][Data]<>null,
   each [i=[i]+1, res = fnGetOnePage([res][Next])],
   each [res][Data]),
    #"Converted to Table" = Table.FromList(GeneratedList, Splitter.SplitByNothing(), null, null, ExtraValues.Error),
    #"Expanded Column1" = Table.ExpandListColumn(#"Converted to Table", "Column1"),
    #"Expanded Column2" = Table.ExpandRecordColumn(#"Expanded Column1", "Column1", {"id", "workspace_id", "external_id", "role", "email", "name", "has_hard_bounced", "marked_email_as_spam", "unsubscribed_from_emails", "created_at", "updated_at", "signed_up_at", "last_seen_at", "last_replied_at", "last_contacted_at", "last_email_opened_at", "last_email_clicked_at", "browser", "browser_version", "browser_language", "os", "location", "custom_attributes", "tags", "notes", "companies", "opted_out_subscription_types"}, {"id", "workspace_id", "external_id", "role", "email", "name", "has_hard_bounced", "marked_email_as_spam", "unsubscribed_from_emails", "created_at", "updated_at", "signed_up_at", "last_seen_at", "last_replied_at", "last_contacted_at", "last_email_opened_at", "last_email_clicked_at", "browser", "browser_version", "browser_language", "os", "location", "custom_attributes", "tags", "notes", "companies", "opted_out_subscription_types"}),
    #"Expanded location" = Table.ExpandRecordColumn(#"Expanded Column2", "location", {"type", "country", "region", "city", "country_code"}, {"location.type", "location.country", "location.region", "location.city", "location.country_code"}),
    #"Expanded custom_attributes" = Table.ExpandRecordColumn(#"Expanded location", "custom_attributes", {"role", "brand", "Subdomain"}, {"custom_attributes.role", "custom_attributes.brand", "custom_attributes.Subdomain"}),
    #"Removed Columns" = Table.RemoveColumns(#"Expanded custom_attributes",{"opted_out_subscription_types", "tags", "notes", "companies", "role"}),
    #"Changed Type" = Table.TransformColumnTypes(#"Removed Columns",{{"id", type text}, {"workspace_id", type text}, {"external_id", type text}, {"email", type text}, {"name", type text}, {"has_hard_bounced", type logical}, {"marked_email_as_spam", type logical}, {"unsubscribed_from_emails", type logical}, {"created_at", Int64.Type}, {"updated_at", Int64.Type}, {"signed_up_at", Int64.Type}, {"last_seen_at", Int64.Type}, {"last_replied_at", Int64.Type}, {"last_contacted_at", Int64.Type}, {"last_email_opened_at", Int64.Type}, {"last_email_clicked_at", Int64.Type}}),
    #"Replace Null with 0" = Table.ReplaceValue(#"Changed Type",null,0,Replacer.ReplaceValue,{"created_at","updated_at","signed_up_at","last_replied_at","last_seen_at","last_contacted_at","last_email_opened_at","last_email_clicked_at"
}),
    #"Invoked Custom Function" = Table.AddColumn(#"Replace Null with 0", "created_at_dt", each fnUnixToDateTime([created_at])),
    #"Invoked Custom Function1" = Table.AddColumn(#"Invoked Custom Function", "updated_at_dt", each fnUnixToDateTime([updated_at])),
    #"Invoked Custom Function2" = Table.AddColumn(#"Invoked Custom Function1", "signed_up_at_dt", each fnUnixToDateTime([signed_up_at])),
    #"Invoked Custom Function3" = Table.AddColumn(#"Invoked Custom Function2", "last_seen_at_dt", each fnUnixToDateTime([last_seen_at])),
    #"Invoked Custom Function4" = Table.AddColumn(#"Invoked Custom Function3", "last_replied_at_dt", each fnUnixToDateTime([last_replied_at])),
    #"Invoked Custom Function5" = Table.AddColumn(#"Invoked Custom Function4", "last_contacted_at_dt", each fnUnixToDateTime([last_contacted_at])),
    #"Invoked Custom Function6" = Table.AddColumn(#"Invoked Custom Function5", "last_email_opened_at_dt", each fnUnixToDateTime([last_email_opened_at])),
    #"Invoked Custom Function7" = Table.AddColumn(#"Invoked Custom Function6", "last_email_clicked_at_dt", each fnUnixToDateTime([last_email_clicked_at])),
    #"Fix Null Values" = Table.ReplaceValue(#"Invoked Custom Function7",DateTime.From("1970-01-01"),null,Replacer.ReplaceValue,{"created_at_dt","updated_at_dt","signed_up_at_dt","last_replied_at_dt","last_seen_at_dt","last_contacted_at_dt","last_email_opened_at_dt","last_email_clicked_at_dt"
}),
    #"Removed Columns1" = Table.RemoveColumns(#"Fix Null Values",{"created_at", "updated_at", "signed_up_at", "last_seen_at", "last_replied_at", "last_contacted_at", "last_email_opened_at", "last_email_clicked_at"})
in
    #"Removed Columns1"

The List.Generate call builds the list of pages by repeatedly calling my first function. It sets an iterator variable to 0 and then calls the function, which returns the first page of data along with the total pages and the starting_after ID. As long as data is returned, it makes the API call again with the previously returned starting_after ID. This creates a list of lists that can be converted into a table of records. Then the records can be expanded to fields.
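
If you haven’t used List.Generate before, it takes four functions: an initial state, a condition, a next-state function, and a selector. Here is a minimal sketch, unrelated to the Intercom API, that generates the numbers 1 through 5 just to show the pattern.

//Minimal List.Generate example: initial state, condition, next state, selector
let
    Numbers = List.Generate(
        () => [i = 1],      //initial state: start the state record with i = 1
        each [i] <= 5,      //condition: keep going while i is 5 or less
        each [i = [i] + 1], //next state: increment i
        each [i]            //selector: return i for each iteration
    )
in
    Numbers //returns {1, 2, 3, 4, 5}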

I expanded several columns out into multiple columns. Then I adjusted the data types of my columns to the correct types (they came back as Any data type).

There were several columns that contained Unix timestamps. Each of the custom function calls returns the datetime version of one of those values. I needed to handle null values in the timestamp conversion, so I replaced all the null timestamps with 0, converted them, and then converted the resulting datetime value of 1-Jan-1970 back to null. I did the replace for all of the timestamp columns in one step. Then I removed the original columns that contained the Unix timestamps, as they were not analytically relevant for me.

Below is my Unix timestamp to datetime conversion function.

//Unix timestamp to datetime conversion function
(UnixTime as number) as datetime =>
let
    //Add the number of seconds to the Unix epoch of 1 January 1970
    DT = #datetime(1970, 1, 1, 0, 0, 0) + #duration(0, 0, 0, UnixTime)
in
    DT

Advice and warnings

When using an API key that must be passed in the headers, it is safest to use a custom connector. Otherwise, you have to embed your API key in the M code, as shown above. When the query is sent to Intercom, it is encrypted using HTTPS, but anyone who opens your PBIX file would have access to the key. Your key will also be captured in the Power BI logs, and anyone who manages to intercept your web request and decrypt your traffic would have access to it. This is not ideal. But creating a custom connector requires more advanced code and a gateway to make it usable in the Power BI service. With either option, you will choose Anonymous authentication for the data source.

Be sure to use the RelativePath and Query options in the Web.Contents call. This is necessary to make the query refreshable in the service. The value passed to the first parameter of Web.Contents must be a static string and must be valid on its own (no errors returned).
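
For illustration, here is a sketch of that pattern. It is not the exact query from this post (my function above passes the full contacts URL as the static string and only uses Query), but the same idea applies when part of the path needs to vary. The Authorization header is omitted for brevity.

//Sketch only: static base URL, with the changing pieces supplied via RelativePath and Query
let
    Source = Json.Document(
        Web.Contents(
            "https://api.intercom.io",
            [
                RelativePath = "contacts",
                Query = [per_page = "150"],
                Headers = [Accept = "application/json"]
            ]
        )
    )
in
    Source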

After publishing your report, you’ll need to set the credentials for the dataset before you can refresh it. Be sure to check the Skip test connection box. Otherwise, your credentials update will fail.

url: https://api.intercom.io/contacts
Authentication method: anonymous
Privacy level: organizational 
Skip test connection: yes
The Skip test connection option for the web data source in the dataset settings in PowerBI.com

Even though we are using anonymous authentication, you can still choose an Organizational privacy level.

If you want only app users from your Intercom contacts list, you need to filter the results on role = “user”, since contacts also include leads. The role attribute was in the custom attributes returned by the API.
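
Assuming you kept the expanded custom_attributes.role column from the query above, the filter can be added as one more step at the end of that query. This is just a sketch; adjust the step and column names to match your own query.

    //Keep only contacts whose custom role attribute is "user"
    #"Filtered to Users" = Table.SelectRows(#"Removed Columns1", each [#"custom_attributes.role"] = "user")

Then reference #"Filtered to Users" in the in clause instead of #"Removed Columns1".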

It took a bit of experimentation to get this working, so I thought I would share what I found to work. As always, Power BI is constantly changing, and a better way to do this may become available in the future.

If you would like this development experience to improve, there are some related Power BI Ideas you can vote for.

Azure, Microsoft Technologies, SQL Server

Check if File Exists Before Deploying SQL Script to Azure SQL Managed Instance in Azure Release Pipelines

I have been working in Azure DevOps pipelines a lot recently, helping clients set up automated releases. Many of my clients are not in a place where automated build and deploy of their SQL databases makes sense, so they deploy using change scripts that are reviewed before deployment.

We chose release pipelines over YAML pipelines because they were easier for the client to manage and pretty quick to set up. While I had done this before, I had a couple of new challenges:

  • I was deploying to an Azure SQL managed instance that had no public endpoint.
  • There were multiple databases for which there may or may not be a change script to execute in each release.

This took a bit longer than I expected, and I enlisted my coworker Bill to help me work through the final details.

Setup

There are a few prerequisites to make this work.

Self-hosted Agent

First, you need to run the job on a self-hosted agent that is located on a virtual machine that has access to the managed instance. This requires downloading and installing the agent on the VM. We ended up installing this on the same VM where we have our self-hosted integration runtime for Azure Data Factory, at least for deployment from dev to test. This was ok because the VM had enough resources to support both, and deployments are not long or frequent.

When you create a new self-hosted agent in Azure DevOps, a modal dialog appears, which provides you with a link to download the agent and a command line script to install and configure the agent.

Dialog from Azure DevOps titled "Get the agent". There are options for Windows, macOS, and Linux. There is a button to copy a link to download the agent. Then there is a set of command line scripts to create the agent, and configure the agent.
Dialog showing the instructions to download and install the Azure Pipelines Self-hosted agent

We configured the agent to run as a service, allowing it to log on as Network Service. We then could validate it was running by opening the Services window and looking for Azure Pipelines Agent.

You must add an inbound firewall rule for the VM to allow the traffic from the IP addresses related to dev.azure.com. This can be done in the Azure portal on the Networking settings for the VM.

Then you must install the SqlServer module for PowerShell on the VM. And you must also install sqlpackage on the VM. We also updated the System Path variable to include the path to sqlpackage (which currently gets installed at C:\Program Files\Microsoft SQL Server\160\DAC\bin).

DevOps Permissions

Azure DevOps has a build agent with a service principal for each project. That service principal must have Contribute permissions on the repo in order to follow my steps.

Database Permissions

If you want the Azure DevOps build agent to execute a SQL script, the service principal must have appropriate permissions on the database(s). The service principal will likely need to create objects, execute stored procedures, write data, and read data. But this depends on the content of the change scripts you want to execute.

The Pipeline

In this project we had one pipeline to release some analytics databases and Azure Data Factory assets from dev to test.

Therefore, we included the latest version of the repos containing our database projects and data factory resources as artifacts in our pipeline. While the database objects and data factory resources were related, we decided they weren’t dependent upon each other. It was ok that one might fail and one succeed.

Azure DevOps release pipeline with 2 artifacts (database code and ADF code) and 2 stages (Database deployment and ADF Deployment).

Side note: I love the ADF Tools extension for Azure DevOps from SQL Player. It is currently my preferred way to deploy Azure Data Factory. I find it is a much better experience than ARM template deployment. So that is what I used here in this ADF Deploy job.

Repo Setup

Getting back to databases, we had 3 databases to which we might want to deploy changes. We decided that we would set a standard of having one change script per database. We have an Azure DevOps repo that contains one solution with 3 database projects. These databases are highly related to each other, so the decision was made to keep them all in one repo. I added a folder to the repo called ChangeScripts. ChangeScripts contains two subfolders: Current and Previous. When we are ready to deploy database changes, we add a .sql file with the name of the database into the Current folder. For any deployment, there may be 1, 2, or 3 change scripts (1 for each database that has changes). When the release pipeline runs, the change scripts in the Current folder are executed and the files are renamed and moved into the Previous folder. This is why my DevOps service principal needed Contribute permissions on the repo.

Here are the tasks in my Database Deployment job:

  1. Check if SQL Scripts exist in the expected location.
  2. If a script exists for Database1, execute the script.
  3. If a script exists for Database2, execute the script.
  4. If a script exists for Database3, execute the script.
  5. Rename the change scripts to fit the pattern [DatabaseName][yyyyMMddHHmm].sql and move them to the Previous folder.

Configuring the Tasks

The first thing we must do is set the agent job to run using the agent pool containing the self-hosted agent. There is another important setting under Additional options for Allow scripts to access the OAuth token. This allows that service principal to authenticate and execute the SQL scripts, so make sure that is checked.

My first task, called Check for SQL Scripts, is a PowerShell task. I used an inline script, but you can also reference a file path that is accessible by the agent. My script looks for the three files and then sets a pipeline variable to true or false to indicate the presence of each file. These variables are used later in the process.

My task is configured with the working directory set to $(System.DefaultWorkingDirectory)/_Databases/ChangeScripts/Current. There are no environment variables or output variables.

Here’s my PowerShell script:

$DB1Script = "Database1.sql"
$DB2Script = "Database2.sql"
$DB3Script = "Database3.sql"

$IsDB1Script = Test-Path -Path $DB1Script -PathType Leaf
$IsDB2Script =  Test-Path -Path $DB2Script -PathType Leaf
$IsDB3Script = Test-Path -Path $DB3Script -PathType Leaf

Write-Host $IsDB1Script $IsDB2Script $IsDB3Script

# Set pipeline variables using the task.setvariable logging command so later tasks can read them
Write-Output ("##vso[task.setvariable variable=DB1Script;]$IsDB1Script")
Write-Output ("##vso[task.setvariable variable=DB2Script;]$IsDB2Script")
Write-Output ("##vso[task.setvariable variable=DB3Script;]$IsDB3Script")

We got stuck on the syntax for setting variables for a while. Most of the things we found in Microsoft Docs and on blogs were close, but didn’t work. Finally, we found this blog, which showed us the syntax that worked – be careful with the quotes, semi-colon, and parentheses.

My second task executes the script for Database1. It uses the Azure SQL Database deployment task. When you use this task, you can specify whether it deploys a dacpac or executes a SQL script. We chose to execute a SQL script that we stored in the same repo as the database projects. For this task, you must specify the Azure Subscription, the target server, and the target database. I set my authentication type to Service Principal, since I had granted access to the Azure DevOps service principal during setup. My Deploy type was set to SQL Script File. And my SQL Script location was set to $(System.DefaultWorkingDirectory)/_Databases/ChangeScripts/Current/Database1.sql. An important configuration on this task is to set Run this task to Custom Conditions. I specified my custom condition as eq(variables['DB1Script'], 'True'). This tells the task to execute only if the variable we set in the previous task is equal to True. It seems data types don’t really matter here – I set the variable in the PowerShell script as a boolean, but it was converted to a string in Azure DevOps.

My third and fourth tasks do the same activities for Database2 and Database3.

My last task, called Move Change Scripts to Previous Folder, is another inline PowerShell script. The script uses the variables set in the first task to determine if the file exists and needs to be moved. Because the script is working on a local copy on the VM, it must commit the changes back to the git repo after renaming and moving the file(s). The Working Directory for this task is set to $(System.DefaultWorkingDirectory)/_Databases/ChangeScripts.

Here’s the code:

Write-Host "Getting Files"
$path = Get-Location
Write-Host $path

$File1 = "Current/Database1.sql"
$File2 = "Current/Database2.sql"
$File3 = "Current/Database3.sql"

$IsDB1Script = Test-Path -Path $File1 -PathType Leaf
$IsDB2Script = Test-Path -Path $File2 -PathType Leaf
$IsDB3Script = Test-Path -Path $File3 -PathType Leaf

$DateString = Get-Date -Format "yyyyMMddHHmm"

$NewFile1 = "Previous/Database1$($DateString).sql"
$NewFile2 = "Previous/Database2$($DateString).sql"
$NewFile3 = "Previous/Database3$($DateString).sql"

Write-Host "Changing file locations"
git config --global user.email "myemail@domain.com"
git config --global user.name "My Release Pipeline"

if ($IsDB1Script)
{
    git mv $File1 $NewFile1
    Write-Host "Moved Database1 script"
}

if ($IsDB2Script)
{
    git mv $File2 $NewFile2
    Write-Host "Moved Database2 script"
}

if ($IsDB3Script)
{
    git mv $File3 $NewFile3
    Write-Host "Moved Database3 script"
}

# Commit the renamed files back to the repo so the Current folder is empty for the next release
git commit -a -m 'Move executed change scripts to Previous Folder'
git push origin HEAD:master

Anything you output from your PowerShell script using Write-Host is saved in the release logs. This is helpful for troubleshooting if something goes wrong.

As you can see above, you can execute git commands from PowerShell. This Stack Overflow question helped me understand the syntax.

And with those five steps, I’m able to execute change scripts for any or all of my three databases. We don’t anticipate adding many more databases, so this configuration is easily manageable for the situation. I wouldn’t do this (one step per database) for a situation in which I have dozens or hundreds of databases that may be created or deleted at any time.

Azure, Azure Data Factory, KQL, Microsoft Technologies

Looking at Activity Queue Times from Azure Data Factory with Log Analytics

I’ve been working on a project to populate an Operational Data Store using Azure Data Factory (ADF). We have been seeking to tune our pipelines so we can import data every 15 minutes. After tuning the queries and adding useful indexes to target databases, we turned our attention to the ADF activity durations and queue times.

Data Factory places the pipeline activities into a queue, where they wait until they can be executed. If your queue time is long, it can mean that the Integration Runtime on which the activity is executing is waiting on resources (CPU, memory, networking, or otherwise), or that you need to increase the concurrent job limit.

You can see queue time in the ADF Monitor by looking at the output of an activity.

Azure Data Factory pipeline execution details in the ADF monitor. The activity output is open, showing the resulting JSON. There is a property called durationInQueue, which contains the queue time in seconds. The result shown is 2 seconds.
Output from a stored procedure activity showing 2 seconds of queue time

But what if you want to see activity queue times across activities, across pipelines, or even across data factories? Then you need to output your logs to somewhere that makes this easier.

Send ADF Logs to Log Analytics

You can output your ADF logs to a storage account, to Log Analytics, to an event hub, or to a partner solution. I prefer Log Analytics because it’s easy to query and look for trends using KQL.

To configure the output to Log Analytics, you must create a Log Analytics workspace (if you don’t have an existing one) and add a diagnostic setting to the data factory resource. Once you have data feeding into Log Analytics, you can query it.

If you choose resource-specific destination tables in the diagnostic setting, you will find a table in Log Analytics called ADFActivityRun. This table contains a column called Output. The Output column contains the JSON we see in the ADF Studio Monitor app.

KQL has functions for parsing JSON and retrieving only the JSON objects I want to include. This means that I could write a query like the following.

ADFActivityRun
| extend queuetime = extractjson('$.durationInQueue.integrationRuntimeQueue',Output, typeof(int))
| where Status == 'Succeeded'
| where queuetime > 0
| project Start, End, PipelineName, ActivityType, ActivityName, dur = datetime_diff('second', End, Start), queuetime, PipelineRunId, ActivityRunId
| sort by queuetime desc

This query gives me a list of activities with successful executions that have queue times greater than zero. I can choose from any columns in the ADFActivityRun table, including the pipeline and activity names, start and end times, activity types, and run IDs. Duration is not an available column, so I calculated it as the difference between the start and end times. The queue time is buried in the JSON in the Output column, so I used the extractjson function to get the durationInQueue value.

Now that I know how to get the queue duration, I can look for trends across various slices of data. A query to get average queue time by activity type might look like the below.

ADFActivityRun
| where Status == 'Succeeded'
| where startofday(Start) == datetime(2022-01-04)
| extend queuetime = extractjson('$.durationInQueue.integrationRuntimeQueue', Output, typeof(int))
| summarize avg_queuetime = avg(queuetime) by ActivityType
| sort by avg_queuetime desc

In this query, I am retrieving activities with successful executions that started on January 4, 2022. I added my calculation to retrieve queue time. Then I calculated average queue time by activity type using the summarize operator and sorted the result descending by queue time.

I could have filtered on any other available column: pipeline name, activity name, integration runtime, resource ID, and more.

This information is available via the API as well, but a Log Analytics workspace can be spun up and running in minutes without having to write code to connect.

If you are used to writing SQL, the transition to KQL isn’t too bad. Check out the SQL to Kusto query translation page in Microsoft Docs.

Azure, Azure Data Factory, Microsoft Technologies

When You Can’t Change the Connected Git Repo on ADF

I was working on an Azure Data Factory project for a client who is new to ADF, and there was a miscommunication about the new Git Repo to be used for source control. Someone had created a new project and repo instead of using the existing one created for this purpose. This isn’t a big deal, as it’s easy enough to change in ADF Studio.

The Git Configuration page in Azure Data Factory Studio shows the connected repository and has a button to disconnect from the repository.
The Management Hub in Azure Data Factory Studio contains the Git configuration settings

In the Management Hub, you can change the Git configuration for the data factory, and there is a button near the top to disconnect the repo. You may need to do this if you run into conflicts in the publish branch or when you need to change repos.

It should be as easy as that, but I ran into a situation where it wasn’t.

Disconnect Button Unavailable

When I arrived at the Git configuration page, I found the Disconnect button to be disabled. This was confusing as I am an Owner and Data Factory Contributor on this resource.

I asked my client to disconnect the repo and moved on with the project, but I also logged some feedback for the Data Factory team. You can do this by selecting the Feedback button near the top right of the page in Azure Data Factory Studio. I have done this a few times and always received a prompt response.

The feedback button in ADF Studio is selected. It opens a new dialog that allows you to choose from two options: "I have feedback" and "I have a feature suggestion". From there, you can enter your email address and your feedback.
You can send feedback from within Azure Data Factory Studio

I sent my feedback and received an email response within a couple of days. It contained a few helpful bits of information.

To disconnect git from the ADF management tab, you must be in git mode and be able to access the repository. If you are not able to access the repository but have permissions to update the factory, you can remove the git configuration via the REST API. Here is documentation on this API call: Factories – Configure Factory Repo – REST API (Azure Data Factory) | Microsoft Docs. In this case, repoConfiguration should be set to null in the request body, and the rest of the PUT body should be the same as the existing settings.

Alternatively, the git connection can be removed from the ADF management tab by either gaining access to the repository or having another ADF user with access remove the configuration. I also understand that this is not an ideal experience, so I have filed a work item to remove the requirement to have access to the repository.

This identified my situation exactly. I did not have access to the new repo that had been connected to the data factory. That caused me to be unable to disconnect it.

Hopefully the work item that removes the requirement for repo access before you can disconnect the data factory will be completed soon, but if you run into this issue, you can resolve it on your own using the API or a colleague with repo access.

Azure Data Factory, Logic Apps, Microsoft Technologies

Copying large files from SharePoint Online

I recently worked on a project where we needed to copy some large files from a specified library in SharePoint Online. In that library, there were several layers of folders and many different types of files. My goal was to copy files that had a certain file extension and a file name that started with a specific string.

I began by comparing the capabilities of Azure Data Factory and Logic Apps. Logic Apps allows me to start the process via a webhook. It also has a SharePoint Online connector that provides easy AAD authentication. And there is a special operation that allows me to retrieve the properties (path, filename) of files without retrieving the files themselves. This makes it easy to loop through a list of files that is already filtered to only the files I want.

Logic Apps process: 1) When an HTTP request is received 2) Get files (properties only) 3) For each 4a) Success Response 4b) Failure response
Logic App that retrieves file metadata from SharePoint and then performs an action for each file in the list

Azure Data Factory does have a SharePoint Online connector, but you can’t use it to copy files. You must use the HTTP connector to copy files from SharePoint in Data Factory. And that requires you to make a web call first to get a bearer token to be used in the authentication for the source of the copy activity.

While the ADF options work, I thought I would go with the Logic App because it felt cleaner and easier for others to understand. That is until I tried to copy some large files and encountered the following error.

Http request failed as there is an error: 'Cannot write more bytes to the buffer than the configured maximum buffer size: 104857600.'.

It turned out there were some extremely large files to be copied from that SharePoint library. SharePoint has a default limit of 100 MB buffer size, and the Get File Content action doesn’t natively support chunking.

At that point I would have to figure out how to handle the chunking using an HTTP call. That made the Logic App no better than Data Factory for my purposes. And since I already had part of the larger process in Data Factory, I went back there to see how it handled chunking.

Copying a Subset of Files from SharePoint Online Using Data Factory

And it turns out you don’t have to do anything special in Data Factory to handle the chunking of the large files – it does that work for you.

So I set up a similar process to what I had in my Logic App in my ADF pipeline.

Lookup to get file properties with path to For Each loop
ADF pipeline that retrieves a list of files in a SharePoint library and performs an action on each file

First, I used a Lookup activity to get the list of files. When you do a lookup against a SharePoint Online dataset, you can use OData filters to limit the data returned by the lookup. In my case this was necessary because I only wanted 144 of the items out of the list of 4300. Lookups can only return 5,000 rows, so I definitely needed to add filters to my query. Also, folders are considered an item in the data returned, so if you are looking only for files, you need to filter out folders (either by content type or by the name of the file).

Settings for a Lookup activity in ADF. It uses a SharePoint Online source dataset. The Use query setting is set to Query. The Query setting contains the text $filter=endswith(Name, '.bak') and startswith(Name, 'Dama'). First row only is not enabled.
Data Factory Lookup against a SharePoint Online list using an OData query with filters

The results of the lookup, returning one item per file, are fed into a For Each activity. Inside the For Each activity, I placed a Web activity to get the necessary bearer token and a Copy activity to copy the file to blob storage.

Data Factory web activity with a path to a copy activity
Inside the For Each activity is a Web activity to retrieve a bearer token, followed by a Copy activity that copies the files to blob storage.

To get a bearer token, you need a service principal with the correct permissions in SharePoint. Then you make a POST to https://accounts.accesscontrol.windows.net/{tenantID}/tokens/OAuth/2.

You need the following header: Content-Type: application/x-www-form-urlencoded.

In the body of the HTTP call, you must include: grant_type=client_credentials&client_id={clientid}@{tenantid}&client_secret={secretvalue}&resource=00000003-0000-0ff1-ce00-000000000000/{SharePoint tenant}.sharepoint.com@{tenantID}.

Make sure to set Secure Output so your credentials aren’t written to the ADF logs. Preferably, you would store the client ID and client secret in Key Vault and use a web activity to look them up before getting the token.

You can then use the output from the web activity in your copy activity.

ADF Copy Activity Source Settings: Binary dataset. Path and Name are passed as parameters. The request method is GET. The additional header contains the auth token from the web activity.
Source settings for a Copy activity in ADF that copies files from SharePoint

The Additional Headers should be populated with the auth token as shown below:

Authorization: Bearer @{activity('WEB_GetToken').output.access_token}

Since I don’t need to read the file contents, just copy them to blob storage, I used a binary dataset. I parameterized the dataset and linked service so my For Each activity could feed it the path and name of each file. This is because the lookup provides the path separately from the file name (which includes the file extension).

Binary dataset with linked service pointing to the HTTP linked service and parameters for Path and Name. The value for Path is @dataset().Path. The value for Name is @dataset().Name. The relative URL is not populated.
Binary dataset using an HTTP linked service, used to retrieve files from SharePoint Online

The linked service uses the HTTP connector. It has two parameters added: path and name.

HTTP linked service with parameters for path and name. Authentication is anonymous.
HTTP linked service used to retrieve files in a SharePoint Online library in Data Factory

The base URL I used in the linked service is:

@{concat('https://{tenant name}.sharepoint.com/sites/{site name}/_api/web/GetFileByServerRelativeUrl(''',replace(linkedService().Path,' ','%20'),'/',replace(linkedService().Name,' ','%20'),''')/$value')}

If my file is located at https://mytenant.sharepoint.com/sites/site1/libraryname/folder1/folder2/folder3/myfile.CSV, the URL I need to retrieve the file is https://mytenant.sharepoint.com/sites/site1/_api/web/GetFileByServerRelativeUrl('/sites/site1/libraryname/folder1/folder2/folder3/myfile.CSV')/$value.

So once I add my destination info to the Copy Activity, I can copy all the files retrieved from my lookup to blob storage.

A Few Tricky Things

Everyone’s SharePoint tenant is set up differently. Your URL will vary based upon the configuration of your site collections/sites. When I tried this in my tenant, I didn’t have to include the “/sites/{site name}” before “/_api/web…”. But I did in my client’s tenant when I was building this.

We granted permissions to the service principal at the site collection level to get this to work. If there was a way to grant lower permissions, we couldn’t figure it out. Here is the XML we used when the SharePoint admin granted permissions.

<AppPermissionRequests AllowAppOnlyPolicy="true">
  <AppPermissionRequest Scope="http://sharepoint/content/sitecollection/web"
   Right="Read" />
  <AppPermissionRequest Scope="http://sharepoint/content/sitecollection/web/list"
   Right="Read" />
</AppPermissionRequests>

Your auth token is valid for 1 hour. If you copy a bunch of files sequentially, and it takes longer than that, you might get a timeout error. That is why I moved my web activity inside the For Each activity. If you know it’s only a few files and they are quick, you should just get one bearer token and use it for all the files.

Lastly, be sure to set retries and timeouts appropriately. Remember that the default timeout for an activity is 7 days, and that is probably not desirable.

It took a lot of trial and error for me to get this working, as I couldn’t find many blog posts or docs about this. Special thanks to Bill Fellows for helping me through it. I sincerely hope this blog post helps someone avoid my struggles.

Accessibility, Data Visualization, Microsoft Technologies, Power BI

What are those new buttons under tab order in Power BI?

If you’ve visited the Tab order area of the Selection Pane in Power BI in the last couple of months, you might have noticed some new buttons.

The selection pane in Power BI desktop with Tab order selected. There are three buttons underneath the Tab order heading.
Three new buttons for managing tab order in Power BI

The hover text on the first button says “Expand All”. This button is useful if you have grouped visuals. Groups are indicated by a caret to the left of the item in the tab order list.

The selection pane with tab order showing. The fourth item is a group titled Summary Cards. The items within the group are not shown.
Tab order for a report page containing one group

Selecting the Expand all button shows the individual objects within a group.

The selection pane with tab order showing. The fourth item is a group titled Summary Cards. The items within the group are shown in an indented list under the group name.
Tab order with the group expanded

The second button is the Collapse All button. It will collapse the groups so only the name of the group is shown and not the individual objects within the group.

The third button is a great new addition: Have tab order match visual order.

The Selection pane is shown with Tab order selected. The third button has hover text that reads "Have tab order match visual order".
The option to have tab order match visual order is the third button under Tab order.

This button will set the tab order for the visuals on the page to sort ascending by Y and then X coordinates. Let’s look at an example.

I have a report page containing 7 textboxes.

A Power BI report page with a box in the top left corner and a grid of 6 boxes underneath, spanning the entire width of the report. The order of the boxes appears random, but it matches the order in which they were added to the page.
The X,Y coordinates of each box are shown in the box. The original tab order is indicated by the numbers in the circle.

After clicking the button to have tab order match visual order, the tab order is changed as shown below.

A Power BI report page with a box in the top left corner and a grid of 6 boxes underneath, spanning the entire width of the report. The order of the boxes matches the Y and X coordinates of each visual, starting at the top left and moving down to the bottom right.
Tab order set with the top left visual being first and the bottom right visual being last

This is often the correct tab order that matches how we read the report visually. This little button can increase keyboard/screen reader accessibility in one second instead of taking a couple of minutes per page.

There will be times that this tab order will not be what you want. Some exceptions might be when you use visuals that have a different amount of space inside the visual container, so the containers are intentionally misaligned (according to the X,Y coordinates) in order for the content to appear visually aligned. Then you might need to customize your tab order a bit. Another exception might be if you have some buttons or links at the top right of the report page that you want a user to visit last (after the content of the report). In that case, you would customize your tab order to make the button last.

But the majority of the time, this option to make tab order match visual order is exactly what you need. I applaud the Power BI team for taking this step to make creating accessible reports a little easier.

Microsoft Technologies, Power BI, Workout Wednesday

Power BI, Maps, and Publish to Web

October 2021 is mapping month over at Workout Wednesday for Power BI. As part of our challenges, we build a sample report and use the Publish to Web functionality to share it on the website. While this has worked well all year, there are some visuals, including maps, that either do not support Publish to Web or require a different license to be used with it.

It’s frustrating to build a Power BI report that you plan to share, only to find that you can’t share it. So I thought it would be helpful to consolidate what I have found about the various map visuals and their support of Publish to Web.

Disclaimer: This information is correct as of October 14, 2021. This could change over time. This is not an exhaustive list of all the map visuals available for Power BI.

Map Visuals

6 map visuals on a Power BI report: a bubble map, filled map, shape map, ArcGIS map, Azure Map, and Mapbox map.
Examples of the 6 map visuals tested with Publish to Web

Note: I also tested several other AppSource visuals, but they failed to render in Power BI desktop. I may update this post if they start working again.

I hope this helps you plan your visuals when you need to publicly share a report that contains a map.

Excel, Microsoft Technologies, Power BI

Connect Excel to a Power BI Dataset in a Premium Workspace with a B2B User

Power BI offers the ability for users who have access to a dataset in the Power BI service (PowerBI.com) to connect to the dataset using Excel. Normally, this feature is referred to as Analyze in Excel. Once you connect Excel to your dataset, you can create Pivot Table reports or use Cube Functions.

There are currently limitations that mean this functionality isn’t supported for B2B (external) users. An external user is an Azure AD user that is based in another tenant and has been guested into the local AAD tenant. If you go to your dataset in PowerBI.com, choose Analyze in Excel, and then try to open the downloaded file and connect to the dataset, you will be met with connection errors.

But if you have your dataset in a workspace backed by Premium Per User or Premium capacity, you can use the XMLA endpoint to connect, even if you are using a B2B user!

Instead of using the Analyze in Excel functionality, you can connect to your dataset as if it were Analysis Services, using the XMLA endpoint. B2B users just need to make one adjustment to the server name they enter to make this work.

In Excel, locate the Get Data button. Select From Database and then From Analysis Services.

Get Data menu in Excel with the options From Database and From Analysis Services selected.
Connecting to a Power BI dataset using the XMLA endpoint in Excel is done in a similar manner as connecting to an Analysis Services database

Open a browser window and go to the settings of the Power BI workspace that contains the dataset to which you want to connect.

Settings for a Power BI workspace called Demo Reports.
Power BI Premium workspaces of any kind should have the workspace connection string listed in the settings pane

If your workspace is backed by Premium capacity, you will be able to see this in the settings and the workspace connection will be available for you to copy. If you are a member user (not external) you could copy this info into the Server Name box of the Data Connection Wizard and go on your way.

If you are a B2B user, you need to make an adjustment as noted in Microsoft Docs. You need to replace “myorg” in the workspace connection with your primary domain name. If you have access to the Azure portal, you can find the primary domain name on the overview page for the Azure Active Directory.

Overview page in the Azure Portal for Azure Active Directory with the Primary Domain circled under basic information.
The tenant UPN, also called the primary domain, can be found in the Azure Portal on the AAD overview page

So if the workspace connection from the Power BI service is:
powerbi://api.powerbi.com/v1.0/myorg/Demo%20Reports

And your primary domain is:
mysupercooldomain.com

Then you would change the workspace connection to:

powerbi://api.powerbi.com/v1.0/mysupercooldomain.com/Demo%20Reports

Once you have populated the server name with the workspace connection string, change the logon credentials to “Use the following user name and password” but leave the credentials blank. Once you select the Next button, you will be prompted for your Azure credentials.

Then you will be able to select the desired dataset from the workspace and be on your way to making connected Excel reports.

Azure, Azure Data Factory, DCAC, Microsoft Technologies

Slides and Video from Building a Regret-free Foundation for your Data Factory Now Available

Last week, Kerry and I delivered a webinar with tips on how to set up your Data Factory. We discussed version control, deployment, naming conventions, parameterization, documentation, and more.

Here’s our agenda from the presentation.

Slide listing the top regrets of Data Factory users:
  • Poor resource organization in Azure
  • Lack of naming conventions
  • Inappropriate use of version control
  • Tedious, manual deployments
  • No/inconsistent key vault usage
  • Misunderstanding integration runtimes
  • Underutilizing parameterization
  • Lack of comments and documentation
  • No established pipeline design patterns
List of top regrets from Data Factory users that they wish they had understood from the beginning

If you missed the webinar, you can watch it online now. Just go to the DCAC website, fill in the required fields with your info, and the video will be shown.

If you’d like a copy of the slides, you can download the PDF here. There is a list of helpful links at the end that you may want to check out.

I hope you enjoyed our webinar. Leave me a comment if you have other experiences with ADF where a design or configuration choice you made in the beginning was difficult or tedious to fix later. Help other ADF developers avoid those mistakes.