Azure, Databricks, Microsoft Technologies, Unity Catalog

Creating a Unity Catalog in Azure Databricks

Unity Catalog in Databricks provides a single place to create and manage data access policies that apply across all workspaces and users in an organization. It also provides a simple data catalog for users to explore. So when a client wanted to create a place for statisticians and data scientists to explore the data in their data lake using a web interface, I suggested we use Databricks with Unity Catalog.

New account management and roles

There are some Databricks concepts that may be new for you in Azure when you use Unity Catalog. While Databricks workspaces are mostly the same, you now have account-level (organization-level) roles. This is necessary because Unity Catalog crosses workspaces.

A Databricks account covers all workspaces in your Azure tenant. There are 3 account-level roles: account admin, metastore admin, and account user. Underneath the account are one or more workspaces. Workspace roles include workspace admins and workspace users.
There are two levels of accounts and admins in Azure Databricks

Users and service principals created in a workspace are synced to the account as account-level users and service principals. Workspace-local groups are not synced to the account. There are now account-level groups that can be used in workspaces.

To manage your account, you can go to https://accounts.azuredatabricks.net/. If you log in with an AAD B2B user, you’ll need to open the account portal from within a workspace. To do this, go to your workspace and select your username in the top right of the page to open the menu. Then choose the Manage Account option in the menu. It will open a new browser window.

The user menu in a Databricks workspace with the fourth option, Manage Account emphasized.
To launch the account console, choose Manage Account from the menu under your username in a workspace

Requirements

To create a Unity Catalog metastore you’ll need:

  • An Azure Databricks workspace on the Premium pricing tier
  • An ADLS Gen2 storage account with a container to hold the metastore’s metadata and managed data
  • An access connector for Azure Databricks
  • The account admin role in the Databricks account console

The pricing tier is set on the Basics page when creating the Databricks workspace.

The Basics page of the Create an Azure Databricks workspace workflow in the Azure Portal has a pricing tier option near the bottom. Make sure this is set to Premium (+ Role-based access controls).
If you plan to use Unity Catalog, be sure to select the Premium pricing tier for your Databricks workspace

High-level steps

  1. Create a storage container in your ADLS account.
  2. Create an access connector for Azure Databricks.
  3. Assign the Storage Blob Data Contributor role on the storage account to the managed identity for the access connector for Azure Databricks.
  4. Assign yourself account administrator in the Databricks account console.
  5. Create a Unity Catalog metastore.
  6. Assign a workspace to the Unity Catalog metastore.

The storage container holds the metadata and any managed data for your metastore. You’ll likely want to use this Unity Catalog metastore rather than the default hive metastore that comes with your Databricks workspace.

The access connector will show up as a separate resource in the Azure Portal.

The Azure Portal showing two resources in a list. The type of the first resource is Access Connector for Azure Databricks. The type of the second resource is Azure Databricks Service.

You don’t grant storage account access to end users – you’ll grant user access to data via Unity Catalog. The access connector allows the workspace to access the data in the storage account on behalf of Unity Catalog users. This is why you must assign the Storage Blob Data Contributor role to the access connector.
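
If you prefer to script this role assignment, here’s a minimal sketch using the Az PowerShell module. The object ID, subscription ID, and resource names are placeholders you’d replace with your own values.

# Assign Storage Blob Data Contributor on the storage account to the
# managed identity of the access connector (all IDs below are placeholders)
New-AzRoleAssignment `
    -ObjectId "<managed-identity-object-id-of-the-access-connector>" `
    -RoleDefinitionName "Storage Blob Data Contributor" `
    -Scope "/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.Storage/storageAccounts/<storage-account-name>"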

The Add role assignment page for Access Control on the storage account. The selected role is Storage Blob Data Contributor. Assign access to is set to Managed Identity. The member listed is DBxAccessConnector, which is the access connector for Azure Databricks.
Assigning the Storage Blob Data Contributor role to the managed identity for the Access Connector for Azure Databricks

The confusing part of setup

If you are not a Databricks account administrator, you won’t see the option to create a metastore in the account console. If you aren’t an AAD Global Admin, you need an AAD Global Admin to log into the Databricks account console and assign your user to the account admin role. It’s quite possible that the AAD Global Admin(s) in your tenant don’t know and don’t care what Databricks or Unity Catalog is. And most data engineers are not global admin in their company’s tenant. If you think this requirement should be changed, feel free to vote for my idea here.

Once you have been assigned the account admin role in Databricks, you will see the button to create a metastore in the account console.

The Databricks account console has a large green button labeled Create a metastore, which is only visible to account admins.
The Create a metastore button is only available for Databricks account admins

One or multiple metastores?

The Azure documentation recommends only creating one metastore per region and assigning that metastore to multiple workspaces. The current recommended best practice is to have one metastore that spans environments, business units, and teams. Currently, there is no technical limitation keeping you from creating multiple metastores. The recommendation is pointing you toward a single, centralized place to manage data and permissions.

Within your metastore, you can organize your data into catalogs and schemas. So it could be feasible to use only one metastore if you have all Databricks resources in one region.

In my first metastore, I’m using catalogs to distinguish environments (dev/test/prod) and schemas to distinguish business units. In my scenario, each business unit owns their own data. Within those schemas are external tables and views. Because it is most likely that someone would need to see all data for a specific business unit, this makes it easy to grant user access at the schema level.

I’ll save table creation and user access for another post. This post stops at getting you through all the setup steps to create your Unity Catalog metastore.

If you prefer learning from videos, I’ve found the Unity Catalog videos on the Advancing Spark YouTube channel to be very helpful.

Azure, Azure Data Factory, Microsoft Technologies

The Reason We Use Only One Git Repo For All Environments of an Azure Data Factory Solution

I’ve seen a few people start Azure Data Factory (ADF) projects assuming that we would have one source control repo per environment, meaning that you would attach a Git repo to Dev, and another Git repo to Test and another to Prod.

Microsoft recommends against this, saying:

“Only the development factory is associated with a git repository. The test and production factories shouldn’t have a git repository associated with them and should only be updated via an Azure DevOps pipeline or via a Resource Management template.”

Microsoft Learn content on continuous integration and delivery in Azure Data Factory

You’ll find the recommendation of only one repo connected to the development data factory echoed by other ADF practitioners, including myself.

We use source control largely to control code changes and for disaster recovery. I think the desire to use multiple repos is about disaster recovery more than anything. If something bad happens, you want to be able to access and re-deploy your code as quickly as possible. And since we start building in our repo-connected dev environment, some people feel “unprotected” in higher environments.

But Why Not Have All the Repos?

For me, there are two main reasons to have only one repository per solution tied only to the development data factory.

First, having multiple repos adds more complexity with very few benefits.

Deployment Complexity

Having a repo per environment adds extra work in deployment. I can see no additional benefits for deployment from having a repo per environment for most ADF solutions. I won’t say never, but I have yet to encounter a situation where this would help.

When you deploy your data factory, whether you use ARM templates or individual JSON files, you are publishing to the live ADF service. This is what happens when you publish from your collaboration branch in source control to the live version in your development data factory. If you follow normal deployment patterns, you deploy from the main (if you use JSON files) or adf_publish (if you use the ARM template) branch in source control to the live service in Test (or whatever your next environment is). If your Test data factory is connected to a repo, you need to figure out how to get your code into that repo.

Would you copy your code to the other repo before you deploy to the service? What if something fails in your deployment process before deployment to the live service is complete? Would you want to revert your changes in the Git repo?

You could deploy to the live service first and skip that issue. But you still need to decide how to merge your code into the Test repo. You’ll need to handle any merge conflicts. And you’ll likely need to allow unrelated histories for the merge to work, so when you look back in your commit history, it probably won’t make sense.

A comparison of the normal process vs the process with one repo per environment. In the normal process, you work in ADF Studio and save to your Git repo, then publish to higher environments from that Git repo. With multiple repos, you have the extra step per environment of getting the code you just deployed to the live service into the associated environment's repo.
Comparison of having one repo vs one repo per environment shows the need for extra steps if you add repos

At best, this Test repo becomes an additional place to store the code that is currently in Test. But if you are working through a healthy development process, you already had this code in your Dev repo. You might have even tagged it for release, so it’s easy to find. Your Git repo is likely already highly available, if it is cloud-based. In my mind, this just creates one more copy of your code that can get out of date, and one more deployment step. If you just want a copy of what is in Test or Prod for safe keeping, you can always export the resource’s ARM template. But if I were to do that, I would be inclined to keep it in blob storage or somewhere outside of a repo, since I already have the code in a repo. This would allow me to redeploy if my repo weren’t available.

Then, once you have sufficiently tested your data factory in Test, would you deploy code to Prod from the Test repo or from the Dev repo?

If you have the discipline and DevOps/automation capabilities to support these multiple repos, you likely don’t want to do this, unless you have requirements that mandate it. That brings me to my second reason.

Deviation from Common DevOps Practice

Having a repo per environment is a deviation from common software engineering practices. Most software engineering projects do not have separate repos per environment. They might have separate repos for different projects within a solution, but that is a different discussion.

If you have a separate repo for dev and test, what do you do about history? I think there is also a danger that people would want to make changes in higher environments instead of working through the normal development process because it seems more expedient at the time.

When you hire new data engineers or dev ops engineers (whoever maintains and deploys your data factories), you would have to explain this process with the multiple repos as it won’t be what they are expecting.

Unless you have some special requirements that dictate this need, I don’t see a good reason to deviate from common practice.

Common Development Process

For a data factory project, we must define a collaboration branch, usually Main. This branch is the only branch that can publish to the live service in your Dev data factory. When you need to update your data factory, you make a (hopefully short-lived) feature branch based off of your collaboration branch. My preference for a medium to large project is to have the Main branch, an Integration branch, and one or more feature branches. The Integration branch brings multiple features together for testing before the final push to Main. On smaller projects with one or two experienced developers, you may not need the integration branch. I find that I like the integration branch when I am working with people who are new to ADF, as it gives me a chance to tweak and execute new pipelines before they get to Main.

Developers work in the feature branches and then merge into the integration branch as they see fit. They resolve any errors and make any final changes in integration and then create a pull request to get their code into Main. Once the code is merged into Main and published to the live service (either manually or programmatically), the feature branches and Integration branch are deleted, preparing you to start the next round of development. Triggering the pipelines in the live service after publishing gives you a more realistic idea of execution times as ForEach activities may not run in parallel when executed in debug mode.

We start with the Main branch. Integration and feature branches start as a copy of Main. Changes are made to Feature branches. When feature development is complete, the feature branch is merged into Integration. Once all features are merged into Integration, final review and updates are made, and then code is deployed to Main and published to the service.
The development starts by creating feature branches and an integration branch from Main. Code is merged into Integration by the developer. Code in integration is moved to Main via pull request. Code from Main is published to the live service.

The code in Main should represent a version of your data factory that is ready to be deployed to Test. Code is deployed from Dev to Test according to your preference—I won’t get into all the options of JSON files vs ARM templates and DevOps pipelines vs PowerShell/custom code in this post.

You perform unit testing, integration testing, and performance testing (and any other type of testing as well, but most people aren’t really doing these three in any sufficient manner) in your Test data factory. If everything looks good, you deploy to Production. If there are issues, you go back to your development data factory, make a new feature branch from Main, and fix the issue.

If you find a bug in production, and you can’t use the current version of code in Main, you might want to create a hotfix/QFE. Your hotfix is still created in your development data factory. The difference is that instead of creating a feature branch from Main, you create the branch from the last commit deployed to production. If you are deploying via ARM templates, you can export the ARM template from that hotfix branch and manually check it in to the adf_publish branch. If you deploy from JSON files, selective deployment is a bit easier. I like to use ADF Tools for deployment, which allows me to specify which files should be deployed, so I can do a special hotfix deployment that doesn’t change any other objects that may have already been updated in Main in anticipation of the next deployment.
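
As a rough sketch (the branch and tag names here are hypothetical), creating a hotfix branch from the commit last deployed to production might look like this:

git fetch --all --tags
# branch from the tag (or commit SHA) that was last deployed to production
git checkout -b hotfix/fix-copy-pipeline release/prod-2022-06
# make the fix in ADF Studio on this branch, then deploy only the changed files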

In Summary

Having a repo per environment doesn’t technically break anything, but it adds complexity without significant benefits. It adds steps to your deployment process and deviates from industry standards. I won’t go so far as saying “never”, as I haven’t seen every project scenario. If you were considering going this route, I would encourage you to examine the reasons behind it and see if doing so would actually meet your needs and if your team can handle the added complexity.

Consulting, Microsoft Technologies, Personal

Why I tweet about work and personal topics from the same account

Over the last few years, I’ve had a few people ask me why I don’t create two Twitter accounts so I can separate work and personal things.

I choose to use one account because I am more than a content feed, and I want to encourage others to be their whole selves. I want to acknowledge that we don’t work in isolation from what’s going on in the greater world.

I love learning things from people on Twitter. I find a lot of quick tips and SQL and Power BI blog posts from Twitter. I love seeing all the interesting data visualizations shared on Twitter.

A wooden table with a phone and a cup of espresso. The phone is opening the Twitter mobile app, and you can see the twitter icon in the center of the screen.
I often start my mornings with caffeine and Twitter.

But I also love learning personal things about people. I love to see photos of your dogs and views while hiking. I like knowing that my friend Justin is really into wrenches. And my coworker Joey (who I followed on Twitter long before I worked with him) is really into cycling. And my friend Matt is into racing. I followed all these people before I met them in person. Many of my Twitter friends became “IRL” friends after we met at conferences.

I definitely lurked around the SQL community (Power BI didn’t exist yet) for a while before I ever worked up the courage to say anything to anyone. And I started out with mostly data/work-related tweets. But as time went on, I realized how much I appreciated it when others shared personal info about themselves; it helped me get to know them better. And I became more comfortable sharing more about me. So now I’m sort of giving back, with both professional and personal information.

Note: I do this personal/professional crossover only on Twitter, because I have deemed that to be an appropriate audience for both. I don’t share many personal things on LinkedIn. And I don’t share professional things on Instagram or Facebook. That’s my personal preference.

Are there risks to this? Absolutely.

People might stop following me because they don’t care for the personal content. My political opinions or obsession with my dog might turn some people off. Someone might be offended by my tattoos or my occasional use of profanity (which I never direct at someone and use more to express frustration/excitement/surprise). Or they may tire of my frequent use of gifs.

I know that potential and current clients see what I tweet. And it can and does affect their opinion of me. But when you hire me as your consultant, you get my personality and personal experiences as well as my tech knowledge. So, I don’t see it as being so very different from meeting me in real life at a conference or at another social event.

So far, it’s been an overall positive experience of having IRL conversations with clients about something I tweeted that they found helpful or entertaining. I do make a point not to name specific clients or projects unless I have their permission to do so (sometimes there are legitimate exceptions). I respect client privacy and confidentiality online and in person.

Before I could get to this place, I had to be secure in myself and secure in my employment and professional network. I recognize that not everyone will like me. That is true in person and on Twitter. And that’s ok. If you want to unfollow me, I’m ok with that. If you want to engage in conversations only about work stuff, that’s great. Feel free to mute words to tune out the rest of my tweets (your best bets are “Colorado”, “Denver”, “dog”, “kayak”, and “hike”). If you want to talk only about dogs and nature and my adorable and efficient camper, that’s also fine.

If you dig deep enough, I’m sure you can find some tweet that wasn’t my best moment. I’m sure I’ve said something regrettable in the 14 years since I joined Twitter. But I’m going to extend myself a little grace and remember that I’m human. And I’ll accept feedback if you think something I’ve said was unkind or made you feel unwelcome.

There is also a risk that someone can use this personal info against me, for harassment or identity theft. That is a risk for anyone sharing anything personal online. For now, I have assessed the risks and the rewards, and I find the rewards to outweigh the risks. I may decide differently later.

Do I recommend this for you?

Here’s the thing: I don’t think there are absolutes when it comes to how people use social media. If it makes you happy and it’s not hurting anyone, it’s probably fine.

It’s important to me that we remember that the people who teach us Azure and SQL and Power BI are people with non-work interests and personal struggles and interesting life experiences. And their more personal content gives me ways to relate to them outside of technology. Sometimes I learn really useful things from them, like the right type of lubricant to fix a squeaky door. Sometimes I notice a current event in their life that I can use to start a conversation at a conference.

Basically, I choose to use Twitter in a more human way. It’s working pretty well for me so far. You can decide if you have the desire and ability to do that. When this way of interacting with people stops being rewarding for me, I’ll reassess and take a different approach.

Azure SQL DB, Azure SQL DW, Data Warehousing, Microsoft Technologies, Python, SQL Server

pandas.DataFrame.drop_duplicates and SQL Server unique constraints

I’ve now helped people with this issue a few times, so I thought I should blog it for anyone else that runs into the “mystery”.

Here’s the scenario: You are using Python, perhaps in Azure Databricks, to manipulate data before inserting it into a SQL Database. Your source data is a flattened data extract and you need to create a unique list of values for an entity found in the data. For example, you have a dataset containing sales for the last month and you want a list of the unique products that have been sold. Then you insert the unique product values into a SQL table with a unique constraint, but you encounter issues on the insert related to unique values.

Given a data frame with one column containing product names, you might write some Python like df.drop_duplicates(subset="ProductName").

This would give you a list of distinct product names. But Python is case sensitive. So if you have product “abc123” and product “ABC123” those are considered distinct.

Case sensitivity in SQL Server

SQL Server is case insensitive by default. While you can use a case sensitive collation, most databases don’t.
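
If you’re not sure which collation your database uses, you can check it with a quick query:

SELECT DATABASEPROPERTYEX(DB_NAME(), 'Collation') AS DatabaseCollation;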

Let’s say I create a table to contain products and add a unique constraint on product name. Adding a unique constraint means that I cannot insert two values into that column that are the same.

CREATE TABLE [dbo].[Products] (
    [ProductName] nvarchar(100) NOT NULL
)
GO

ALTER TABLE [dbo].[Products] 
ADD CONSTRAINT UQ_Product_ProductName UNIQUE (ProductName)

I can insert my first row.

Insert into dbo.Products (ProductName)
values ('abc123')

Now I try to insert another row where the letters are capitalized.

Insert into dbo.Products (ProductName)
values ('ABC123')

This fails with the following error:

Violation of UNIQUE KEY constraint 'UQ_Product_ProductName'. Cannot insert duplicate key in object 'dbo.Products'. The duplicate key value is (ABC123).

This is because my database collation is set to SQL_Latin1_General_CP1_CI_AS (the CI means case insensitive).

Potential solution

If you need to get to a list of case insensitive distinct values, you can do the following:

  1. Create a column that contains the values of the ProductName converted to lower case.
  2. Run drop_duplicates on the lowercase column.
  3. Drop the lowercase column.

You can see code on this Stack Overflow solution as an example (as always, please read it and use at your own risk).
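
Here’s a minimal sketch of that approach in pandas (the data frame and column names are just illustrative):

import pandas as pd

df = pd.DataFrame({"ProductName": ["abc123", "ABC123", "xyz789"]})

# add a lowercase helper column, deduplicate on it, then drop the helper
df["ProductNameLower"] = df["ProductName"].str.lower()
df = df.drop_duplicates(subset="ProductNameLower")
df = df.drop(columns="ProductNameLower")
# df now has one row per case-insensitive product name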

Of course, you need to be sure that you don’t need to keep both (differently cased) versions of the value as distinct before you follow the above steps. That’s a business/data decision that must be considered. If you are fine to only have one value regardless of case, this should work. Be careful if you further use this data in Python. pandas.DataFrame.merge (similar to a SQL join) is case sensitive, as are most Python functions. Make sure you are handling your data correctly there, or just do your joins before you deduplicate.

Accessibility, Data Visualization, Microsoft Technologies, Power BI

Quality Checks for your Power BI Visuals

For more formal enterprise Power BI development, many people have a checklist to ensure data acquisition and data modeling quality and performance. Fewer people have a checklist for their data visualization. I’d like to offer some ideas for quality checks on the visual design of your Power BI report. I’ll update this list as I get feedback or new ideas.

Checklist illustration by Manypixels Gallery on IconScout

The goal of my data visualization quality checklist is to ensure my report matches my intended message, audience, and navigation.

There are currently 4 sections to my PBI data viz quality check:

  1. Message check
  2. Squint test
  3. Visual components check
  4. Accessibility check

Message check

  • Can you explain the purpose/message of your report in a single sentence?
  • Can you explain how each visual on the page supports that purpose/message?

I use the term purpose more often when I have a report that supports more exploratory data viz, allowing users to filter and navigate to find their own meaning in their own decision contexts. Message is much easier to define in explanatory data viz, where I intend to communicate a (set of) conclusion(s). My purpose or message statement often involves defining my intended audience.

If you cannot define the purpose/message of your report page, your report may be unclear or unfocused. If you can’t identify how a visual supports the purpose/message, you may consider removing or changing the visual to improve clarity and usefulness.

Squint test

You can perform a squint test by taking a step back and squinting while viewing your report page. Alternatively, you can use a desktop application or browser add-in that blurs the page, simulating farsightedness. Looking at the blurry report page helps you evaluate the visual hierarchy.

  • What elements on the page stand out? Should they? Areas of high contrast in color or size stand out. People read larger things first.
  • Does the page seem balanced? Is there significantly more white space or more bright color on one side of the page?
  • Does the page background stand out more than the foreground?
  • When visually scanning an image-heavy page (which follows a Z-pattern in Western cultures), does the order of items on the page make sense? Did you position explanatory text before the chart that needs the explanation? If there are slicers, buttons, or other items that require interaction before reading the rest of the page, are they placed near the top left?
  • Is there enough space between the items on the page to keep the page from feeling overly busy? Is it easy to tell where one visual ends and another begins?

Visual components check

I have two levels of visual components checks: reviewing individual visuals and reviewing the visuals across a report page.

Individual visuals components check:

  • Do charts have descriptive, purposeful titles? If possible, state a conclusion in the title. Otherwise, make it very clear what people should expect to find in your charts so they can decide if it’s worth the effort to further analyze them.
  • Are chart backgrounds transparent or using low-saturation colors? We don’t want a background color standing out more than the data points in the chart.
  • Are bright colors reserved for highlighting items that need attention?
  • Are visual borders too dark or intense? If every chart has a border that contrasts highly from the background, it can take the focus away from the chart and impede the visual flow. We often don’t need to use borders at all because we can use whitespace for visual separation.
  • Does the chart use jargon or acronyms that are unfamiliar to your intended audience? Try to spell out words, add explanatory text, and/or include navigation links/buttons to a glossary to reduce the amount of effort it takes to understand the report.

Visual components check – across the page:

  • If your report contains multiple slicers, are they formatted and positioned consistently?
  • Are items on the page that are located close to each other related? Proximity suggests relationships.
  • Are colors used within and across the page easily distinguishable?
  • Are fonts used consistently, with only purposeful deviations?
  • If charts should be compared, are the axis scales set to facilitate a reasonable comparison?
  • Does the interactivity between elements on the page provide useful information?
  • Are visuals appropriately aligned? Misalignment can be distracting.

Accessibility Check

I have a more comprehensive accessibility checklist on my blog that I keep updated as new accessibility features and tools are released. Below are some important things you can check to ensure your report can be read by those with different visual, motor, and cognitive conditions.

  • Do text and visual components have sufficient color contrast (generally, 4.5:1 for text and 3:1 for graphical components)?
  • Is color used as the only means of conveying information?
  • Is tab order set on all non-decorative visuals in each page? Decorative items should be hidden in tab order.
  • Has alt text been added to all non-decorative items on the page?
  • Is key information only accessible through an interaction? If so, you may consider rearranging your visuals so they are pre-filtered to make the important conclusion more obvious.
  • If you try navigating your report with a keyboard, is the experience acceptable for keyboard-only users? Does accessing important information require too many key presses? Are there interactive actions that can’t be performed using a keyboard?
  • Do you have any video, audio, or animation that auto-plays or cannot be controlled by the report user?

Microsoft Technologies, Power BI, Power Query

Generating Unicode Characters in Power Query

You may have used the UNICHAR() function in DAX to return Unicode characters in DAX measures. If you haven’t yet read Chris Webb’s blog post on the topic, I recommend you do. But did you know there is a Power Query function that can return Unicode characters? This can be useful in cases when you want to assign a Unicode character to a categorical value.

In Power Query, you can use Character.FromNumber to return a Unicode character based upon the specified number.

I recently used this function in a Workout Wednesday for Power BI exercise to visualize music notes. I made a scatterplot that used eighth notes and eighth rests as the markers.

To create an eighth note (𝅘𝅥𝅮), I added Character.FromNumber(9834) to my Power Query expression. The eighth rest is Character.FromNumber(119102). There are many music-related Unicode characters available.

With the Character.FromNumber function, we can make a custom column that uses an If statement to determine the correct character to use for each row in a table.
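
For example, a custom column expression along these lines would assign the right character per row (the NoteType column and its values are just illustrative):

= if [NoteType] = "note" then Character.FromNumber(9834) else Character.FromNumber(119102)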

One thing to note is that the query preview in the Power Query Editor doesn’t always render the Unicode values correctly. This doesn’t mean it won’t work in your report, though.

For instance, Unicode character 119102 doesn’t look like the eighth rest when viewed in the Power Query preview.

The question mark in a box represents a unicode character that could not be properly rendered

But after applying changes and sending the data into the dataset, you’ll see your expected output in the Data view in Power BI Desktop.

Output of the custom column containing Unicode characters in the Data view in Power BI Desktop

The Unicode character does show correctly in Power Query Online when creating a dataflow.

While DAX is the answer for generating music notes or emojis based upon aggregated values, Power Query can help you generate a Unicode value per row in a table by using the Character.FromNumber function.

Microsoft Technologies, Power BI

Log in to Power BI Desktop as an External (B2B) User

I noticed Adam Saxton post a tip on the Guy in a Cube YouTube channel about publishing reports from Power BI Desktop for external users. According to Microsoft Docs (as of June 21, 2022), you can’t publish directly from Power BI Desktop to an external tenant. But Adam shows how that is now possible thanks to an update in Azure Active Directory.

To sign into another tenant in Power BI Desktop:

  1. Click the Sign in button.
  2. Enter your email address (i.e., username@domain) and click Continue.
  3. When the sign-in dialog appears, select Use another account.
  4. Select sign-in options.
  5. Select Sign in to an organization.
  6. Enter the organization’s domain (e.g., mycompany.org) and select Next.
  7. You will then be shown a list of users with your user signed in to the external tenant. Select that user.

Then you will be signed in with your B2B user to the external tenant. This allows you to build “thin” reports off of shared datasets as well as publish to PowerBI.com from Power BI Desktop!

It would still be better for usability if Power BI had a tenant switcher similar to the one in Azure Data Factory or the Azure Portal, but this works perfectly fine in the meantime.

A couple of minor gaps

As Adam notes in his video, sensitivity labels don’t get applied if you publish from Power BI Desktop using an external user.

Also, once you publish, the link in the success dialog won’t work because it doesn’t contain the CTID in the URL. If I click on the link labeled “Open ‘WoW2022 Week 24.pbix’ in Power BI” in the screenshot below, it will try to open the link as if the report were in my tenant.

Dialog in Power BI desktop that indicates that publishing was successful. It says "Success! Open 'WoW2022 Week 24.pbix' in Power BI". The name of the file is a link to PowerBI.com.
Success dialog seen after successfully publishing from Power BI Desktop

That won’t work since the report is in another tenant, so next I’ll get the error “Sorry, we couldn’t find that report”. You can go to the external tenant (by including the CTID in your url) and see your report is published, so don’t worry too much about this.
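
For reference, the URL pattern with the tenant ID included looks something like this (the GUID is a placeholder for the external tenant’s ID):

https://app.powerbi.com/?ctid=00000000-0000-0000-0000-000000000000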

This tip makes life a bit easier for me when I’m publishing multiple times in a row from Power BI Desktop. It’s fewer clicks than using the Upload a file dialog in PowerBI.com. I hope Microsoft will round out the B2B features so all links generated by Power BI have the CTID in the URL, and maybe add the tenant switcher, too.

Accessibility, Data Visualization, Power BI

Viridis color palettes in Power BI theme files

I am a fan of the viridis color palettes available in python and R, so I decided to make Power BI theme files for each of the 4 color maps (viridis, inferno, magma, plasma). These color palettes are not only lovely to look at, they are colorblind/CVD friendly and perceptually uniform (or close to it).

The screenshots below show the colors you’ll get when you use my theme files.

Viridis

The Power BI color picker for data colors in a column chart. It shows white, black, and then the 8 colors from the viridis color palette, which range from dark purple to blue to green.
Viridis theme colors in Power BI

Plasma

The Power BI color picker for data colors in a column chart. It shows white, black, and then the 8 colors from the plasma color palette, which range from dark purple to pink to yellow.
Plasma theme colors in Power BI

Magma

The Power BI color picker for data colors in a column chart. It shows white, black, and then the 8 colors from the magma color palette, which range from dark purple to pink to orange.
Magma theme colors in Power BI

Inferno

The Power BI color picker for data colors in a column chart. It shows white, black, and then the 8 colors from the inferno color palette, which range from dark purple to red to yellow.
Inferno theme colors in Power BI

I generated a palette of 10 colors and then dropped the darkest and lightest colors to help you get good color contrast without inadvertently highlighting a data point. I chose to use the second darkest of the 8 colors as the first/main color, which should work well on light backgrounds.

You’ll also notice that I have set in the theme the minimum, center, and maximum colors for use in a diverging color palette. This diverging palette includes the darkest and lightest color in an effort to give you a wider scale.
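
In case you’re curious about the file format, a Power BI theme is just a small JSON document along these lines. The hex values below are illustrative viridis-style placeholders rather than the exact values in my theme files.

{
  "name": "Viridis (illustrative)",
  "dataColors": ["#46327e", "#365c8d", "#277f8e", "#1fa187", "#4ac16d", "#a0da39", "#d0e11c", "#fde725"],
  "minimum": "#440154",
  "center": "#21918c",
  "maximum": "#fde725"
}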

Give the themes a try

If you don’t enjoy choosing colors and just want something that looks good, feel free to hop over to the Github project and download the JSON files. You can learn more about the method I used to choose the colors and my suggestions for usage in the project documentation.

If you do use the themes, feel free to let me know how they worked and if you have suggestions for improvements.

Microsoft Technologies, Power BI, Power Query

Calling the Intercom API with Power Query and Refreshing in the Power BI Service

I needed to pull some user data for an app that uses Intercom. While I will probably import the data using Data Factory or a function in the long term, I needed to pull some quick data in a refreshable manner to combine with other data already available in Power BI.

I faced two challenges in getting this code to work:

  1. Intercom’s API uses cursor-based pagination when retrieving contacts
  2. I needed this query to be refreshable in PowerBI.com so I could schedule a daily refresh.

If you have worked with the Web.Contents function in Power Query, you may be familiar with all the various ways you can use it that aren’t supported in a refresh on PowerBI.com. Chris Webb’s blog is a great source for this info, if you find yourself stuck with a query that works in Power BI Desktop but not in the service.

The Intercom API

In version 2.4 of the Intercom API, users are a type of contact. You must make an HTTP GET call to https://api.intercom.io/contacts and pass an access token and Accept:application/json in the headers.

In the query string, you can specify the number of contacts per page by specifying per_page=x where x is a number less than or equal to 150. If you have more than 150 contacts, you will need to handle the cursor-based pagination.

When you make the initial call, the API will return a JSON object that contains a total count of contacts and a record for pages. Expanding the pages record shows the current page, the number of contacts per page, and the total number of pages.

type: pages
next: Record
page: 1
per_page: 150
total_pages: 3
Data returned in the Power Query Editor when retrieving contacts from the Intercom API

Expanding the next record gives you page 2 with a starting_after ID.

The cursor used for pagination in the Intercom API as retrieved using Power Query

To get the next page of contacts, the API call would be https://api.intercom.io/contacts?per_page=150&starting_after=[the starting_after ID listed above].

The Power Query Queries

This blog post from Gil Raviv gave me some ideas where to start, but the code in that blog will not refresh in PowerBI.com. You cannot put the iterations and the Web.Contents call and the generated list all in one query if you want to use scheduled refresh.

I ended up creating one query and two functions to accomplish my goal. The second function is optional, but you may find it useful as the time values in the API response are listed as Unix timestamps and you probably want to convert them to datetime values.

The first function contains the API call with an input parameter for the starting_after ID.

//Web API call function
(startafter) as record =>
let
    Source = Json.Document(Web.Contents("https://api.intercom.io/contacts", [Query=[per_page="150", starting_after=startafter], Headers=[Accept="application/json", Authorization="Bearer <your access token goes here>"]])),
    data = try Source[data] otherwise null,
    pages = Source[pages],
    ttlpages = pages[total_pages],
    nextkey = pages[next][starting_after],
    next = try nextkey otherwise null,
    res = [Data=data, Next=next, TotalPages=ttlpages]
in
    res

It’s important to make the url in the Web.Contents function be a static string. If it is concatenated or dynamic in any way, the query will not be able to refresh in the Power BI service. All the query string parameters can go in the Query arguments of the Web.Contents function. If you have multiple query string arguments, you can put them in brackets with a comma separating them. You can do the same with multiple headers.

This function attempts the API call and returns null if it encounters an error. Once the data is returned, it retrieves the specific JSON objects needed to return the data. The next two lines are used to retrieve the starting_after value. The function returns the contact data, starting_after value, and total_pages value.

I used the following query to call that function.

let
GeneratedList = List.Generate(
   ()=>[i=0, res = fnGetOnePage("")],
   each [res][Data]<>null,
   each [i=[i]+1, res = fnGetOnePage([res][Next])],
   each [res][Data]),
    #"Converted to Table" = Table.FromList(GeneratedList, Splitter.SplitByNothing(), null, null, ExtraValues.Error),
    #"Expanded Column1" = Table.ExpandListColumn(#"Converted to Table", "Column1"),
    #"Expanded Column2" = Table.ExpandRecordColumn(#"Expanded Column1", "Column1", {"id", "workspace_id", "external_id", "role", "email", "name", "has_hard_bounced", "marked_email_as_spam", "unsubscribed_from_emails", "created_at", "updated_at", "signed_up_at", "last_seen_at", "last_replied_at", "last_contacted_at", "last_email_opened_at", "last_email_clicked_at", "browser", "browser_version", "browser_language", "os", "location", "custom_attributes", "tags", "notes", "companies", "opted_out_subscription_types"}, {"id", "workspace_id", "external_id", "role", "email", "name", "has_hard_bounced", "marked_email_as_spam", "unsubscribed_from_emails", "created_at", "updated_at", "signed_up_at", "last_seen_at", "last_replied_at", "last_contacted_at", "last_email_opened_at", "last_email_clicked_at", "browser", "browser_version", "browser_language", "os", "location", "custom_attributes", "tags", "notes", "companies", "opted_out_subscription_types"}),
    #"Expanded location" = Table.ExpandRecordColumn(#"Expanded Column2", "location", {"type", "country", "region", "city", "country_code"}, {"location.type", "location.country", "location.region", "location.city", "location.country_code"}),
    #"Expanded custom_attributes" = Table.ExpandRecordColumn(#"Expanded location", "custom_attributes", {"role", "brand", "Subdomain"}, {"custom_attributes.role", "custom_attributes.brand", "custom_attributes.Subdomain"}),
    #"Removed Columns" = Table.RemoveColumns(#"Expanded custom_attributes",{"opted_out_subscription_types", "tags", "notes", "companies", "role"}),
    #"Changed Type" = Table.TransformColumnTypes(#"Removed Columns",{{"id", type text}, {"workspace_id", type text}, {"external_id", type text}, {"email", type text}, {"name", type text}, {"has_hard_bounced", type logical}, {"marked_email_as_spam", type logical}, {"unsubscribed_from_emails", type logical}, {"created_at", Int64.Type}, {"updated_at", Int64.Type}, {"signed_up_at", Int64.Type}, {"last_seen_at", Int64.Type}, {"last_replied_at", Int64.Type}, {"last_contacted_at", Int64.Type}, {"last_email_opened_at", Int64.Type}, {"last_email_clicked_at", Int64.Type}}),
    #"Replace Null with 0" = Table.ReplaceValue(#"Changed Type",null,0,Replacer.ReplaceValue,{"created_at","updated_at","signed_up_at","last_replied_at","last_seen_at","last_contacted_at","last_email_opened_at","last_email_clicked_at"
}),
    #"Invoked Custom Function" = Table.AddColumn(#"Replace Null with 0", "created_at_dt", each fnUnixToDateTime([created_at])),
    #"Invoked Custom Function1" = Table.AddColumn(#"Invoked Custom Function", "updated_at_dt", each fnUnixToDateTime([updated_at])),
    #"Invoked Custom Function2" = Table.AddColumn(#"Invoked Custom Function1", "signed_up_at_dt", each fnUnixToDateTime([signed_up_at])),
    #"Invoked Custom Function3" = Table.AddColumn(#"Invoked Custom Function2", "last_seen_at_dt", each fnUnixToDateTime([last_seen_at])),
    #"Invoked Custom Function4" = Table.AddColumn(#"Invoked Custom Function3", "last_replied_at_dt", each fnUnixToDateTime([last_replied_at])),
    #"Invoked Custom Function5" = Table.AddColumn(#"Invoked Custom Function4", "last_contacted_at_dt", each fnUnixToDateTime([last_contacted_at])),
    #"Invoked Custom Function6" = Table.AddColumn(#"Invoked Custom Function5", "last_email_opened_at_dt", each fnUnixToDateTime([last_email_opened_at])),
    #"Invoked Custom Function7" = Table.AddColumn(#"Invoked Custom Function6", "last_email_clicked_at_dt", each fnUnixToDateTime([last_email_clicked_at])),
    #"Fix Null Values" = Table.ReplaceValue(#"Invoked Custom Function7",DateTime.From("1970-01-01"),null,Replacer.ReplaceValue,{"created_at_dt","updated_at_dt","signed_up_at_dt","last_replied_at_dt","last_seen_at_dt","last_contacted_at_dt","last_email_opened_at_dt","last_email_clicked_at_dt"
}),
    #"Removed Columns1" = Table.RemoveColumns(#"Fix Null Values",{"created_at", "updated_at", "signed_up_at", "last_seen_at", "last_replied_at", "last_contacted_at", "last_email_opened_at", "last_email_clicked_at"})
in
    #"Removed Columns1"

The List.Generate call generates the rows I need to call my first function. It sets an iterator variable to 0 and then calls the function, which returns the first page of data along with the total pages and the starting_after ID. As long as data is returned, it makes the API call again with the previously returned starting_after ID. This creates a list of lists that can be converted into a table of records. Then the records can be expanded to fields.

I expanded several columns out into multiple columns. Then I adjusted the data types of my columns to the correct types (they came back as Any data type).

There were several columns that contained Unix timestamps. All of the custom function calls are returning the datetime version of those values. I needed to handle null values in the timestamp conversion, so I replaced all the null timestamps with 0, converted them, and then converted the datetime value of 1-Jan-1970 back to null. I did the replace for all 7 columns in one step. Then I removed the original columns that contained the Unix timestamp as they were not analytically relevant for me.

Below is my Unix timestamp to datetime conversion function.

(UnixTime as number) as datetime =>
let
    DT = #datetime(1970, 1, 1, 0, 0, 0) + #duration(0, 0, 0, UnixTime)
in
    DT

Advice and warnings

When using an API key that must be passed in the headers, it is safest to use a custom connector. Otherwise, you have to embed your API key in the M code, as shown above. When the query is sent to Intercom, it is encrypted using HTTPS. But anyone that opens your PBIX file would have access to it. Your key will be captured in the Power BI logs. And anyone that can manage to intercept your web request and decrypt your traffic would have access to it. This is not ideal. But creating a custom connector requires more advanced code and a gateway to make it usable in the Power BI service. With either option, you will choose Anonymous authentication for the data source.

Be sure to use the RelativePath and Query options in the Web.Contents call. This is necessary to make the query refreshable in the service. The value passed to the first parameter of Web.Contents must be a static string and must be valid in itself (no errors returned).
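
If part of the URL path varies, a minimal sketch of that pattern (a variation on the function above, not exactly what I used) looks like this: the static base URL goes in the first parameter, and the changing pieces go in the RelativePath and Query options.

Source = Json.Document(
    Web.Contents(
        "https://api.intercom.io",
        [
            RelativePath = "contacts",
            Query = [per_page = "150"],
            Headers = [Accept = "application/json", Authorization = "Bearer <your access token goes here>"]
        ]
    ))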

After publishing your report, you’ll need to set the credentials for the dataset before you can refresh it. Be sure to check the Skip test connection box. Otherwise, your credentials update will fail.

url: https://api.intercom.io/contacts
Authentication method: anonymous
Privacy level: organizational 
Skip test connection: yes
The Skip test connection option for the web data source in the dataset settings in PowerBI.com

Even though we are using anonymous authentication, you can still choose an Organizational privacy level.

In my case, I wanted only app users from the Intercom contacts list, so I needed to filter the results on role = “user”, since contacts also include leads. The Role attribute was in the custom attributes returned by the API.

It took a bit of experimentation to get this working, so I thought I would share what I found to work. As always, Power BI is constantly changing, and a better way to do this may be available in the future.

If you would like this development experience to improve, there are some Power BI Ideas related to this that you can vote for.

Azure, Microsoft Technologies, SQL Server

Check if File Exists Before Deploying SQL Script to Azure SQL Managed Instance in Azure Release Pipelines

I have been in Azure DevOps pipelines a lot recently, helping clients set up automated releases. Many of my clients are not in a place where automated build and deploy of their SQL databases makes sense, so they deploy using change scripts that are reviewed before deployment.

We chose release pipelines over the YAML pipelines because it was easier to manage for the client and pretty quick to set up. While I had done this before, I had a couple of new challenges:

  • I was deploying to an Azure SQL managed instance that had no public endpoint.
  • There were multiple databases for which there may or may not be a change script to execute in each release.

This took a bit longer than I expected, and I enlisted my coworker Bill to help me work through the final details.

Setup

There are a few pre-requisites to make this work.

Self-hosted Agent

First, you need to run the job on a self-hosted agent that is located on a virtual machine that has access to the managed instance. This requires downloading and installing the agent on the VM. We ended up installing this on the same VM where we have our self-hosted integration runtime for Azure Data Factory, at least for deployment from dev to test. This was ok because the VM had enough resources to support both, and deployments are not long or frequent.

When you create a new self-hosted agent in Azure DevOps, a modal dialog appears, which provides you with a link to download the agent and a command line script to install and configure the agent.

Dialog from Azure DevOps titled "Get the agent". There are options for Windows, macOS, and Linux. There is a button to copy a link to download the agent. Then there is a set of command line scripts to create the agent, and configure the agent.
Dialog showing the instructions to download and install the Azure Pipelines Self-hosted agent

We configured the agent to run as a service, allowing it to log on as Network Service. We then could validate it was running by opening the Services window and looking for Azure Pipelines Agent.

You must add an inbound firewall rule for the VM to allow the traffic from the IP addresses related to dev.azure.com. This can be done in the Azure portal on the Networking settings for the VM.

Then you must install the SqlServer module for PowerShell on the VM. And you must also install sqlpackage on the VM. We also updated the System Path variable to include the path to sqlpackage (which currently gets installed at C:\Program Files\Microsoft SQL Server\160\DAC\bin).

DevOps Permissions

Azure DevOps has a build agent with a service principal for each project. That service principal must have Contribute permissions on the repo in order to follow my steps.

Database Permissions

If you want the Azure DevOps build agent to execute a SQL script, the service principal must have appropriate permissions on the database(s). The service principal will likely need to create objects, execute stored procedures, write data, and read data. But this depends on the content of the change scripts you want to execute.

The Pipeline

In this project we had one pipeline to release some analytics databases and Azure Data Factory assets from dev to test.

Therefore, we included the latest version of the repos containing our database projects and data factory resources as artifacts in our pipeline. While the database objects and data factory resources were related, we decided they weren’t dependent upon each other. It was ok that one might fail and one succeed.

Azure DevOps release pipeline with 2 artifacts (database code and ADF code) and 2 stages (Database deployment and ADF Deployment).

Side note: I love the ADF Tools extension for Azure DevOps from SQL Player. It is currently my preferred way to deploy Azure Data Factory. I find it is a much better experience than ARM template deployment. So that is what I used here in this ADF Deploy job.

Repo Setup

Getting back to databases, we had 3 databases to which we might want to deploy changes. We decided that we would set a standard of having one change script per database. We have an Azure DevOps repo that contains one solution with 3 database projects. These databases are highly related to each other, so the decision was made to keep them all in one repo. I added a folder to the repo called ChangeScripts. ChangeScripts contains two subfolders: Current and Previous. When we are ready to deploy database changes, we add a .sql file with the name of the database into the Current folder. For any deployment, there may be 1, 2, or 3 change scripts (1 for each database that has changes). When the release pipeline runs, the change scripts in the Current folder are executed and the files are renamed and moved into the Previous folder. This is why my DevOps service principal needed Contribute permissions on the repo.

Here are the tasks in my Database Deployment job:

  1. Check if SQL Scripts exist in the expected location.
  2. If a script exists for Database1, execute the script.
  3. If a script exists for Database 2, execute the script.
  4. If a script exists for Database 3, execute the script.
  5. Rename the change scripts to fit the pattern [DatabaseName][yyyyMMddHHmm].sql and move them to the Previous folder.

Configuring the Tasks

The first thing we must do is set the agent job to run using the agent pool containing the self-hosted agent. There is another important setting under Additional options: Allow scripts to access the OAuth token. This allows the service principal to authenticate and execute the SQL scripts, so make sure that is checked.

My first task, called Check for SQL Scripts, is a PowerShell task. I used an inline script, but you can also reference a file path that is accessible by the agent. My script looks for the three files and then sets a pipeline variable to true or false to indicate the presence of each file. These variables are used later in the process.

My task is configured with the working directory set to $(System.DefaultWorkingDirectory)/_Databases/ChangeScripts/Current. There are no environment variables or output variables.

Here’s my PowerShell script:

$DB1Script = "Database1.sql"
$DB2Script = "Database2.sql"
$DB3Script = "Database3.sql"

$IsDB1Script = Test-Path -Path $DB1Script -PathType Leaf
$IsDB2Script =  Test-Path -Path $DB2Script -PathType Leaf
$IsDB3Script = Test-Path -Path $DB3Script -PathType Leaf

Write-Host $IsDB1Script  $IsDB2Script $IsDB3Script

Write-Output ("##vso[task.setvariable variable=DB1Script;]$IsDB1Script")

Write-Output ("##vso[task.setvariable variable=DB2Script;]$IsDB2Script")

Write-Output ("##vso[task.setvariable variable=DB3Script;]$IsDB3Script")

We got stuck on the syntax for setting variables for a while. Most of the things we found in Microsoft Docs and on blogs were close, but didn’t work. Finally, we found this blog, which showed us the syntax that worked – be careful with the quotes, semi-colon, and parentheses.

My second task executes the script for Database1. It uses the Azure SQL Database deployment task. When you use this task, you can specify whether it deploys a dacpac or executes a SQL script. We chose to execute a SQL script that we stored in the same repo as the database projects. For this task, you must specify the Azure Subscription, the target server, and the target database. I set my authentication type to Service Principal, since I had granted access to the Azure DevOps service principal during setup. My Deploy type was set to SQL Script File. And my SQL Script location was set to $(System.DefaultWorkingDirectory)/_Databases/ChangeScripts/Current/Database1.sql. An important configuration on this task is to set Run this task to Custom Conditions. I specified my custom condition as eq(variables['DB1Script'], 'True'). This tells the task to execute only if the variable we set in the previous task is equal to True. It seems data types don’t really matter here – I set the variable in the PowerShell script as a boolean, but it was converted to a string in Azure DevOps.

My third and fourth tasks do the same activities for Database2 and Database3.

My last task, called Move Change Scripts to Previous Folder, is another inline PowerShell script. The script uses the variables set in the first task to determine if the file exists and needs to be moved. Because the script is working on a local copy on the VM, it must commit the changes back to the git repo after renaming and moving the file(s). The Working Directory for this task is set to $(System.DefaultWorkingDirectory)/_Databases/ChangeScripts.

Here’s the code:

# Write your PowerShell commands here.

Write-Host "Getting Files"
$path = Get-Location
Write-Host $path


$File1 = "Current/Database1.sql"
$File2 = "Current/Database2.sql"
$File3 = "Current/Database3.sql"

$IsDB1Script = Test-Path -Path $File1 -PathType Leaf
$IsDB2Script = Test-Path -Path $File2 -PathType Leaf
$IsDB3Script = Test-Path -Path $File3 -PathType Leaf

$DateString = Get-Date -Format "yyyyMMddHHmm"

$NewFile1 = "Previous/Database1$($DateString).sql"
$NewFile2 = "Previous/Database2$($DateString).sql"
$NewFile3 = "Previous/Database3$($DateString).sql"


 Write-Host "Changing file locations"
    git config --global user.email "myemail@domain.com"
    git config --global user.name "My Release Pipeline"

   
if($IsDB1Script) 
{
    $File1 |  git mv ($File1) ($NewFile1) 
    Write-Host "Moved Database1 script"
}

if($IsDB2Script)
{
    $File2 |  git mv ($File2) ($NewFile2) 
  Write-Host "Moved Database2 script"
}

if($IsDB3Script)
{
  $File3 |  git mv ($File3) ($NewFile3) 
  Write-Host "Moved Database3 script"
}

git commit -a -m 'Move executed change scripts to Previous Folder'
git push origin HEAD:master

Anything you output from your PowerShell script using Write-Host is saved in the release logs. This is helpful for troubleshooting if something goes wrong.

As you can see above, you can execute git commands from PowerShell. This Stack Overflow question helped me understand the syntax.

And with those five steps, I’m able to execute change scripts for any or all of my three databases. We don’t anticipate adding many more databases, so this configuration is easily manageable for the situation. I wouldn’t do this (one step per database) for a situation in which I have dozens or hundreds of databases that may be created or deleted at any time.