Azure, Databricks, Microsoft Technologies, Power BI, Power Query, Unity Catalog

Power Query ODBC bug affecting date calculations

I was working on an imported Power BI semantic model, adding some fiscal year calculations to my date table. The date table was sourced from a view in Databricks Unity Catalog. I didn’t have access to add more fields to the view, so I was adding the fields in Power Query first, with plans to request they be added to the view in the future. I got some unexpected results, which turned into a bug being logged against the ODBC code used by Power Query.

If you are only analyzing data in the last 20 years, you won’t see this bug. But if you are doing long-term analysis including years before 2000, you might just run into it.

The Scenario

I encountered the bug when I was adding fiscal start of year and fiscal end of year columns. I used the following formula:
Date.AddMonths(Date.StartOfYear([The Date]), FiscalYearEndMonth - 12)

For the current year, I was getting correct results. For example, my fiscal year end month was 8, so for the date 1 April 2024, I was expecting the result to be 1 September 2023. But for years prior to 2000, the results were off by one year for every date in the year except the first day of the year.

I isolated the issue to the Date.StartOfYear function. Then I created a simplified reproduction by making a view with a column of dates using the date data type in Unity Catalog. I populated the view with 6 dates, as shown below. I then created a custom column with the formula Date.StartOfYear([TestDate]).

A table in the Power Query Editor showing that it calculated the start of year for 2 Jan 1990 as 1 Jan 1991.
My Power Query custom column added to a table sourced from Databricks with 6 rows of dates producing incorrect results

Power Query could correctly calculate the year, but it could not calculate the first or last date of the year.

I tried a couple of variations on this test. Changing the data type of the TestDate column to timestamp in Unity Catalog leads to a Date/Time column in Power Query, which produces the same incorrect results. I then tried the same calculations with a view built in Azure SQL Database. This produced correct results! Manually pasting data into a table also produced correct results.

I sent this info to Microsoft, and they confirmed there is a bug in the ODBC code used to query Databricks related to calculating differences in dates. Because this custom column folded back into the query executed in Databricks, the results were calculated in the Databricks engine rather than the Power Query engine.

Issue Summary

Where: Power Query ODBC code used to query Databricks

What’s affected: A handful of date calculations including Date.StartOfYear and Date.EndOfYear when used on date values prior to 1 Jan 2000.

Impact: Low – most people are not doing analysis on data prior to 2000 using an imported table in a Power BI semantic model with a Databricks source.

Workarounds

I found 3 workarounds that can be used:

  1. Move the calculation further upstream into Databricks (this is generally a good idea and follows Roche’s Maxim of Data Transformation).
  2. Use the Table.StopFolding function to keep the date calculation from being performed in the source engine.
  3. Change the data type of the date column from the source data to Date/Time/Timezone in Power Query, then perform the date calculations.

Option 1 is the best solution, but we don’t always have control over our source data. If you can update your source table/view, please do that and don’t bother worrying about this bug.
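For example, if you can modify the view, the logic from the Power Query formula above can be mirrored directly in Databricks SQL. Here is a minimal sketch; the catalog, schema, view, and column names are placeholders, and 8 is the fiscal year end month from my example.

CREATE OR REPLACE VIEW main.default.date_dim_vw AS
SELECT
  the_date,
  -- same logic as the Power Query formula: start of calendar year, shifted back
  add_months(trunc(the_date, 'YEAR'), 8 - 12) AS fiscal_start_of_year
FROM main.default.date_dim;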

Option 2 requires you to use the Advanced Editor to modify your M code, but it’s a quick 1-minute change. Add a new line after importing the data but before performing the date calculation. Reference the output from the previous step in the Table.StopFolding function. Change the subsequent line to reference the output of your new step.

let
    Source = Databricks.Catalogs("adb-6021692154917656.16.azuredatabricks.net", "sql/protocolv1/o/6021692154917656/1202-172413-pp03vcis", [Catalog = "", Database = ""]),
    main_Database = Source{[Name="main",Kind="Database"]}[Data],
    default_Schema = main_Database{[Name="default",Kind="Schema"]}[Data],
    datetest_View = default_Schema{[Name="datetest",Kind="View"]}[Data],
    PreventQueryFolding = Table.StopFolding(datetest_View), 
    #"Inserted Start of Year" = Table.AddColumn(PreventQueryFolding, "Start of Year", each Date.StartOfYear([TestDate]), type date),
    #"Inserted Year" = Table.AddColumn(#"Inserted Start of Year", "Year", each Date.Year([TestDate]), Int64.Type)
in
    #"Inserted Year"
Example of using the Table.StopFolding function to work around the date calculation issue

Option 2 works fine on small tables, but I wouldn’t advise disabling query folding on a very large table just to make the date math work. Hopefully, you have a nice star schema with a (somewhat small) date dimension instead of trying to do the date math in a large fact table or a large single-table model.

If you don’t feel comfortable using the Advanced Editor, changing the data type of the source date column from date or date/time to date/time/timezone in Power Query (before you do the date calculations) will achieve the same result. You can change the data type back in a subsequent step after performing the date calculations. This is my least favorite option, but it does achieve the goal and is probably more beginner-friendly than the other options. Once again, this is not ideal on a large table.
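For reference, workaround 3 expressed in M looks roughly like the steps below when slotted into the query shown earlier. The step names are illustrative, and Date.From is used to land back on a date value.

    ChangedToDTZ = Table.TransformColumnTypes(datetest_View, {{"TestDate", type datetimezone}}),
    #"Inserted Start of Year" = Table.AddColumn(ChangedToDTZ, "Start of Year", each Date.From(Date.StartOfYear([TestDate])), type date),
    ChangedBackToDate = Table.TransformColumnTypes(#"Inserted Start of Year", {{"TestDate", type date}})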

Conclusion

As I stated above, this bug is a bit obscure, only affecting older dates in data sourced from non-SQL Server ODBC connections. But I wanted to share some workarounds on the off chance someone else encountered it.

If you have run into this, I’d love to hear about it in the comments.

Accessibility, Data Visualization, Microsoft Technologies, Power BI

Your gradient fill bar charts in Power BI have poor color contrast, but you can fix them

Since conditional formatting was released for Power BI, I have seen countless examples of bar charts that have a gradient color fill. If you aren’t careful about the gradient colors (maybe you just used the default colors), you will end up with poor color contrast. Luckily there are a couple of quick (less than 30 seconds for most people to implement) fixes that can improve your color contrast.

Check out this video to get a more detailed explanation of the problem and potential solutions.

Meagan’s 10-minute video on color contrast issues with gradient color fill in Power BI bar charts shows how to identify and resolve the color contrast issues

If you aren’t the type to watch a longer video, here are my tips:

  1. If you are using the same measure for both the gradient and the bar length, consider whether you even need to have the gradient coloring. It’s not adding information.
  2. If you do need/want the gradient, either because you are encoding a different measure or you just can’t bear to part with it, consider adding a border around the individual bars. Make sure the border color has enough contrast with both the bar colors and the background to meet the desired color contrast ratio (usually 3:1).

Were you aware we can now put borders on the individual bars in our bar charts? If you’d like some practice with this formatting option, check out the Workout Wednesday for Power BI exercise from 2023 Week 52.

DAX, Microsoft Technologies, Power BI

Switching between different active physical relationships in a Power BI model

A couple of weeks ago, I encountered a DAX question that I had not previously considered. The person asking had a situation where there were two paths between two tables: one direct relationship between a fact and a dimension, and another that went through a different dimension and a bridge table. This could happen in many scenarios:

  • Sales: There is a salesperson who is directly responsible for a sale. Sales leadership in various positions also get credit for that sale based upon the client or region to which the sale is associated, and this may logically be a many:many relationship (multiple people get credit for the same sale and a person gets credit for multiple sales).
  • Consulting: Hours are worked and billed by a consultant, but the team lead(s) or tech lead(s) associated with each project also get some sort of credit or responsibility for all hours or invoices on a project.
  • Law firms: An attorney works directly on a matter and bills for that time. Billing attorneys approve the timesheets or other attorneys get some sort of credit/commission on the work based upon a relationship to the matter.

This scenario creates a data model similar to the below (I went with the sales example here). There is a direct relationship between Sales and Employee. On the sale, that related employee is the person who directly made the sale. But there are others who are associated with that sale because they have some sort of ownership of the client relationship. In this situation, it’s not just the manager of the salesperson – there could be a matrix where several people are involved. To accommodate this, we have a bridge table that relates the client dimension to the employee and role dimensions. The role dimension might define the role as sales region lead or sales VP or sales product lead, and there might be multiple people in each role.

a Power BI model with 5 tables: role, bridge, client, employee, and sales. Role is related to Bridge. Bridge is related to client and employee. Employee is related to Sales. Client is related to Sales.

It’s very common to have role-playing dimensions, where the fact table has multiple columns that can relate to the target dimension. We set one as active and the other as inactive, and we can then use the USERELATIONSHIP() function with CALCULATE().
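For reference, that familiar pattern looks something like the measure below. The Ship Date column and Date table here are hypothetical and not part of the model in this post; they just illustrate the inactive-relationship case.

Sales by Ship Date =
CALCULATE (
    SUM ( 'Sales'[Sale Amount] ),
    USERELATIONSHIP ( 'Sales'[Ship Date Key], 'Date'[Date Key] )
)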

But this situation is different because it’s not about two direct relationships between two tables. In one path, we are going directly between Sales and Employee. In another path, we go from Sales through Client and Bridge to Employee. As you can see in the image, Power BI allows both of these relationships to be active at the same time. We can still use DAX to switch between the two, but we need to use different functions.

The Data

In my sales table, I have 4 rows:

a table showing 4 rows of data with columns: Sales Worker Emp ID, Sale Amount, Client. There are 2 clients and two employees in the data.
Sales fact table in the Power BI model

There are 4 sales. Two sales are related to client 1 and two are related to client 2. Employee 1 made $15 in sales. Employee 2 made $20 in sales.

By default, if I create a table with the sales amount and the employee, it will use the direct relationship between Sales and Employee.

A table visual that shows Alice has $15 in sales and Bob has $20 in sales.
Table visual in Power BI that uses the direct relationship between Employee and Sales tables

Employee 3 and Employee 4 have relationships to both Client 1 and Client 2 in the bridge table.

The bridge table contains 4 rows of data. Employees 3 and 4 are related to both clients 1 and 2.
Bridge table relating employees to clients in Power BI

The bridge table has a bidirectional relationship with the Client table, and the Employee table has a single-direction 1:many relationship with the bridge table. When we use this path between the tables, we expect Employee 3 and Employee 4 to each have associated sale amounts of $35.

A table visual showing Sue and Jerry both having $35 in sales.
Table visual in Power BI that uses the bridge table

The DAX

I originally wasn’t sure if this could be done. I tried using TREATAS() to switch relationships, but that wasn’t enough to ignore a physical relationship. But I asked around and Ed Hansberry provided a solution using CROSSFILTER().

I started by creating a base measure:

Amt = Sum('Sales'[Sale Amount])

I created the following measure to use the bridge table:

Amt Bridge = CALCULATE([Amt],CROSSFILTER('Employee'[Employee ID], 'Sales'[Sales Worker Emp ID], None))

While I could have just used the [Amt] measure for the direct relationship, I tested a measure that is more explicit in ignoring the relationship between the Employee and Bridge tables.

Amt Dir = CALCULATE([Amt], CROSSFILTER('Bridge'[Employee ID], Employee[Employee ID], NONE))

Again, this produces the same result as just using the original [Amt] measure.

CROSSFILTER() was exactly what I needed to accomplish my goal. Setting the CrossFilterType argument to None allows me to ignore an existing active physical relationship when used as a filter inside of CALCULATE().

The two table visuals shown next to each other where the direct relationship to sales and the indirect relationship to sales are shown on one report page in Power BI.
Azure, Azure Data Factory, Azure SQL DB, Azure Storage, Microsoft Technologies, PowerShell

Update Azure SQL database and storage account public endpoint firewalls with Data Factory IP ranges

While a private endpoint and vNets are preferred, sometimes we need to configure Azure SQL Database or Azure Storage to allow use of public endpoints. In that case, an IP-based firewall is used to prevent traffic from unauthorized locations. But Azure Data Factory’s Azure Integration Runtimes do not have a single static IP. So how do we keep the firewall updated so that ADF can access these resources?

One option is to run everything through a self-hosted integration runtime, which can have a static public IP. But that also means you have to install (and keep updated) a Java SDK if you are converting data to Parquet or ORC files.

Another option is to obtain the IP range list published by Microsoft and update the firewall. That is what I did for a recent project.

The list of IP ranges is published weekly as a JSON file. New ranges appearing in the file will not be used in Azure for at least one week. The file name is in the format “ServiceTags_Public_<YYYYmmdd>.json”. It appears that the date in the file always references the Monday of the week of publication.

This means we can use an automation tool to get the information we want from the file each week and update our firewall rules. The Az PowerShell library makes this task achievable with a few lines of code.

Modifying firewall IP ranges with PowerShell

The full script is here. The following is an explanation of each section.

To make my PowerShell script more reusable, I added parameters.

# Input bindings passed in via param block

Param
(
    [Parameter (Mandatory= $false)]
    [String] $storageRG = "storageRG",

    [Parameter (Mandatory= $false)]
    [String] $sqlRG = "sqlRG",

    [Parameter (Mandatory= $false)]
    [String] $storageacctname = "stgacct",

    [Parameter (Mandatory= $false)]
    [String] $sqlservername = "sqldev",

    [Parameter (Mandatory= $false)]
    [String] $region = "EastUS",

    [Parameter (Mandatory= $false)]
    [String] $subscription = "mysub"
)

This script logs into Azure using the managed identity of the Azure Automation account. This account needs contributor access on the SQL Server in order to modify the firewall rules. It also needs Storage Account Contributor on the storage account to modify the firewall rules there.


# Import needed libraries
Import-Module Az.Storage
Import-Module Az.Sql
Import-Module Az.Resources

# Ensures you do not inherit an AzContext in your runbook
Disable-AzContextAutosave -Scope Process

# Connect to Azure with system-assigned managed identity
$AzureContext = (Connect-AzAccount -Identity).context

# Set and store context
$AzureContext = Set-AzContext -SubscriptionName $AzureContext.Subscription -DefaultProfile $AzureContext
$sub = Select-azSubscription -SubscriptionName $subscription
Write-Output "Connected to Azure"

To download the file containing the service tag IP ranges, we need to dynamically generate the correct file name and then call Invoke-RestMethod. Then we can filter the IP ranges down to only the Data Factory IPs in our specified region.

# Download the service tags for Azure services

$last_week = (Get-Date)

while ($last_week.DayOfWeek -ne 'Monday') {
$last_week = $last_week.AddDays(-1)
}

$last_week = $last_week.ToString("yyyyMMdd")

$url = 'https://download.microsoft.com/download/7/1/D/71D86715-5596-4529-9B13-DA13A5DE5B63/ServiceTags_Public_' + $last_week + '.json'

$ip_ranges = Invoke-RestMethod $url -Method 'GET'

# Filter to only ADF in specified region

$STagName = "DataFactory.${region}"

$address_prefixes = $ip_ranges.values `
| Where-Object {$_.name -eq $STagName} `
| Select-Object -ExpandProperty properties `
| Select-Object -ExpandProperty addressPrefixes

$address_prefixes = $address_prefixes `
| Where-Object { $_ -notmatch ":" } `
| ForEach-Object { @{ ipAddressOrRange = $_ } }

$address_prefixes
Write-Output "Latest IP ranges acquired"

The downloaded file lists IP ranges for all the service tags, both for overall Azure and for services in specific regions. The service tag we care about is called DataFactory.{region}. So if your data factory is in Central US, you would look for that service tag. Under DataFactory.CentralUS in the screenshot below, we see the array of addressPrefixes. This is the info we need.
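Trimmed down to just the fields the script uses, each entry in the file looks roughly like this (the address prefixes shown are placeholders):

{
  "values": [
    {
      "name": "DataFactory.CentralUS",
      "properties": {
        "addressPrefixes": [ "x.x.x.x/26", "x.x.x.x/28" ]
      }
    }
  ]
}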

I’m working under the assumption that everyone in the organization should be accessing the storage account and SQL server while on the corporate VPN, so I know that anything that doesn’t come from that IP range from ADF is unnecessary. So I’m deleting all the storage firewall rules and SQL firewall rules and then repopulating them.

#delete storage firewall rules

$SIPs = (Get-AzStorageAccountNetworkRuleSet `
   -ResourceGroupName $storageRG `
   -Name $storageacctname).IpRules.IPAddressOrRange

foreach ($IPR in $SIPs) {
Remove-AzStorageAccountNetworkRule `
   -ResourceGroupName $storageRG `
   -Name $storageacctname `
   -IPAddressOrRange $IPR
}
Write-Output "Storage firewall rules removed"


#delete sql firewall rules

$FRules = Get-AzSqlServerFirewallRule `
   -ResourceGroupName $sqlRG `
   -ServerName $sqlservername

foreach ($FRule in $FRules) {
Write-Output "Removing " + $Frule.FirewallRuleName
Remove-AzSqlServerFirewallRule -ServerName $sqlservername `
    -ResourceGroupName $sqlRG `
    -FirewallRuleName $Frule.FirewallRuleName
}
Write-Output "SQL firewall rules removed"

A couple of things to note: Storage account firewall rules do not have a name, and they accept CIDR notation. Azure SQL firewall rules have a name and expect an IPv4 range. When we delete storage account firewall rules, we must provide the resource group, storage account name, and IP address/range. When we delete SQL firewall rules, we must provide the resource group, server name, and firewall rule name. But the approach is the same for storage and SQL server: get the list of existing rules, loop through and delete each one.

To add the ADF IP ranges, I updated both the storage account and SQL server in the same ForEach loop. You can do them in separate loops if you would like to make sure you complete one and then the other.

$addrct = 0
foreach ($address_prefix in $address_prefixes.values) {

# Add rule to storage account firewall
Add-AzStorageAccountNetworkRule `
   -ResourceGroupName $storageRG `
   -Name $storageacctname `
   -IPAddressOrRange $address_prefix

# Add rule to sql server firewall
$addrct = $addrct + 1
$RuleName = "ADF Rule " + $addrct.ToString()

#Convert CIDR to IPV4 start and end
$StrNetworkAddress = ($address_prefix.split("/"))[0]
$NetworkIP = `
   ([System.Net.IPAddress]$StrNetworkAddress).GetAddressBytes()
[Array]::Reverse($NetworkIP)
$NetworkIP = ([System.Net.IPAddress]($NetworkIP `
    -join ".")).Address
$StartIP = $NetworkIP

If (($StartIP.Gettype()).Name -ine "double")
{
$StartIP = [Convert]::ToDouble($StartIP)
}
$StartIP = [System.Net.IPAddress]$StartIP

[int]$NetworkLength = ($address_prefix.split("/"))[1]
$IPLength = 32-$NetworkLength
$NumberOfIPs = ([System.Math]::Pow(2, $IPLength)) -1
$EndIP = $NetworkIP + $NumberOfIPs

If (($EndIP.Gettype()).Name -ine "double")
{
$EndIP = [Convert]::ToDouble($EndIP)
}
$EndIP = [System.Net.IPAddress]$EndIP

New-AzSqlServerFirewallRule -ResourceGroupName $sqlRG `
    -ServerName $sqlservername `
    -FirewallRuleName $RuleName `
    -StartIpAddress $StartIP.ToString() `
    -EndIpAddress $EndIP.ToString()
}

As stated earlier, the storage account firewall accepts CIDR notation, so we can quickly loop through each range we retrieved from the file and add it to the storage account firewall.

Then we need to convert each range to a start and end IP for SQL server. We get the first address in the CIDR range and make that our start IP. In many scripts you will see people add 1 to this address to get the range start, but I tested this, and it actually needs this first IP address to be present and not incremented by 1. To get the end IP, we get the number after the slash, subtract it from 32, take 2 to the power of that resulting number and subtract 1, and add it to the start address.
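For example, a hypothetical prefix of 20.37.64.0/26 would produce a start IP of 20.37.64.0 and an end IP of 20.37.64.63, since 32 - 26 = 6 and 2^6 - 1 = 63.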

Azure SQL Server firewall rules require a rule name, so I named mine “ADF Rule {#}” where the number is a variable that is incremented by 1 in each execution of the loop. My rule name for the first firewall rule is “ADF Rule 1”. You could do other things with this, such as adding the date in the name of the rule.

Once I have added the ADF-related firewall rules, I go back and add the static IP addresses for the organization.

#Static IPs 
$CorpIP = "X.X.X.X"
$CorpIP2 = "X.X.X.X"

#re-add static IP to Storage
Add-AzStorageAccountNetworkRule `
  -ResourceGroupName $storageRG `
  -Name $storageacctname `
  -IPAddressOrRange $CorpIP

Add-AzStorageAccountNetworkRule `
  -ResourceGroupName $storageRG `
  -Name $storageacctname `
  -IPAddressOrRange $CorpIP2

#re-add static IP to SQL

New-AzSqlServerFirewallRule `
  -ResourceGroupName $sqlRG `
  -ServerName $sqlservername `
  -FirewallRuleName "CorpIP" `
  -StartIpAddress $CorpIP `
  -EndIpAddress $CorpIP

New-AzSqlServerFirewallRule `
  -ResourceGroupName $sqlRG `
  -ServerName $sqlservername `
  -FirewallRuleName "CorpIP2" `
  -StartIpAddress $CorpIP2 `
  -EndIpAddress $CorpIP2

If you know you have more than 2 simple IP addresses to add back at the end, you could create an array of IP addresses and loop through them. I decided to keep my script simple here, but you should do what balances reusability with maintainability in your scenario.

You can get the full script on GitHub.

Thanks to Prashanth Kumar and Tao Yang for making their code available on their blogs, which helped me create my solution tailored for ADF IP ranges.

Azure, Databricks, Microsoft Technologies, Unity Catalog

Databricks Unity Catalog primary key and foreign key constraints are not enforced

I’ve been building lakehouses using Databricks Unity Catalog for a couple of clients. Overall, I like the technology, but there are a few things to get used to. This includes the fact that primary key and foreign key constraints are informational only and not enforced.

If you come from a relational database background, this may bother you a bit, since you are probably used to relying on enforced constraints to maintain referential integrity. The constraints are still useful when paired with data exploration or analytics tools that can detect them and use them to help a user write a query or build a semantic model.

Let’s look at an example of the consequences of primary key constraints not being enforced.

First, let’s create a new catalog and schema in a Unity Catalog metastore.

Create Catalog devcatalog1;
Use catalog devcatalog1;
Create schema myschema;

Next, we create a table with a primary key constraint and insert two rows.

CREATE TABLE IF NOT EXISTS devcatalog1.myschema.table1 
 (id Int NOT NULL, columnB String, CONSTRAINT table1_pk Primary Key(id));

INSERT INTO TABLE devcatalog1.myschema.table1 
VALUES
  (1, "one"),
  (2, "two");

Then we can create a second table that has a foreign key constraint that goes back to the first table.

CREATE TABLE IF NOT EXISTS devcatalog1.myschema.table2
(table2id INT NOT NULL Primary Key, table1id INT NOT NULL, EventDate date NOT NULL,  
CONSTRAINT table2_table1_fk FOREIGN KEY(table1id) REFERENCES devcatalog1.myschema.table1);

INSERT INTO TABLE devcatalog1.myschema.table2
VALUES 
(1, 1, '2023-06-15'), 
(2, 1, '2023-06-15'), 
(3, 2, '2023-06-15');

If we then insert a new row into table1 that has an id value of two, we are violating our primary key constraint. But nothing happens (no error is thrown) because it is unenforced.

INSERT INTO TABLE devcatalog1.myschema.table1 
values (2, 'three');

Querying table1 shows our duplicate values in the id column.

Now we don’t know which row the foreign key constraint in table2 was supposed to reference.

What do we do now?

Unenforced constraints are not the end of the world. Many people argue that referential integrity should be maintained in the business logic layer code that populates the tables rather than in the database engine. Enforced constraints in the database are a failsafe when the code that updates the tables fails to maintain that referential integrity. For now, we just have to make sure that the way we populate the tables does not allow us to violate these constraints, even when Unity Catalog doesn’t stop us. You may have to check whether rows exist before inserting new ones. And you may need data quality checks after batch loads to make sure something wasn’t missed. But you likely needed those things anyway. This is definitely something that can be worked around. And it’s likely an issue in many MPP systems, so it is something to think through and have a design pattern ready to handle.
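As a simple example of that kind of check, a query like the sketch below (using the table from this post) would surface any id values that violate the informational primary key:

SELECT id, count(*) AS row_count
FROM devcatalog1.myschema.table1
GROUP BY id
HAVING count(*) > 1;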

Note that as of November 2023, primary key and foreign key constraints are in preview in Databricks, so there may be more enhancements to come.

Get the notebook

If you need to explain this concept to someone else, feel free to grab my example notebook from GitHub.

Databricks, Microsoft Technologies, Python

Parameterize your Databricks notebooks with widgets

Widgets provide a way to parameterize notebooks in Databricks. If you need to call the same process for different values, you can create widgets to allow you to pass the variable values into the notebook, making your notebook code more reusable. You can then refer to those values throughout the notebook.

Note: There are also ipywidgets, which are visual elements that allow users to specify parameter values in notebook cells. These are different and have a different purpose. Databricks widgets are used to pass values between notebooks or to a job. ipywidgets do not support this functionality.

There are 4 types of Databricks widgets:

  • text
  • dropdown
  • combobox
  • multiselect

You can create widgets in a notebook using Python, Scala, R, and SQL. I typically use Python or SQL, so my examples will be constrained to those two languages.

For all widgets, you can specify a name, default value, and label. For widgets other than the text type, you can specify the list of allowed values.

Widget Creation

Let’s look at widget creation.

%python
# Make some widgets in Python
dbutils.widgets.text("ProcessName", "Process A", "Name of the calling process")
dbutils.widgets.dropdown("State", "Colorado", ["Colorado", "Missouri", "Kansas", "Nebraska", "Texas"])

The above Python code makes the widgets shown below. The label argument used in the ProcessName widget is optional.

Databricks notebook with two widgets at the top. One is a textbox with the label "Name of the calling process". The other is a dropdown box named "State" with 5 values and a default value set.

Here are some examples of widgets created in SQL.

%sql
Use catalog devcatalog1;
Use schema myschema;
CREATE WIDGET TEXT Datalake DEFAULT "";
CREATE WIDGET COMBOBOX Product DEFAULT 'Product A' CHOICES SELECT DISTINCT ProductName FROM Products LIMIT 3

The above SQL code makes the widgets shown below. Notice that I provided an empty string for the default in the Datalake text widget. In the combobox widget, I used a query to populate the list of available choices.

Databricks notebook with two widgets at the top. One is a textbox with the name "Datalake". The other is a combo box named "Product" with 3 values shown and the user typing in the box.

When you first see the Product widget, “Product A” is selected, but as soon as you click on it, the default is cleared and you can type in whatever you want. The items shown will filter based upon what you type, so you can choose one of them or submit your own value.

Once your widgets are created, you can delete/comment out the code you used to create them and the widgets will remain in place.

Configuring Widgets in Your Notebook

In the widgets panel in your notebook, there are some configurations you can change. You can click the Edit button and drag the corner of any widget to change the width. You can select the widget label to move the widget to a different position, which changes the order of the widgets. By default they are shown alphabetically by label (which defaults to name if no label is provided).

Changing the width and order of widgets in a Databricks notebook. The selected widget shows up with a pink highlight.

When you click the settings button, you can adjust what happens when you change a widget value. There are three options:

  • Run Notebook: Every time a new value is selected, the entire notebook is rerun.
  • Run Accessed Commands: Every time a new value is selected, only cells that retrieve the values for that particular widget are rerun. This is the default setting when you create a widget. SQL cells are not rerun in this configuration.
  • Do Nothing: Every time a new value is selected, nothing is rerun.

I tend to set my widgets to do nothing, especially when I am first building and testing them.

The settings button also allows you to select whether to keep your widgets visible at the top of the notebook, or allow them to stay in the position on the page, becoming hidden as you scroll further down.

Referencing Widgets

Once you have widgets, you can get the current value as shown below.

%python
# Get widget value Python
stateval = dbutils.widgets.get("State")
processval = dbutils.widgets.get("ProcessName")
print("State: " + stateval + " Process: " + processval)
%sql
-- Get widget value SQL
Select "${State}" as State, "${ProcessName}" as Process

In one project I have worked on, we have the same objects replicated across multiple schemas in Unity Catalog, and there is basically a schema per country. All of the countries are in one catalog, and the catalogs represent environments (e.g., dev, test, prod).

So we have a notebook with widgets that is run once for each country/schema. After retrieving the widget values and storing them in variables, we use them to set the catalog and schema and tell the notebook where to create a staging table in Unity Catalog.

%python
# Set the current schema.
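# Assumption: catalogfilterv and schemafilterv were set earlier in the notebook,
# e.g., catalogfilterv = dbutils.widgets.get("CatalogFilter")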
catalogstr = "USE CATALOG " + catalogfilterv + ";"
spark.sql(catalogstr)
schemastring = "Use schema " + schemafilterv
#print(schemastring)
spark.sql(schemastring)
# Write to staging table
finaldf.write.saveAsTable("stage_metricsreport")

In another data engineering notebook that gets called once per country, we are parsing a bunch of JSON files stored in a particular folder. We pass that folder location as a parameter to the notebook, facilitated by a widget named “JsonFolder”.

%sql
with base as (select explode(element_at(results.tables.rows[0],1 ) ) r
from json.`${JsonFolder}`
)
Select * from base

Widget Tips

Here are a few things to keep in mind as you use Databricks widgets:

  • You can create a widget in Python and reference it in SQL.
  • You can pass values to a widget when calling a notebook programmatically (see the sketch after this list).
  • You need the Can Manage permission for a notebook in order to configure the widget layout.
  • The setting to pin the widget panel is maintained per user, not just per notebook.
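As an example of that second tip, here is a minimal sketch of calling a notebook and passing widget values programmatically. The notebook path and values are hypothetical.

%python
result = dbutils.notebook.run(
    "/Shared/metrics_report",  # hypothetical notebook path
    600,                       # timeout in seconds
    {"State": "Texas", "ProcessName": "Nightly Load"}
)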

How do you use Databricks widgets?

If you have implemented or seen a creative or really effective use of widgets in Databricks notebooks, feel free to share in the comments.

Go forth and parameterize!

Microsoft Technologies, SQL Server, SSAS

Restoring SSAS Cubes to a SQL 2022 Server with CU5

I have a client who was upgrading some servers from pre-2022 versions of SQL Server to SQL Server 2022 CU7. They had some multidimensional SSAS cubes that were to go on the new server. But they ran into an issue after the upgrade. After restoring a backup of an SSAS database to the new server they found that they could no longer modify the data source using SSMS.

The error SSMS produced was: “Errors in the metadata manager. Multidimensional database ‘DBName’ is not using latest encryption schema. Please create a backup file and restore DB from backup file with the option EnsureProperEncryption to upgrade to the latest encryption.”

SQL 2022 CU1 and later includes enhanced encryption for “certain write operations to the model database schema”. SQL 2022 CU5 includes a bug fix with the description “Adds additional enforcement of write operations to the encryption algorithm that’s used to encrypt data sources and connection strings in SQL Server Analysis Services (SSAS) models”.

The solution required two steps, as stated in the error message:

  1. Back up the SSAS database.
  2. Restore with the EnsureProperEncryption option enabled.

While you could do the backup using the SSMS GUI, the option to ensure proper encryption was not available there. Microsoft Support gave us the XML to perform the backup and restore, so I’m sharing it here in case someone else runs into this issue. You can execute these scripts from an XMLA query window in SSMS connected to the SSAS instance.

Backup

<Batch xmlns="http://schemas.microsoft.com/analysisservices/2003/engine" Transaction="false">
  <Backup xmlns="http://schemas.microsoft.com/analysisservices/2003/engine">
<Object>
<DatabaseID>your_database_id</DatabaseID>
</Object>
<File>your_backup_file_pathname</File>
<AllowOverwrite>true</AllowOverwrite>
<ApplyCompression>false</ApplyCompression>
</Backup>
</Batch>

Restore

<Batch xmlns="http://schemas.microsoft.com/analysisservices/2003/engine" Transaction="false" xmlns:ddl100_100="http://schemas.microsoft.com/analysisservices/2008/engine/100/100" xmlns:ddl100="http://schemas.microsoft.com/analysisservices/2008/engine/100">
<Restore xmlns="http://schemas.microsoft.com/analysisservices/2003/engine" xmlns:ddl922="http://schemas.microsoft.com/analysisservices/2022/engine/922">
<File>your_backup_file_pathname</File>
<AllowOverwrite>true</AllowOverwrite>
<ddl922:EnsureProperEncryption>true</ddl922:EnsureProperEncryption>
</Restore>
</Batch>

Remember that the backup file should have a .ABF extension and the file path should be a UNC path.

Up and Running

After the backup and restore using the code above, the SSAS database was back in action and the client was able to modify the data source settings again.

Data Visualization, Deneb, Microsoft Technologies, Power BI

Quick Tip About Fonts in Deneb Visuals in Power BI

This week, I was working with a client who requested I use the Segoe UI font in their Power BI report. The report contained a mix of core visuals and Deneb visuals. I changed the fonts on the visuals to Segoe UI and published the report. But my client reported back that they were seeing serif fonts in some visuals. I couldn’t replicate this on my machine while viewing the report in a web browser or in Power BI Desktop.

After a bit of investigating, it was determined that the issue only occurred inside Deneb visuals when viewed in a browser on a Mac. I reached out to Daniel Marsh-Patrick, maker of the Deneb custom visual, to see if he had a solution. He suggested that “Segoe UI” may not be a sufficient name for the font outside of Windows and that I could add some fallback fonts in my font properties. A fallback font is a font face that is used when the primary font face is not available when a page is loaded. 

Segoe UI is not a font that is natively available on Mac OS. It is available in the font library for Office on Mac OS, but that doesn’t help when viewing a report in a web browser.

So, I updated the configuration in my Deneb visuals. Anywhere that I had defined the font, titleFont, or labelFont properties, I changed the value from "Segoe UI" to "Segoe UI, wf_segoe-ui_normal, helvetica, arial, sans-serif". You can see what this looks like in my config settings below.
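Trimmed to just the font-related properties, that ends up looking something like the snippet below. Your config will have more in it, and exactly where the font properties live depends on your spec.

{
    "font": "Segoe UI, wf_segoe-ui_normal, helvetica, arial, sans-serif",
    "axis": {
        "labelFont": "Segoe UI, wf_segoe-ui_normal, helvetica, arial, sans-serif",
        "titleFont": "Segoe UI, wf_segoe-ui_normal, helvetica, arial, sans-serif"
    }
}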

The config window in the Deneb custom visual editor showing the config  settings for the fonts now include fallback fonts
A portion of the config settings from a Deneb visual in Power BI with fallback fonts added

Lessons Learned

I learned two things from this:

  1. Test the Power BI report on a Mac to ensure a consistent experience.
  2. Make a habit of adding fallback fonts instead of specifying just one font.

I really enjoy the power and flexibility of the Deneb visual. But I’m definitely still learning about some of the nuances of using Vega-Lite such as the font issue above (which is handled by Power BI Desktop behind the scenes when we use core visuals).

Azure, Microsoft Technologies, Power BI

Creating and Configuring a Power BI VNet Data Gateway

If you are using Power BI to connect to a PaaS resource on a virtual network in Azure (including private endpoints), you need a data gateway. While you can use an on-premises data gateway (the type of Power BI gateway we have had for years), there is an offering called a virtual network data gateway that is currently in preview.

The VNet data gateway securely communicates with the data source, executes queries, and transmits results back to the service, just like the on-premises data gateway. But it doesn’t require us to provision a virtual machine in the same network (or a peered network) of our Azure data source.

What to Know Before You Start

There are a few limitations to be aware of before you use a VNet Data Gateway.

  1. While in public preview, virtual network data gateways are available only in workspaces configured with a license mode of Power BI Premium, Premium Per User (PPU), or Power BI Embedded. If you are sticking with Pro licenses, you aren’t currently able to use VNet data gateways.
  2. VNet data gateways are not available in all regions. The gateway region has to match the region of the target VNet, not your Power BI home region. That being said, there are some regions currently missing from the list, such as West US 3, Canada East, South Africa West, and all sovereign regions.
  3. Power BI dataflows and datamarts are not supported.
  4. Your VNet must be in the same tenant as the Power BI tenant containing the VNet data gateway.

There are some other limitations, which you can read about here, but the four above were the big ones for me.

Creating Your Virtual Network Data Gateway

You can create a VNet data gateway in PowerBI.com. You may also need access to your Azure subscription. There are three steps to creating a VNet data gateway:

  1. Register Microsoft.PowerPlatform as a resource provider in the subscription containing the target VNet.
  2. Add a new subnet to your VNet and associate it with the Microsoft Power Platform.
  3. Create the VNet data gateway in Power BI or the Power Platform Admin center.

To register the resource provider, go to your subscription in the Azure Portal, select Resource Providers on the left, and search for “Microsoft.PowerPlatform”. If that provider is not registered, select it and choose Register.

The subscription page in the Azure Portal showing Resource Providers. The Microsoft.PowerPlatform resource is registered.
Make sure the Microsoft.PowerPlatform resource provider is registered in your Azure subscription

On your target VNet, you’ll need to create a new subnet. During the creation of the subnet, make sure to set the Subnet Delegation to Microsoft.PowerPlatform/vnetaccesslinks.

Creating a new subnet on a VNet in Azure. The subnet delegation is set to  Microsoft.PowerPlatform/vnetaccesslinks.
When creating a subnet in an Azure VNet, make sure to set the subnet delegation for use with the VNet data gateway
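If you prefer to script steps 1 and 2 rather than click through the portal, here is a minimal Az PowerShell sketch. The resource names, subnet name, and address prefix are placeholders.

# Register the resource provider in the subscription containing the VNet
Register-AzResourceProvider -ProviderNamespace "Microsoft.PowerPlatform"

# Add a delegated subnet to the target VNet
$vnet = Get-AzVirtualNetwork -Name "my-vnet" -ResourceGroupName "my-rg"
$delegation = New-AzDelegation -Name "PowerPlatformDelegation" `
    -ServiceName "Microsoft.PowerPlatform/vnetaccesslinks"
Add-AzVirtualNetworkSubnetConfig -Name "pbi-gateway-subnet" `
    -AddressPrefix "10.0.2.0/27" -VirtualNetwork $vnet -Delegation $delegation
$vnet | Set-AzVirtualNetwork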

If you did not successfully complete steps 1 and 2, you won’t be able to select the VNet and subnet in step 3.

To create the VNet Data Gateway in Power BI, go to the settings button, then choose Manage Connections and Gateways. The page will open on Connections, so you’ll need to switch to Virtual network data gateways.

The Power BI Manage Connections and Gateways page with Virtual network data gateways selected
On the Manage Connections and Gateways page, select the Virtual network data gateways in the top navigation

After you click new, you’ll have to choose your subscription, resource group, VNet, subnet, a name for your gateway, and high availability/load balancing settings.

If you’ve accidentally chosen a region where the VNet data gateway is unavailable, you will get the error “Invalid request”. (Helpful, I know.)

Using Your VNet Data Gateway

You’ll need to create a data source in Power BI, just like you would if you were using an On-Premises data gateway. But you’ll need to select Virtual Network Data Gateway at the top before you enter your connection info. Your new VNet data gateway should be available for selection in the list.

The VNet data gateway can be used for imported or DirectQuery connections. If you want to enable SSO via Azure AD for DirectQuery queries, you can. Otherwise, the credentials you enter on this connection will be used to connect for imported dataset refresh or DirectQuery.

After you publish a dataset to PowerBI.com that connects to the data source you just defined in Connections, you’ll need to configure the dataset to use that connection.

In the dataset settings, when you expand Gateway Connections, you should see the data source listed. There will be a drop-down box labeled “Maps to:”. Use that to associate the data source with the previously created connection.

Gateway connections on the dataset settings page in Power BI. The Maps To setting is set to the connection created in the previous step.
Use the Maps to setting to associate a data source to a connection that uses the VNet data gateway

Then you are all set.

While there are some limitations, and it’s unfortunate that it only works with Premium/PPU/embedded workspaces at the moment, it’s nice not to have to manage a virtual machine in Azure just to allow Power BI to connect to my Azure SQL database via a private endpoint.

Data Visualization, Microsoft Technologies, Power BI

Enhancements I’d Like to See in the Power BI Treemap Visual

I recently created a treemap in Power BI for a Workout Wednesday challenge. Originally, I had set out to make a different treemap, but I ran into some limitations with the visual. I ended up with the treemap below, which isn’t bad, but it made me realize that the treemap is in need of some improvements to make it really useful. So I decided to share my thoughts here.

A Power BI treemap that groups S&P 500 stocks by sector with rectangle sizes representing market cap.
A treemap built in Power BI

Performance

The treemap above takes a long time to render. I populated the treemap with one field in Category, one in Details, one in Values, and four fields in Tooltips. The fields all come from an imported dataset containing a single table with 239 rows. I’m running the latest version of Power BI Desktop (July 2023) on a machine with an Intel Core i9-11900H processor (8 cores/16 threads, 2.5 GHz clock speed) and 64 GB of RAM. This is likely more than the average Power BI Desktop user has available to them.

Changing the color of one category in the format pane consistently requires a wait time of about 14 seconds for me before the treemap finishes re-rendering.

If I use Performance Analyzer to capture this action, we can see the last category color change took 15463 milliseconds (15 seconds), and almost all of that time was spent on visual display.

A cell in the Performance Analyzer results that shows the treemap taking 15463 ms to render. 15386 ms are used for visual display, and 77 ms are used for Other.
The Power BI treemap took over 15 seconds to render

This seems to be related to the presence of the tooltip fields. If I remove all the tooltip fields, it takes between 1 and 2 seconds, which is much more acceptable.

Category Hierarchies and Drilldown

Power BI allows us to include multiple categories and the ability to drill up/down through them. This is good because treemaps are really meant for large amounts of hierarchical data. If you drill down from the top level, it will concatenate the category names. For example, if my top level is sector and my second level is country, and I choose to expand all down one level in the hierarchy, I will see a value like “Technology United States”. While that is a good start, it would be great to include some kind of delimiter in there so it’s easier to read; e.g., “Technology | United States” or “Technology ↳ United States”. The other problem with drilldown and category labels is that they get truncated when they hit the edge of the category, and there is no way to have it wrap to a new line or even just make the text for that category smaller to fit without changing all category labels to be smaller.

A treemap with a category label that has been truncated
Category names that don’t fit in the section of the treemap are truncated

The more important feature for me is to be able to see multiple levels of the category at once, instead of treating them like all one level when you drill down. In Power BI today, when you expand down through multiple levels of the category hierarchy, you just get more colors (1 color per combination of values in the categories). That gets messy pretty quickly. I want to be able to make something like Figure 1 in this blog post.

Part of what makes that tree map work is that there is more space/wider borders between the top-level categories. Currently, the Power BI treemap doesn’t provide me with any control over the borders around categories. I want to be able to make the border around categories thicker and change the color. The color is important for accessibility. I need to be able to create good contrast between the rectangle background color, the category border, and the visual background, so it’s clear where a category ends. Maybe that can be accomplished with border width alone, but being able to change colors would be nice.

It would also be great to be able to label the top-level category outside of the rectangles, as is shown in this treemap.

Details and Labels

The other big feature I think we need is better control over the detail labels. We recently (May 2023) got the ability to create measure-driven data labels in some visuals such as bar, column, and line charts. I’d like to see that added to treemaps as well. What’s currently missing from measure-driven data labels is the ability to add line breaks. I think line breaks would help with fitting information into the rectangles.

It would also be nice to be able to control the border color and width around the detail rectangles as well. Right now we can’t change the rectangle borders in the chart, and it can make the rectangles hard to see depending on the color used and the chart background color.

As with categories, it would be nice to allow the detail label size to grow/shrink with the size of the rectangle.

Colors

Currently, the Power BI treemap allows conditional formatting of colors only if you don’t have any fields in the Details field well. In my treemap, I wanted to see the individual stocks grouped by sector. The only way to accomplish that was to put the stock symbol in the Details. If I put both sector and stock symbol in the categories and expand all the categories down one level, I can see the sector title and the individual rectangles per stock, but the stocks are no longer grouped by sector. The stocks related to technology are not placed next to each other in the treemap. So while I can implement conditional formatting to color all technology stocks the same color, I lose the ability to locate all technology stocks together.

A Power BI treemap with no categorical grouping that looks very chaotic.
A Power BI treemap with sector and stock symbol placed in categories and conditional formatting used to color all technology stocks the same color

The inability to use conditional formatting on treemap colors when I have a field in the Details also means I cannot use proximity/location to encode the category and use color to encode another attribute or measure. That is what is done in the finviz treemap: they group stocks by sector and use size to indicate market cap, but the color is used to indicate stock price change over time.

Other Visual Improvements

Right now, I think the limitations of the treemap visual keep it from being really useful. But I am excited to see the visualization improvements to other core visuals become available after months (years?) of stagnation. If you’d like to learn more about the planned improvements for core visuals, you can check out this AMA with Miguel Myers (PM at Microsoft who has set a new vision for core visuals in Power BI). You may also want to keep an eye on the PBI Core Visuals posts on LinkedIn. These planned improvements are, in my opinion, rightfully a higher priority than fixing the treemap.

Once the more general improvements across visuals are made (some of which may affect the treemap along with other visuals) and the most commonly used visuals (line charts and bar charts) are updated, I hope to see some of my requested improvements to the treemap visual.

Share Your Thoughts

What would you add to my list of treemap improvements? Are there any other enhancements you’d like to see? Do you know any cool tricks or workarounds for treemaps? Please share your thoughts in the comments below.