Azure, Azure SQL DW, Data Warehousing, Microsoft Technologies, T-SQL

What You Need to Know About Data Classifications in Azure SQL Data Warehouse

Data classifications in Azure SQL DW entered public preview in March 2019. They allow you to label columns in your data warehouse with their information type and sensitivity level. There are built-in classifications, but you can also add custom classifications. This could be an important feature for auditing your storage and use of sensitive data as well as compliance with data regulations such as GDPR. You can export a report of all labeled columns, and you can see who is querying sensitive columns in your audit logs. The Azure Portal will even recommend classifications based upon your column names and data types. You can add the recommended classifications with a simple click of a button.

You can add data classifications in the Azure Portal or via T-SQL or PowerShell. Data classifications are database objects.

ADD SENSITIVITY CLASSIFICATION TO
    dbo.DimCustomer.Phone
    WITH (LABEL='Confidential', INFORMATION_TYPE='Contact Info')

To view existing data classifications, you can query the sys.sensitivity_classifications view or look in the Azure Portal.

SELECT
sys.all_objects.name as [TableName], 
sys.all_columns.name as [ColumnName],
[Label], 
[Information_Type], 
FROM sys.sensitivity_classifications
left join sys.all_objects on sys.sensitivity_classifications.major_id = sys.all_objects.object_id
left join sys.all_columns on sys.sensitivity_classifications.major_id = sys.all_columns.object_id
    and sys.sensitivity_classifications.minor_id = sys.all_columns.column_id

Be Careful When Loading With CTAS and Rename

One issue that is specific to using data classifications in Azure SQL DW is that it is possible to inadvertantly drop your classifications when you are loading your tables using the recommended T-SQL load pattern. Typically, when using T-SQL to load a dimensional model in Azure SQL DW, we perform the following steps:

  1. Create an upsert table via CTAS with the results of a union of new data from a staging table with existing data from the dimension table
  2. Rename the dimension table to something like Dimension_OLD
  3. Rename the upsert table to Dimension
  4. Drop the Dimension_OLD table
Animation of a table load process in Azure SQL DW


In the animation above, you’ll first see the load process as described, and then it will replay with sensitivity labels added to the dimension table. You’ll see that they are dropped when we drop the old dimension table. This makes sense because sensitivity classifications are objects related to that table. We would expect an index to be dropped when we drop the related table. This works the same way.

Check out my SQL notebook for a demonstration of the issue as well as my workaround that I describe below. If you spin up an Azure SQL Data Warehouse with the sample database, you can run this notebook from Azure Data Studio and see the results for yourself.

There are a few complicating factors:

  • There are currently no visual indicators of sensitivity classifications in SSMS or Azure Data Studio.
  • ETL developers may not have access to the data warehouse in the Azure Portal to see the sensitivity classifications there.
  • The entire process of adding and managing sensitivity classifications may be invisible to an ETL developer. A data modeler or business analyst might be the person adding and managing the sensitivity classifications. If the ETL developer isn’t aware classifications have been added, they won’t know to go and look for them in the sys.sensitivity_classifications view.
  • SSDT does not yet support sensitivity classifications. The only way I have found to add them into the database project is as a post-deployment script with the build property set to none.

The good news is that you can add the sensitivity classifications back to your dimension table using T-SQL. The bad news is still that the ETL developer must remember to do it. My workaround for now is a stored procedure that will do the the rename and drop of the tables plus copy the sensitivity classifications over. My hope is that it it’s easier to remember to use it since it will do the rename and drop for you as well.

Update: Someone asked about the name SwapWithMetadata and why it doesn’t specifically mention sensitivity classifications. I didn’t mention classifications because there are other things that need this same treatment. Dynamic data masking will also need to be reapplied. With dynamic data masking, it will be even more important to add it back immediately after swapping the tables rather than waiting for a full data load of all selected tables to finish and adding all classifications back. If your load takes a long time or the process fails on another table, you don’t want your data exposed without a mask to users who shouldn’t see the full information.

CREATE PROC SwapWithMetadata
@SrcSchema NVARCHAR(128),
@SrcTable NVARCHAR(128),
@DestSchema NVARCHAR(128),
@DestTable NVARCHAR(128),
@TransferMetadata BIT,
@DropOldTable BIT
AS
BEGIN
SET NOCOUNT ON
BEGIN TRY
–Check if destination table exists
DECLARE @DestSchemaQualifiedTableName NVARCHAR(257)
SET @DestSchemaQualifiedTableName = @DestSchema + '.' + @DestTable
IF OBJECT_ID(@DestSchemaQualifiedTableName) IS NULL
BEGIN
DECLARE @DestErr NVARCHAR(MAX)
SET @DestErr = 'Table ' + @DestSchemaQualifiedTableName + ' not found'
RAISERROR(@DestErr, 15, 1)
END
–Check if source table exists
DECLARE @SrcSchemaQualifiedTableName NVARCHAR(257)
SET @SrcSchemaQualifiedTableName = @SrcSchema + '.' + @SrcTable
IF OBJECT_ID(@SrcSchemaQualifiedTableName) IS NULL
BEGIN
DECLARE @SrcErr NVARCHAR(MAX)
SET @SrcErr = 'Table ' + @SrcSchemaQualifiedTableName + ' not found'
RAISERROR(@SrcErr, 15, 1)
END
–Move destination table to destination_old. Move source table to destination
DECLARE @RenameSql NVARCHAR(MAX)
SET @RenameSql = 'RENAME OBJECT ' + @DestSchemaQualifiedTableName + ' TO ' + @DestTable + '_old; '
Set @RenameSql = @RenameSql + ' RENAME OBJECT ' + @SrcSchemaQualifiedTableName + ' TO ' + @DestTable
PRINT 'Executing ' + @RenameSql + ' …'
EXEC sp_executesql @RenameSql;
–drop temp table if it exists
IF OBJECT_ID('tempDB..#tempApplySensitivityClassificationsToTable') IS NOT NULL
DROP TABLE #tempApplySensitivityClassificationsToTable;
–check if we should transfer data classifications from old to new table
IF ISNULL(@TransferMetadata,0) = 1
BEGIN
–put current classifications in a temp table
DECLARE @OldTable NVARCHAR(128) = @DestTable + '_old';
WITH CurrentClassifications as (
SELECT
CAST('dbo' as NVARCHAR(128)) [Schema],
CAST(sys.all_objects.name as NVARCHAR(128)) [Table],
CAST(sys.all_columns.name as NVARCHAR(128)) [Column],
CAST([Information_Type] as NVARCHAR(128)) [Informationtype],
CAST([Label] as NVARCHAR(128)) [Label]
FROM
sys.sensitivity_classifications
LEFT OUTER JOIN sys.all_objects on sys.sensitivity_classifications.major_id = sys.all_objects.object_id
LEFT OUTER JOIN sys.all_columns on sys.sensitivity_classifications.major_id = sys.all_columns.object_id
and sys.sensitivity_classifications.minor_id = sys.all_columns.column_id
)
SELECT ROW_NUMBER() OVER (ORDER BY [Schema], [Table], [Column]) [ID],
[Schema], [Table], [Column], [Informationtype], [Label]
INTO #tempApplySensitivityClassificationsToTable
FROM CurrentClassifications
WHERE [Schema] = @DestSchema AND [Table] = @OldTable;
DECLARE @i INT
SET @i = 1
DECLARE @Max INT
SELECT @Max = COUNT(*)
FROM #tempApplySensitivityClassificationsToTable;
PRINT 'Transferring ' + CAST(@Max as VARCHAR(4)) + ' classifications'
–drop and recreate sensitivity classifications
DECLARE @Sql NVARCHAR(MAX)
DECLARE @Col NVARCHAR(128)
DECLARE @InfoType NVARCHAR(128)
DECLARE @Label NVARCHAR(128)
WHILE @i <= @Max
BEGIN
SELECT @Col = [Column], @InfoType = [InformationType], @Label = [Label]
FROM #tempApplySensitivityClassificationsToTable
WHERE Id = @i
SET @Sql = 'DROP SENSITIVITY CLASSIFICATION FROM ' + @DestSchemaQualifiedTableName + '.' + @Col
PRINT 'Executing ' + @Sql + '…'
EXEC sp_executesql @Sql
SET @Sql = 'ADD SENSITIVITY CLASSIFICATION TO ' + @DestSchemaQualifiedTableName + '.' + @Col
IF (@InfoType IS NOT NULL AND @Label IS NOT NULL)
BEGIN
SET @Sql = @Sql + ' WITH (LABEL=''' + @Label + ''', INFORMATION_TYPE=''' + @InfoType + ''')'
END
ELSE IF (@InfoType IS NOT NULL)
BEGIN
SET @Sql = @Sql + ' WITH (INFORMATION_TYPE=''' + @InfoType + ''')'
END
ELSE IF (@Label IS NOT NULL)
BEGIN
SET @Sql = @Sql + ' WITH (LABEL=''' + @InfoType + ''')'
END
ELSE
BEGIN
SET @Sql = NULL
END
IF (@Sql IS NOT NULL)
BEGIN
PRINT 'Executing ' + @Sql + '…'
EXEC sp_executesql @Sql
END
SET @i = @i + 1
END
END
IF ISNULL(@DropOldTable,0) = 1
BEGIN
DECLARE @DropSql NVARCHAR(MAX)
SET @DropSql = 'DROP TABLE ' + @DestSchemaQualifiedTableName + '_old;'
PRINT 'Executing ' + @DropSql + '…'
EXEC sp_executesql @DropSql;
END
END TRY
BEGIN CATCH
Print 'ERROR… Procedure: ' + ERROR_PROCEDURE() + ' Message: ' + ERROR_MESSAGE()
END CATCH
END

Eventually, the tools will be updated to provide more visibility to data sensitivity classifications, but we still need to make sure they don’t get dropped.

For now, my recommendation is if you are going to go in and add a lot of sensitivity classifications, that you create a user defined restore point immediately after so that you know you have them in a backup somewhere. Azure SQL DW doesn’t do point-in-time restores the way Azure SQL DB does. It takes automatic restore points every 8 hours or so. So if someone went through the trouble of adding the sensitivity classifications and they were dropped through the data load process, there is no guarantee that you could use a backup to get them back.

Vote for My Enhancement Idea

If you would like Microsoft to add something to the product to keep sensitivity classifications from being dropped, or at least make it easier to add them back, please vote for my idea.

Not an Issue with Other Data Load Methods

Please note that if you are using other tools or methods to load your tables where you don’t swap them out, you won’t have the issue of dropping your sensitivity classifications. But I wanted to bring up this issue because I can see people spending a lot of time adding them and then suddenly losing them, and I want everyone to avoid that frustration.

Give Data Classifications a Try

I think data classifications are a good addition to SQL DW. Anything that helps us efficiently catalog and manage our sensitive data is good. I have added them in my demo environment and hope to use them in a client environment soon.

Have you tried out data classifications in SQL DW or DB? What do you think so far? If not, what is keeping you from using them?

Data Visualization, Microsoft Technologies, Power BI

Turning a Corporate Color Palette into a Data Visualization Color Palette

Last week, I had a conversation on twitter about dealing with corporate color palettes that don’t work well for data visualization. Usually, this happens because corporate palettes are designed with websites and/or marketing collateral in mind rather than information graphic design. This often results in colors being too bright, dark, or dull to be used together in a report. Sometimes the colors aren’t easily distinguishable from each other. Other times, the colors needed for various situations (main color, ancillary colors, highlight color, error color, KPIs, text, borders) aren’t available in the corporate palette.

You can still stay on brand and create a consistent user experience with a color palette optimized for data visualization. But you may not be using the exact hex values as defined in the corporate palette. I like to say the data viz color palette is “inspired by” the marketing color palette.

I asked on twitter if anyone had a corporate color palette they needed to convert into a data visualization palette, and someone volunteered theirs. So this post is my walk-through of how I went about creating the palette.

Step 1: Identify a Main Color

There is often a main color in the corporate color palette. If that color is a medium intensity color, I usually include that color in my color palette as is. If it is excessively dark, light, or gray, I’ll either tweak the color a bit or use the second color in the color palette.

Step 2: Choose a Color Scheme

Next, I need to decide what kind of relationship the other colors will have with the main color. In other words, I have to decide what type of color scheme I want to use. I tend to go for monochromatic or analogous color schemes. Complimentary color schemes can be difficult, depending on your main color. I generally try to stay away from using reds and greens together in the same palette because it’s hard to stay colorblind-friendly and because the primary colors together can make it feel like a Christmas or kindergarten theme. I often try to reserve reds and oranges to draw attention to specific data points, but that isn’t a hard and fast rule.

I need 2 – 4 ancillary colors to go with my main color. I rarely need to use all 4 colors together in one chart, but there are some cases such as line charts with 4 series where that will be necessary. People can preattentively distinguish up to about 7 colors at once, so I need to use fewer than 7 colors in a single chart. If I encounter a situation where I feel like I need more than 4 colors together, I re-evaluate my choice of chart type and my use of color within the chart.

Also, I want the colors to be roughly the same level of brightness and intensity. Most importantly, the colors need to be easily distinguishable from each other.

RGB Color Wheel

Step 3: Choose Highlight and Error Colors

We often need to draw attention to specific data points to indicate that they require attention. This is usually because a value is outside of the expected range. KPIs are common in Power BI reports, I need to make sure I have a color to indicate “bad” statuses. I also like to have a highlight color that doesn’t necessarily mean “bad”, just “look here”. These highlight and error colors need to be noticeably different from my other colors so that they draw attention to the data points where they are used.

Step 4: Add Border and Background Colors

I like to add grays and browns to go with my color scheme. I’ll use them mostly for borders, grid lines, text, and light background shades. But also, I want to make sure I have 8 colors in my palette. If I have fewer than 8 colors, Power BI will add colors from the default palette at the end of my colors to fill out the full 8 columns.

Color Palette Creation Example

The original corporate color palette that I was given had a lot of colors.

Primary Corporate Colors
Secondary Corporate Colors

The primary colors go all the way around the color wheel. I definitely don’t want to use them all together. The secondary colors have the beginnings of a monochromatic blue palette, an analogous blue/green palette, or an analogous orange/red/purple palette.

I don’t need all of these different hues. I need 8 medium-intensity colors. Power BI will add black and white and provide the shades and tints for me.

Main Color

I’m keeping the main color as it is. It is bright and saturated enough to not be dull/boring and also not so bold as to leave no room for bolder colors to be used to highlight specific data points.

Chart using the main color of my palette

Color Scheme

I choose an analogous color scheme, which means I pick colors that are next to my main color on the color wheel. Since blue is my main color, I stick with cool colors for the ancillary colors.

Main color plus 3 ancillary colors

I want my 4 colors to be easily distinguishable from each other, and I want them to be roughly the same intensity and brightness.

Highlight and Error Colors

I’m adding yellow and red to my palette. The yellow can be a generic highlight color as well as a “caution” color. The red can be my “bad” color. I’m checking that my colors are easily distinguishable for various types of color vision deficiency.

I confirm that my highlight and error colors are easily distinguishable from the other colors for the most common types of color vision deficiency. I can also see here that my second and fourth colors look a bit similar on the deuteranopia line, so I’ll have to be careful how I use them together, perhaps switching to a shade or tint of the fourth color if needed.

Border and Background Colors

Now I add my grays and browns to use for formatting. This completes my color palette.

Full color palette for Power BI Theme

Power BI Theme

I can take the hex values for my colors and drop them in the color theme generator on PowerBI.Tips to get my JSON theme file.

{
"name":"Analogous",
 "dataColors":["#00aaff", "#00c2b2", "#213dca", "#7514ff", "#ffd500", "#ff002b","#768389","#987665"]
 }

When I import my theme file into my Power BI report, I get the additional tints and shades from the colors I provided.

Next I try out my new color theme in a report to see if I need to tweak any colors. This is the true test. The colors may look great in little boxes, but they might need to be altered to work on a full report page. The shade of purple that I used originally (not shown in this blog post) was a bit too intense compared to the other colors, so I replaced it with a slightly muted tint that better matched the other colors. That is the type of thing you will notice when applying your theme to a report. Don’t get too stuck on finding the exact perfect colors. Colors look slightly different on different screens. Just make sure nothing is inadvertently distracting.

Helpful Color Tools

I’m currently using https://color.mediaandme.be to create my color palettes. It’s free, and it allows me to add many (> 6) colors to my palette. Other benefits:

  • It shows me what all the colors look like together
  • It provides a colorblindness simulator
  • It lets me easily tweak hue, saturation, and brightness
  • It generates a link for the color palette I create so I can easily share it with others for feedback

When I need ideas for how to tweak a color, I use https://www.colorhexa.com. I picked the gray color in my palette by getting the grayest tone of my main color from ColorHexa.

Further Reading on Color in Data Visualization

Tiger Color: Basic Color Schemes – Introduction to Color Theory
Maureen Stone: Expert Color Choices
Billie Gray: Another post about colours for data visualisation. Part 3 — DIY Palettes
Christopher Healy: Perception in Visualization