I’ve now helped people with this issue a few times, so I thought I should blog it for anyone else that runs into the “mystery”.
Here’s the scenario: You are using Python, perhaps in Azure Databricks, to manipulate data before inserting it into a SQL Database. Your source data is a flattened data extract and you need to create a unique list of values for an entity found in the data. For example, you have a dataset containing sales for the last month and you want a list of the unique products that have been sold. Then you insert the unique product values into a SQL table with a unique constraint, but you encounter issues on the insert related to unique values.
Given a data frame with one column containing product names, you might write some Python like df.dropDuplicates(subset="ProductName").
This would give you a list of distinct product names. But Python is case sensitive. So if you have product “abc123” and product “ABC123” those are considered distinct.
Case sensitivity in SQL Server
SQL Server is case insensitive by default. While you can use a case sensitive collation, most databases don’t.
Let’s say I create a table to contain products and add a unique constraint on product name. Adding a unique constraint means that I cannot insert two values into that column that are the same.
CREATE TABLE [dbo].[Products] (
[ProductName] nvarchar(100) NOT NULL
ALTER TABLE [dbo].[Products]
ADD CONSTRAINT UQ_Product_ProductName UNIQUE (ProductName)
I can insert my first row.
Insert into dbo.Products (ProductName)
Now I try to insert another row where the letters are capitalized.
Insert into dbo.Products (ProductName)
This fails with the following error:
Violation of UNIQUE KEY constraint 'UQ_Product_ProductName'. Cannot insert duplicate key in object 'dbo.Products'. The duplicate key value is (ABC123).
This is because my database collation is set to SQL_Latin1_General_CP1_CI_AS (the CI means case insensitive).
If you need to get to a list of case insensitive distinct values, you can do the following:
Create a column that contains the values of the ProductName converted to lower case.
Of course, you need be sure that you don’t need to keep both (differently cased) versions of the value as distinct before you follow the above steps. That’s a business/data decision that must be considered. If you are fine to only have one value regardless of case, this should work. Be careful if you further use this data in Python. pandas.DataFrame.merge (similar to a SQL join) is case sensitive, as are most Python functions. Make sure you are handling your data correctly there, or just do your joins before you deduplicate.
Data classifications in Azure SQL DW entered public preview in March 2019. They allow you to label columns in your data warehouse with their information type and sensitivity level. There are built-in classifications, but you can also add custom classifications. This could be an important feature for auditing your storage and use of sensitive data as well as compliance with data regulations such as GDPR. You can export a report of all labeled columns, and you can see who is querying sensitive columns in your audit logs. The Azure Portal will even recommend classifications based upon your column names and data types. You can add the recommended classifications with a simple click of a button.
You can add data classifications in the Azure Portal or via T-SQL or PowerShell. Data classifications are database objects.
ADD SENSITIVITY CLASSIFICATION TO
WITH (LABEL='Confidential', INFORMATION_TYPE='Contact Info')
sys.all_objects.name as [TableName],
sys.all_columns.name as [ColumnName],
left join sys.all_objects on sys.sensitivity_classifications.major_id = sys.all_objects.object_id
left join sys.all_columns on sys.sensitivity_classifications.major_id = sys.all_columns.object_id
and sys.sensitivity_classifications.minor_id = sys.all_columns.column_id
Be Careful When Loading With CTAS and Rename
One issue that is specific to using data classifications in Azure SQL DW is that it is possible to inadvertantly drop your classifications when you are loading your tables using the recommended T-SQL load pattern. Typically, when using T-SQL to load a dimensional model in Azure SQL DW, we perform the following steps:
Create an upsert table via CTAS with the results of a union of new data from a staging table with existing data from the dimension table
Rename the dimension table to something like Dimension_OLD
Rename the upsert table to Dimension
Drop the Dimension_OLD table
In the animation above, you’ll first see the load process as described, and then it will replay with sensitivity labels added to the dimension table. You’ll see that they are dropped when we drop the old dimension table. This makes sense because sensitivity classifications are objects related to that table. We would expect an index to be dropped when we drop the related table. This works the same way.
Check out my SQL notebook for a demonstration of the issue as well as my workaround that I describe below. If you spin up an Azure SQL Data Warehouse with the sample database, you can run this notebook from Azure Data Studio and see the results for yourself.
There are a few complicating factors:
There are currently no visual indicators of sensitivity classifications in SSMS or Azure Data Studio.
ETL developers may not have access to the data warehouse in the Azure Portal to see the sensitivity classifications there.
The entire process of adding and managing sensitivity classifications may be invisible to an ETL developer. A data modeler or business analyst might be the person adding and managing the sensitivity classifications. If the ETL developer isn’t aware classifications have been added, they won’t know to go and look for them in the sys.sensitivity_classifications view.
SSDT does not yet support sensitivity classifications. The only way I have found to add them into the database project is as a post-deployment script with the build property set to none.
The good news is that you can add the sensitivity classifications back to your dimension table using T-SQL. The bad news is still that the ETL developer must remember to do it. My workaround for now is a stored procedure that will do the the rename and drop of the tables plus copy the sensitivity classifications over. My hope is that it it’s easier to remember to use it since it will do the rename and drop for you as well.
Update: Someone asked about the name SwapWithMetadata and why it doesn’t specifically mention sensitivity classifications. I didn’t mention classifications because there are other things that need this same treatment. Dynamic data masking will also need to be reapplied. With dynamic data masking, it will be even more important to add it back immediately after swapping the tables rather than waiting for a full data load of all selected tables to finish and adding all classifications back. If your load takes a long time or the process fails on another table, you don’t want your data exposed without a mask to users who shouldn’t see the full information.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Eventually, the tools will be updated to provide more visibility to data sensitivity classifications, but we still need to make sure they don’t get dropped.
For now, my recommendation is if you are going to go in and add a lot of sensitivity classifications, that you create a user defined restore point immediately after so that you know you have them in a backup somewhere. Azure SQL DW doesn’t do point-in-time restores the way Azure SQL DB does. It takes automatic restore points every 8 hours or so. So if someone went through the trouble of adding the sensitivity classifications and they were dropped through the data load process, there is no guarantee that you could use a backup to get them back.
Vote for My Enhancement Idea
If you would like Microsoft to add something to the product to keep sensitivity classifications from being dropped, or at least make it easier to add them back, please vote for my idea.
Not an Issue with Other Data Load Methods
Please note that if you are using other tools or methods to load your tables where you don’t swap them out, you won’t have the issue of dropping your sensitivity classifications. But I wanted to bring up this issue because I can see people spending a lot of time adding them and then suddenly losing them, and I want everyone to avoid that frustration.
Give Data Classifications a Try
I think data classifications are a good addition to SQL DW. Anything that helps us efficiently catalog and manage our sensitive data is good. I have added them in my demo environment and hope to use them in a client environment soon.
Have you tried out data classifications in SQL DW or DB? What do you think so far? If not, what is keeping you from using them?