Azure, Microsoft Technologies, PowerShell, SQL Server

Using Azure Automation to Shut Down a VM only if a SQL Agent Job is Not Running

I have a client who uses MDS (Master Data Services) and SSIS (Integration Services) in an Azure VM. Since we only need to execute the SQL Agent job that runs the SSIS packages infrequently, we shut down the VM when it is not in use in order to save costs. We wanted to make sure that the Azure VM did not shut down when a specific SQL Agent job was still running, so I tackled this with some PowerShell runbooks in Azure Automation.

I split the job into two parts. The first runbook simply checks if a specified SQL Agent job is running and returns a text value that indicates whether it is running. A parent runbook checks if the VM is started. If the VM is started, it calls the child runbook to check if the job is running, and then shuts down the VM if the job is not running.

It’s fairly easy and convenient to have nested PowerShell runbooks in Azure Automation. There are two main ways to call a child runbook.

  1. Inline execution
  2. Using the Start-AzureRmAutomationRunbook cmdlet

It was less obvious to me how to call a child runbook when the parent runs in Azure and the child runs on a hybrid worker, especially when you need to use the output from the child runbook in the parent. A hyrid runbook worker allows us to access resources that are behind a VNET or on premises.

Travis Roberts has a nice video on just this topic that gave me the answers I needed.

Below is my parent runbook.

# Ensures you do not inherit an AzureRMContext in your runbook
Disable-AzureRmContextAutosave –Scope Process

$connection = Get-AutomationConnection -Name AzureRunAsConnection
Connect-AzureRmAccount -ServicePrincipal -Tenant $connection.TenantID `
-ApplicationID $connection.ApplicationID -CertificateThumbprint $connection.CertificateThumbprint

$rgName='MyResourceGroup'
$vmName='MyVM'
$SubID = 'XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX'

$AzureContext = Select-AzureRmSubscription -SubscriptionId $SubID
'Check if VM is on'
$vm=((Get-AzureRmVM -ResourceGroupName $rgName -AzureRmContext $AzureContext -Name $vmName -Status).Statuses[1]).Code
 $vm 
 if ($vm -eq 'PowerState/running')
 {
    Do 
    {
        #if VM running call other runbook
        start-sleep -Seconds 60;
        'Check if job is running'
        $JobRunning = start-azureRMautomationrunbook -AutomationAccount 'ProgramsAutomation' -Name 'CheckRunningSQLJob' -ResourceGroupName $rgName -AzureRMContext $AzureContext -Runon 'Backups' -Wait;
        Write-Output $JobRunning
        
        
    } Until ($JobRunning -eq 'run0')
    
    'Stopping VM'
    stop-azurermvm -Name $VMname -ResourceGroupName $RgName -force
}

The runbook sets the Azure context to the appropriate subscription (especially important when you are a guest user in someone else’s tenant). Then it checks if the VM is started. If it is, it goes into a do-while loop. This task isn’t super time sensitive (it’s just to save money when the VM isn’t in use), so it’s waiting 60 seconds and then calling the child runbook to find out if my SQL Agent job is running. This makes sure that the child runbook is called at least once. If the result is that the job is not running, it stops the VM. If the job is running, the loop starts over, waiting 60 seconds before checking again. This loop is essentially polling the job status until it sees that the job is completed. One thing to note is the -Wait parameter on the end of that Start-AzureRmAutomationRunbook command. If you don’t specify the -Wait parameter, the command will immediately return a job object. If you specify the -Wait parameter, it waits for that child job to complete and returns the results of that job.

And here is my child runbook.

[OutputType([string])]

$SQLJobName = 'MySQLAgentJobName'
$SQLInstanceName = 'MySQLServer

$cred=Get-AutomationPSCredential -Name 'mycredential'
 
$server = Connect-DbaInstance -SqlInstance $SQLInstanceName -SqlCredential $cred
 
Get-DbaRunningJob -SqlInstance $server | Get-DbaRunningJob

$JobStatus = (Get-DbaRunningJob -SqlInstance $server).Name -match $SQLJobName

If ($JobStatus -ne $false) 
{
#job is running. Passing back a string because bits and ints were causing issues.
    $JobRunning = 'run1'
    Write-Output $JobRunning 
}
else 
{ 
#job is idle
    $JobRunning = 'run0'
    Write-Output $JobRunning 
}

I’m using dbatools to check if the job is running on the server. That is the Get-DBARunning Job command. The important part to note is that you have to use the Write-Output command for this output to be available to the parent runbook. I got some weird results when I tried to return an int or a boolean (it was returning an object rather than a single value), so I just went with a string. The string, while not the most efficient, works just fine. If you understand why this is, feel free to leave me a comment.

These runbooks have been in place for a couple of months now, and they are working great to shut down the VM to save money while making sure not to disturb an important SQL Agent job that might occasionally run late. I didn’t find much documentation nor many examples of using output from a child job that runs on a hybrid worker, so I wanted to get this published to help others that go searching.

Data Warehousing, Microsoft Technologies, SSIS

My Preferences for SSIS Design

Lately, I have been using SSIS execution frameworks and Biml created by other people to populate data marts and data warehouses. It has taught me a few things and helped me clarify what I like and dislike compared to my usual framework. I’ve got the beginning of my preferences list started below. There are probably situations where I would want to deviate from my preferences, but I think they make a good starting point.

Populating Data

  • For self-service BI environments, a date dimension that doesn’t go out much further than the greatest date in your data. This can be a view or stored procedure that limits and updates dates rather than a static date dimension that goes out until the end of time.
  • Unknown values are included in normal dimension loads, not in separate scripts that must be run on deployment. This way, if an unknown value is ever left out or deleted, it will be added in the next data load rather than requiring a special execution of a script.
  • Every table should have InsertDateTime and UpdateDateTime columns. The UpdateDateTime column should be populated with the same value as the InsertDateTime column upon creation of the row, rather than being left null.
  • Whatever you use to create tables, include primary keys, foreign keys, and indexes with your table definitions. Provide explicit constraint names to simplify database comparisons. You can disable your foreign keys, but they need to be there to provide that metadata.
  • Separate your final dimensional/reporting tables from audit tables and staging tables. This can be done with separate schemas or even separate databases.

Data Integration Process

  • There should be consistent error handling in each layer (staging, dims, facts, etc.). If you write errors to another location (flat file, database table), have a process that notifies the right people that errors occurred. The process of consuming corrected data must be built, tested, and integrated into the existing process.
  • Make your error handling process reflect what end users need to see when an error occurs. Does it make sense to have a partial load when there is an issue? Or should it be all or nothing?
  • Have smart master packages that determine which packages to run. Don’t check whether the package should run inside of the package itself – do that in the master package.
  • Master packages should execute child packages in parallel as much as possible rather than defaulting to sequential execution.
  • Have an audit log with one row per package. Include the SSIS ServerExecutionID in the audit log – not the package -specific ID but the execution ID for the entire run. If there are incremental loads, the where clause used to filter the load should be captured in the audit table. Include row counts as well as package start and stop time in your audit log.
  • Add an AuditLogID column on your dimension, fact, and staging tables so you can trace each row back to the process that populated it.
  • For dims and facts, perform change detection/deduplication of records, usually through hash values and either SSIS lookups or SQL queries with WHERE NOT EXISTS.
  • Avoid T-SQL MERGE statements. Write individual insert/update/delete statements. This avoid any bugs in MERGE and makes your SQL easier to understand and troubleshoot.
  • Use consistent naming of tasks, source, destinations, packages, connection managers, etc. Connection managers pointing to databases should have names that refer to the database rather than the server.
  • If you are downloading files, move the files to an archive folder once files are processed. You can have rules in place if you have retention limits. But you probably need to keep files from at least the last load for audit and troubleshooting purposes. This could change if you are importing very sensitive data.
  • Even if you need to copy all columns from a table, write a select statement for database sources that explicitly names fields rather than using SELECT *. or just selecting the table or view.
  • SSIS lookups should use an explicit query rather than referencing an entire table.
  • Implement restartability at the package level for most packages (you should have single-purpose packages executed by a master package). Checkpoints are ineffective within a package. If you build your audit log table correctly, you can get the list of packages that have not run in the last X minutes/hours and feed that to your master package.
  • Send email from your scheduling tool rather than within an SSIS package.
  • Track data lineage in your tables. This can be as simple as having a table that lists all of your data sources with an ID column and including that ID value in each row of your staging, fact, and dimension tables.
  • Dims and facts are not truncated. Data should be inserted and updated (and deleted, if necessary).
  • Connection strings used in multiple packages should be project-level connection strings.

Biml Specifics

  • Understand whether you need a flexible Biml Framework or just an accelerator for a current project. If you need flexibility, don’t hardcode connection strings and other things that change when you add/change sources and destinations. If you just need to accelerate development of a simple data mart, total flexibility may be overkill and actually cause more work.
  • Have a single place where you add synthetic metadata, as much as possible. BimlScript gets messy and difficult to understand when you have some extended properties that are read in, some annotations added directly, and some variables defined in your code. This is why I like synthetic metadata stored in a database. Also, extended properties don’t exist in Azure SQL Data Warehouse, so if you need your framework to work there you can’t go that route.
  • Don’t repeat your code in multiple files. If you have some logic that gets reused, move it to a separate file and reference it from other files.

What Do You Think?

What’s on your SSIS preferences list? Do you disagree with one of my preferences and want to share your knowledge? Let’s chat in the comments.

Data Visualization, Microsoft Technologies, Power BI

Check Out the Updated Violin Plot Power BI Custom Visual

I wrote about the violin plot custom visual by Daniel Marsh-Patrick back in February. I thought it was a good visual then, but version 1.3 has recently been released with some nice enhancements.

First, the violin plot is now a certified custom visual. This means that it has been tested by the Power BI team to ensure it meets certain requirements, one of which is that the visual does not access external services or resources. You can be confident your data isn’t being sent externally when you use the violin plot.

As for the functional enhancements, a new legend has been added. This is a great addition to make the chart clearer and more easily read, especially for audiences that may not be familiar with how the violin plot works. The customizable legend calls out what markers are used for mean, median, and quartiles.

Violin plot with the new legend

Another good enhancement is the new column option for the combo plot. It allows you to have your plot show as a range column chart where the bar spans from the minimum value to the maximum value for each category. I chose to show only the mean and median in the example below, but you can also add quartiles.

Violin plot using the new column plot type

The barcode plot also has a nice enhancement in the tooltip. Now when you hover over a bar, you can see the number of samples with the highlighted value.

You can check out Daniel’s blog post to see the full list of enhancements for this release. Tweet me if you make something cool and shareable with the violin plot in Power BI.

Data Warehousing, Microsoft Technologies, SQL Server, SSIS, T-SQL

Why We Don’t Truncate Dimensions and Facts During a Data Load

Every once in a while, I come across a data warehouse where the data load uses a full truncate and reload pattern to populate a fact or dimension. While it may not be the end of the world for a small table, it does concern me and I usually recommend to redesign the load. My thoughts below on why this is an anti-pattern are true for using the actual TRUNCATE TABLE statement as well as executing a DELETE statement with no WHERE clause.

Surrogate Keys

Dimensional models use surrogate keys rather than natural keys. Surrogate keys are system-generated, meaningless values that are usually integers used to uniquely identify a record. They provide good performance for joins in queries, allow us to switch or use multiple source systems to feed the same tables, and facilitate the use of slowly changing dimensions. If you truncate a dimension table and then repopulate it, you will end up with different surrogate keys assigned to your dimension values. Let’s say we have a Geography dimension that looks like the table below.

GeographyKeyCityStateProvinceCountryPostalCode
1DenverColoradoUnited States80205
2DallasTexasUnited States75201
3ParisÎle-de-FranceFrance75001

If I truncate and reload this table, there is no guarantee that Denver will have a GeographyKey value of 1. I might reload the data and then have Paris as 1, Denver as 2, and Dallas as 3. And since we use surrogate keys as dimension lookups in fact tables and bridge tables (and snowflaked dimensions, if you use those), I would now need to update every table that references this dimension. That’s a lot of table updates without a good reason.

Error Handling

Another reason to avoid truncating the tables in your dimensional model is error handling. When you design your data load, you need to think about what should happen when it fails and where it is most likely to fail. Failures will often occur either in data transformation steps or upon inserting/updating values in the destination table due to data type/size conflicts.

Let’s say you truncate your fact table and then you encounter an ETL failure while performing the transformations to reload the table. Now you have an empty fact table. If you have error handling logic in your ETL, you may be able to redirect the error rows to another location to be handled later. But that still leaves you with an incomplete table at best. If instead of truncating and reloading, you were doing inserts and updates when you encountered an error, your table might reflect slightly stale data as of the previous load. You still might have a situation where your table is partially updated with the rows that were inserted before the error was encountered depending on how your ETL design. But having a fact table with old or partially updated data is usually (but not always) more preferable than having no data when a data load process fails.

Performance and Data Availability

When you truncate and reload a table, you are assuming you have access to all of the source data needed to fully repopulate the table. This may be true today, but what about next year when you switch CRM systems? Or what if your organization makes the decision to archive data in the source system that is older than 5 years? If any data needed for your fact or dimension becomes unavailable, your truncate and reload pattern will fail to serve your needs.

Even if you are sure your data will be available, you may want to consider your data load times. If you only have a handful of tables with a few hundred thousand rows max, you may load your data warehouse in a few minutes. But what happens when you have a few hundred million rows with some complicated transformation logic in your pipelines? You might be adding minutes to hours to your load times.

Why Do ETL Developers End Up Truncating Fact Tables?

Sometimes developers just don’t know better. But often there were understandable reasons for using the truncate and reload pattern. While I have never built a system where I truncated a dimension, I have had a couple where we at least started out with truncating the fact table. We usually built it to make data initially available with correct values, and then worked with stakeholders to find a different way to access the data or to have them generate the data differently to alleviate the problems that caused us to want to truncate and reload.

Often the cause is data quality issues. On one project I worked, we had dimension values that defined the granularity of the fact table that would change in the source systems, and the process to try to update them in the fact table was too complicated and took too long. So we made the decision to truncate and reload, understanding the risks of doing so. Having the data available and usable for analysis gave us more information on how useful the data was to users and helped us work to understand why and how our keys were changing. Over time, we were able to influence the way data was entered into source systems so that we didn’t have to go through the truncate and reload process and could perform incremental loads on our fact table.

Sometimes source systems allow hard deletion of data and don’t maintain a list of what was deleted. And depending on access methods and the size of the data, it can be difficult to compare the rows in a fact table with the source data to identify the deleted rows. I would argue this is a poor design of the source system, but we often can’t change that, and we still need to include the data in our data warehouse. So it’s understandable why someone in this situation would want to truncate and reload a fact table.

If your stakeholders are ok with the potential of long load times, empty tables when errors occur, and the assumption that source data will always be available, then there isn’t much problem with truncating and reloading a fact table. But that is often not the case. And that decision should be made explicitly rather than implicitly.

What To Do Instead of Truncating and Reloading a Fact Table

If you’ve been truncating and reloading a fact table because that just seemed like the simplest thing to do, you can change your load pattern.

My general load pattern is:

  1. Truncate the update table
  2. Insert new rows into the fact table and changed rows into the update table
  3. Perform set-based updates on the fact table based upon the data in the update table

I’m able to determine what rows are new, changed, and unchanged by using hash values. I concatenate and hash the values that define the level of uniqueness of the row into one value I call HistoricalHash and the values from the remaining columns into a value I call ChangeHash. In other words, I load a transactional or periodic snapshot fact table in a manner similar to a Type 1 slowly changing dimension.

If you have data quality, data deletion, or other issues that prevent you from using a change detection pattern like the above, consider using a staging table and swapping it out with the fact table. Create a staging table that looks exactly like your fact table.

  1. Truncate the staging table
  2. Populate the staging table with all rows that should be in the fact table
  3. Swap the staging table and the fact table (this usually involves renaming the tables)

Understand the Consequences of Your Design

If you are choosing to truncate a dimension or fact table, be sure that you understand the trade-offs. You may be unwittingly opting for simplicity of ETL over data availability and efficient data loads.

Consulting

So You Want to Be A Business Intelligence Consultant

I’ve had a few people ask me recently for advice on getting into business intelligence/analytics consulting. I have been in consulting for 6+ years and have worked for 3 different consulting firms, so I’m sharing my thoughts and experiences here in case they are helpful to someone else. My first response to people asking this question is usually to clarify what they mean by consulting.

What do you envision when you think about consulting?

Female consultant standing with laptop

Do you want to be an independent consultant, work through a placement agency, or work as a full-time employee (FTE) for a consulting firm?

Life looks a little different based upon your answer to this question. I have chosen to work for a consulting firm for a few reasons:

  • I like the stability of a salary and paid time off (vacation, holidays, sick days)
  • I don’t have to do my own accounting/taxes/collections
  • I’m not solely dependent on my sales efforts to bring in work
  • I like having smart coworkers that are in the trenches with me

If you are independent, you are responsible for everything. You must take care of (do yourself or outsource) your own accounting, legal, taxes, collections, sales, etc. Even if you outsource some of these functions, you need to stay somewhat involved – you can’t just throw your financial documents over the fence and wipe your hands of them.

And if you aren’t doing billable work and getting the client to pay you, you aren’t getting paid. This means you need some financial reserves to cover slow months.

My independent consultant friends tell me that sales takes up much more time than they had initially anticipated. They are always working to line up that next engagement to avoid having large gaps between projects. It takes work to build up a sales pipeline. Sometimes you can mitigate this by teaming up with other independent consultants or companies to have a lead referral agreement. They can refer clients to you when they need help in your area of expertise. But there is usually no guarantee of work from these other parties, and sometimes they take a cut of the revenue for sending the work to you. You can outsource a lot of things, but sales is rather difficult because you need the salesperson to understand what you do, the value it provides to clients, and what your values and preferences are.

Another issue you face as an independent consultant (and sometimes as a consultant for a very small consulting company) is how to get interesting projects that require multiple people. Let’s say you have found a potential client that wants to do some really cool analytics, but the project requires three people to work together. You’ll need to have relationships with other consultants with whom you can team up to work the project. And even if you have good people you can work with in general, the work needs to fit into their schedules .

As an FTE, I am still responsible for some sales efforts, but we also have a salesperson and other employees to help. And if I don’t sell anything, I still get paid. This is a great thing for my mortgage and my stress levels.

But as an independent consultant, you also are your own boss. You get to make all of the decisions about how and where to spend your time, what office and tech supplies to purchase, what your website looks like, how much vacation you can take, and who you want to do work for. Want to take a month off to travel the world or spend time with your family? You don’t need anyone else’s permission to do that. But you won’t be getting paid if you aren’t working. One way to mitigate this is to create content that can be sold so you have some passive income in addition to your normal consulting engagements. An example of this is creating a business intelligence course and selling it on your website or through an online learning/education company. You could continue to get revenue/royalties as people purchase the course. But do your homework to get realistic expectations about how much extra income this would mean relative to the time it takes to create and update the content.

If you want to be a contractor where you do full-time, often 3- to 12-month, engagements with a company, you can get your own work or you can go through placement firms. Some placement firms offer benefits and some guarantee of work in exchange for a share of the income resulting from your placement. Others take a flat fee (either from you or from the hiring company). This type of consulting is usually paid based upon an hourly rate.

Do you want to be a consultant who is more project-based where you have a specified scope and budget, and you may work for multiple clients at one time? Or do you want to be embedded full-time in an organization?

All consultants/contractors must complete (most of) the tasks and projects they take on, but managing scope and budget are special skills. Some people enjoy that challenge while others do their best to avoid it. If you are essentially a contractor who spends your time with a single client and does whatever comes up, you may have to spend less time and effort managing scope and budget. If your work is more project-based, this becomes an important part of your work.

If you do project-based consulting, you may occasionally have to work with multiple clients at once. This means you have to do a lot of context switching. I have had days where I worked on a Power BI sales report makeover for Client A, helped a coworker with some DAX over lunch, and built a data warehouse with product inventory data in Azure SQL DW in the afternoon for Client B. Clients don’t really think about what we do when we aren’t working for them. They are expecting us to mentally pick right back up where we left off when we get back to them and remember all the project and organizational details. I rather enjoy that variety, but I know that it gets stressful and isn’t a good fit for everyone.

Another aspect of project-based work is the amount of time you get to stay on with a client. Some consultants focus on implementations or health checks and then move on. They aren’t there 6 months later to see how things are going with that new system they implemented, unless the client calls them back. There is good and bad in this situation. The consultant may get a lot of experience with building systems/performing health checks in a short amount of time compared to an FTE not in a consulting role. But they may not be around to understand and experience the consequences of their design, unless they client calls them back. You can always work out retainer or support agreements to hedge this a bit, but it is up to your organization and the client to do so.

Project-based work can also mean you have less stability in your life. It’s difficult to anticipate exactly when an engagement will start or end, when you will need to travel, which nights you may need to work late. But project-based work can also offer exciting opportunities, a great variety of technical work, and the opportunity to learn about different companies and industries within a short amount of time.

What would work best for you?

There isn’t a single right answer to these questions. It’s about what would give you the income, benefits, lifestyle, and type of work that you want.

Azure, Azure SQL DW, Data Warehousing, Microsoft Technologies, T-SQL

What You Need to Know About Data Classifications in Azure SQL Data Warehouse

Data classifications in Azure SQL DW entered public preview in March 2019. They allow you to label columns in your data warehouse with their information type and sensitivity level. There are built-in classifications, but you can also add custom classifications. This could be an important feature for auditing your storage and use of sensitive data as well as compliance with data regulations such as GDPR. You can export a report of all labeled columns, and you can see who is querying sensitive columns in your audit logs. The Azure Portal will even recommend classifications based upon your column names and data types. You can add the recommended classifications with a simple click of a button.

You can add data classifications in the Azure Portal or via T-SQL or PowerShell. Data classifications are database objects.

ADD SENSITIVITY CLASSIFICATION TO
    dbo.DimCustomer.Phone
    WITH (LABEL='Confidential', INFORMATION_TYPE='Contact Info')

To view existing data classifications, you can query the sys.sensitivity_classifications view or look in the Azure Portal.

SELECT
sys.all_objects.name as [TableName], 
sys.all_columns.name as [ColumnName],
[Label], 
[Information_Type], 
FROM sys.sensitivity_classifications
left join sys.all_objects on sys.sensitivity_classifications.major_id = sys.all_objects.object_id
left join sys.all_columns on sys.sensitivity_classifications.major_id = sys.all_columns.object_id
    and sys.sensitivity_classifications.minor_id = sys.all_columns.column_id

Be Careful When Loading With CTAS and Rename

One issue that is specific to using data classifications in Azure SQL DW is that it is possible to inadvertantly drop your classifications when you are loading your tables using the recommended T-SQL load pattern. Typically, when using T-SQL to load a dimensional model in Azure SQL DW, we perform the following steps:

  1. Create an upsert table via CTAS with the results of a union of new data from a staging table with existing data from the dimension table
  2. Rename the dimension table to something like Dimension_OLD
  3. Rename the upsert table to Dimension
  4. Drop the Dimension_OLD table
Animation of a table load process in Azure SQL DW


In the animation above, you’ll first see the load process as described, and then it will replay with sensitivity labels added to the dimension table. You’ll see that they are dropped when we drop the old dimension table. This makes sense because sensitivity classifications are objects related to that table. We would expect an index to be dropped when we drop the related table. This works the same way.

Check out my SQL notebook for a demonstration of the issue as well as my workaround that I describe below. If you spin up an Azure SQL Data Warehouse with the sample database, you can run this notebook from Azure Data Studio and see the results for yourself.

There are a few complicating factors:

  • There are currently no visual indicators of sensitivity classifications in SSMS or Azure Data Studio.
  • ETL developers may not have access to the data warehouse in the Azure Portal to see the sensitivity classifications there.
  • The entire process of adding and managing sensitivity classifications may be invisible to an ETL developer. A data modeler or business analyst might be the person adding and managing the sensitivity classifications. If the ETL developer isn’t aware classifications have been added, they won’t know to go and look for them in the sys.sensitivity_classifications view.
  • SSDT does not yet support sensitivity classifications. The only way I have found to add them into the database project is as a post-deployment script with the build property set to none.

The good news is that you can add the sensitivity classifications back to your dimension table using T-SQL. The bad news is still that the ETL developer must remember to do it. My workaround for now is a stored procedure that will do the the rename and drop of the tables plus copy the sensitivity classifications over. My hope is that it it’s easier to remember to use it since it will do the rename and drop for you as well.

Update: Someone asked about the name SwapWithMetadata and why it doesn’t specifically mention sensitivity classifications. I didn’t mention classifications because there are other things that need this same treatment. Dynamic data masking will also need to be reapplied. With dynamic data masking, it will be even more important to add it back immediately after swapping the tables rather than waiting for a full data load of all selected tables to finish and adding all classifications back. If your load takes a long time or the process fails on another table, you don’t want your data exposed without a mask to users who shouldn’t see the full information.

Eventually, the tools will be updated to provide more visibility to data sensitivity classifications, but we still need to make sure they don’t get dropped.

For now, my recommendation is if you are going to go in and add a lot of sensitivity classifications, that you create a user defined restore point immediately after so that you know you have them in a backup somewhere. Azure SQL DW doesn’t do point-in-time restores the way Azure SQL DB does. It takes automatic restore points every 8 hours or so. So if someone went through the trouble of adding the sensitivity classifications and they were dropped through the data load process, there is no guarantee that you could use a backup to get them back.

Vote for My Enhancement Idea

If you would like Microsoft to add something to the product to keep sensitivity classifications from being dropped, or at least make it easier to add them back, please vote for my idea.

Not an Issue with Other Data Load Methods

Please note that if you are using other tools or methods to load your tables where you don’t swap them out, you won’t have the issue of dropping your sensitivity classifications. But I wanted to bring up this issue because I can see people spending a lot of time adding them and then suddenly losing them, and I want everyone to avoid that frustration.

Give Data Classifications a Try

I think data classifications are a good addition to SQL DW. Anything that helps us efficiently catalog and manage our sensitive data is good. I have added them in my demo environment and hope to use them in a client environment soon.

Have you tried out data classifications in SQL DW or DB? What do you think so far? If not, what is keeping you from using them?

Data Visualization, Microsoft Technologies, Power BI

Turning a Corporate Color Palette into a Data Visualization Color Palette

Last week, I had a conversation on twitter about dealing with corporate color palettes that don’t work well for data visualization. Usually, this happens because corporate palettes are designed with websites and/or marketing collateral in mind rather than information graphic design. This often results in colors being too bright, dark, or dull to be used together in a report. Sometimes the colors aren’t easily distinguishable from each other. Other times, the colors needed for various situations (main color, ancillary colors, highlight color, error color, KPIs, text, borders) aren’t available in the corporate palette.

You can still stay on brand and create a consistent user experience with a color palette optimized for data visualization. But you may not be using the exact hex values as defined in the corporate palette. I like to say the data viz color palette is “inspired by” the marketing color palette.

I asked on twitter if anyone had a corporate color palette they needed to convert into a data visualization palette, and someone volunteered theirs. So this post is my walk-through of how I went about creating the palette.

Step 1: Identify a Main Color

There is often a main color in the corporate color palette. If that color is a medium intensity color, I usually include that color in my color palette as is. If it is excessively dark, light, or gray, I’ll either tweak the color a bit or use the second color in the color palette.

Step 2: Choose a Color Scheme

Next, I need to decide what kind of relationship the other colors will have with the main color. In other words, I have to decide what type of color scheme I want to use. I tend to go for monochromatic or analogous color schemes. Complimentary color schemes can be difficult, depending on your main color. I generally try to stay away from using reds and greens together in the same palette because it’s hard to stay colorblind-friendly and because the primary colors together can make it feel like a Christmas or kindergarten theme. I often try to reserve reds and oranges to draw attention to specific data points, but that isn’t a hard and fast rule.

I need 2 – 4 ancillary colors to go with my main color. I rarely need to use all 4 colors together in one chart, but there are some cases such as line charts with 4 series where that will be necessary. People can preattentively distinguish up to about 7 colors at once, so I need to use fewer than 7 colors in a single chart. If I encounter a situation where I feel like I need more than 4 colors together, I re-evaluate my choice of chart type and my use of color within the chart.

Also, I want the colors to be roughly the same level of brightness and intensity. Most importantly, the colors need to be easily distinguishable from each other.

RGB Color Wheel

Step 3: Choose Highlight and Error Colors

We often need to draw attention to specific data points to indicate that they require attention. This is usually because a value is outside of the expected range. KPIs are common in Power BI reports, I need to make sure I have a color to indicate “bad” statuses. I also like to have a highlight color that doesn’t necessarily mean “bad”, just “look here”. These highlight and error colors need to be noticeably different from my other colors so that they draw attention to the data points where they are used.

Step 4: Add Border and Background Colors

I like to add grays and browns to go with my color scheme. I’ll use them mostly for borders, grid lines, text, and light background shades. But also, I want to make sure I have 8 colors in my palette. If I have fewer than 8 colors, Power BI will add colors from the default palette at the end of my colors to fill out the full 8 columns.

Color Palette Creation Example

The original corporate color palette that I was given had a lot of colors.

Primary Corporate Colors
Secondary Corporate Colors

The primary colors go all the way around the color wheel. I definitely don’t want to use them all together. The secondary colors have the beginnings of a monochromatic blue palette, an analogous blue/green palette, or an analogous orange/red/purple palette.

I don’t need all of these different hues. I need 8 medium-intensity colors. Power BI will add black and white and provide the shades and tints for me.

Main Color

I’m keeping the main color as it is. It is bright and saturated enough to not be dull/boring and also not so bold as to leave no room for bolder colors to be used to highlight specific data points.

Chart using the main color of my palette

Color Scheme

I choose an analogous color scheme, which means I pick colors that are next to my main color on the color wheel. Since blue is my main color, I stick with cool colors for the ancillary colors.

Main color plus 3 ancillary colors

I want my 4 colors to be easily distinguishable from each other, and I want them to be roughly the same intensity and brightness.

Highlight and Error Colors

I’m adding yellow and red to my palette. The yellow can be a generic highlight color as well as a “caution” color. The red can be my “bad” color. I’m checking that my colors are easily distinguishable for various types of color vision deficiency.

I confirm that my highlight and error colors are easily distinguishable from the other colors for the most common types of color vision deficiency. I can also see here that my second and fourth colors look a bit similar on the deuteranopia line, so I’ll have to be careful how I use them together, perhaps switching to a shade or tint of the fourth color if needed.

Border and Background Colors

Now I add my grays and browns to use for formatting. This completes my color palette.

Full color palette for Power BI Theme

Power BI Theme

I can take the hex values for my colors and drop them in the color theme generator on PowerBI.Tips to get my JSON theme file.

{
"name":"Analogous",
 "dataColors":["#00aaff", "#00c2b2", "#213dca", "#7514ff", "#ffd500", "#ff002b","#768389","#987665"]
 }

When I import my theme file into my Power BI report, I get the additional tints and shades from the colors I provided.

Next I try out my new color theme in a report to see if I need to tweak any colors. This is the true test. The colors may look great in little boxes, but they might need to be altered to work on a full report page. The shade of purple that I used originally (not shown in this blog post) was a bit too intense compared to the other colors, so I replaced it with a slightly muted tint that better matched the other colors. That is the type of thing you will notice when applying your theme to a report. Don’t get too stuck on finding the exact perfect colors. Colors look slightly different on different screens. Just make sure nothing is inadvertently distracting.

Helpful Color Tools

I’m currently using https://color.mediaandme.be to create my color palettes. It’s free, and it allows me to add many (> 6) colors to my palette. Other benefits:

  • It shows me what all the colors look like together
  • It provides a colorblindness simulator
  • It lets me easily tweak hue, saturation, and brightness
  • It generates a link for the color palette I create so I can easily share it with others for feedback

When I need ideas for how to tweak a color, I use https://www.colorhexa.com. I picked the gray color in my palette by getting the grayest tone of my main color from ColorHexa.

Further Reading on Color in Data Visualization

Tiger Color: Basic Color Schemes – Introduction to Color Theory
Maureen Stone: Expert Color Choices
Billie Gray: Another post about colours for data visualisation. Part 3 — DIY Palettes
Christopher Healy: Perception in Visualization