Azure SQL DB, Azure SQL DW, Data Warehousing, Microsoft Technologies, Python, SQL Server

pandas.DataFrame.drop_duplicates and SQL Server unique constraints

I’ve now helped people with this issue a few times, so I thought I should blog it for anyone else that runs into the “mystery”.

Here’s the scenario: You are using Python, perhaps in Azure Databricks, to manipulate data before inserting it into a SQL Database. Your source data is a flattened data extract and you need to create a unique list of values for an entity found in the data. For example, you have a dataset containing sales for the last month and you want a list of the unique products that have been sold. Then you insert the unique product values into a SQL table with a unique constraint, but you encounter issues on the insert related to unique values.

Given a data frame with one column containing product names, you might write some Python like df.dropDuplicates(subset="ProductName").

This would give you a list of distinct product names. But Python is case sensitive. So if you have product “abc123” and product “ABC123” those are considered distinct.

Case sensitivity in SQL Server

SQL Server is case insensitive by default. While you can use a case sensitive collation, most databases don’t.

Let’s say I create a table to contain products and add a unique constraint on product name. Adding a unique constraint means that I cannot insert two values into that column that are the same.

CREATE TABLE [dbo].[Products] (
    [ProductName] nvarchar(100) NOT NULL
)
GO

ALTER TABLE [dbo].[Products] 
ADD CONSTRAINT UQ_Product_ProductName UNIQUE (ProductName)

I can insert my first row.

Insert into dbo.Products (ProductName)
values ('abc123')

Now I try to insert another row where the letters are capitalized.

Insert into dbo.Products (ProductName)
values ('ABC123')

This fails with the following error:

Violation of UNIQUE KEY constraint 'UQ_Product_ProductName'. Cannot insert duplicate key in object 'dbo.Products'. The duplicate key value is (ABC123).

This is because my database collation is set to SQL_Latin1_General_CP1_CI_AS (the CI means case insensitive).

Potential solution

If you need to get to a list of case insensitive distinct values, you can do the following:

  1. Create a column that contains the values of the ProductName converted to lower case.
  2. Run the dropDuplicates on the lowercase column.
  3. Drop the lowercase column.

You can see code on this Stack Overflow solution as an example (as always, please read it and use at your own risk).

Of course, you need be sure that you don’t need to keep both (differently cased) versions of the value as distinct before you follow the above steps. That’s a business/data decision that must be considered. If you are fine to only have one value regardless of case, this should work. Be careful if you further use this data in Python. pandas.DataFrame.merge (similar to a SQL join) is case sensitive, as are most Python functions. Make sure you are handling your data correctly there, or just do your joins before you deduplicate.

Azure, Microsoft Technologies, SQL Server

Check if File Exists Before Deploying SQL Script to Azure SQL Managed Instance in Azure Release Pipelines

I have been in Azure DevOps pipelines a lot recently, helping clients set up automated releases. Many of my clients are not in a place where automated build and deploy of their SQL databases makes sense, so they deploy using change scripts that are reviewed before deployment.

We chose release pipelines over the YAML pipelines because it was easier to manage for the client and pretty quick to set up. While I had done this before, I had a couple of new challenges:

  • I was deploying to an Azure SQL managed instance that had no public endpoint.
  • There were multiple databases for which there may or may not be a change script to execute in each release.

This took a bit longer than I expected, and I enlisted my coworker Bill to help me work through the final details.

Setup

There are a few pre-requisites to make this work.

Self-hosted Agent

First, you need to run the job on a self-hosted agent that is located on a virtual machine that has access to the managed instance. This requires downloading and installing the agent on the VM. We ended up installing this on the same VM where we have our self-hosted integration runtime for Azure Data Factory, at least for deployment from dev to test. This was ok because the VM had enough resources to support both, and deployments are not long or frequent.

When you create a new self-hosted agent in Azure DevOps, a modal dialog appears, which provides you with a link to download the agent and a command line script to install and configure the agent.

Dialog from Azure DevOps titled "Get the agent". There are options for Windows, macOS, and Linux. There is a button to copy a link to download the agent. Then there is a set of command line scripts to create the agent, and configure the agent.
Dialog showing the instructions to download and install the Azure Pipelines Self-hosted agent

We configured the agent to run as a service, allowing it to log on as Network Service. We then could validate it was running by opening the Services window and looking for Azure Pipelines Agent.

You must add an inbound firewall rule for the VM to allow the traffic from the IP addresses related to dev.azure.com. This can be done in the Azure portal on the Networking settings for the VM.

Then you must install the SqlServer module for PowerShell on the VM. And you must also install sqlpackage on the VM. We also updated the System Path variable to include the path to sqlpackage (which currently gets installed at C:\Program Files\Microsoft SQL Server\160\DAC\bin).

DevOps Permissions

Azure DevOps has a build agent with a service principal for each project. That service principal must have Contribute permissions on the repo in order to follow my steps.

Database Permissions

If you want the Azure DevOps build agent to execute a SQL script, the service principal must have appropriate permissions on the database(s). The service principal will likely need to create objects, execute stored procedures, write data, and read data. But this depends on the content of the change scripts you want to execute.

The Pipeline

In this project we had one pipeline to release some analytics databases and Azure Data Factory assets from dev to test.

Therefore, we included the latest version of the repos containing our database projects and data factory resources as artifacts in our pipeline. While the database objects and data factory resources were related, we decided they weren’t dependent upon each other. It was ok that one might fail and one succeed.

Azure DevOps release pipeline with 2 artifacts (database code and ADF code) and 2 stages (Database deployment and ADF Deployment).

Side note: I love the ADF Tools extension for Azure DevOps from SQL Player. It is currently my preferred way to deploy Azure Data Factory. I find it is a much better experience than ARM template deployment. So that is what I used here in this ADF Deploy job.

Repo Setup

Getting back to databases, we had 3 databases to which we might want to deploy changes. We decided that we would set a standard of having one change script per database. We have an Azure DevOps repo that contains one solution with 3 database projects. These databases are highly related to each other, so the decision was made to keep them all in one repo. I added a folder to the repo called ChangeScripts. ChangeScripts contains two subfolders: Current and Previous. When we are ready to deploy database changes, we add a .sql file with the name of the database into the Current folder. For any deployment, there may be 1, 2, or 3 change scripts (1 for each database that has changes). When the release pipeline runs, the change scripts in the Current folder are executed and the files are renamed and moved into the Previous folder. This is why my DevOps service principal needed Contribute permissions on the repo.

Here are the tasks in my Database Deployment job:

  1. Check if SQL Scripts exist in the expected location.
  2. If a script exists for Database1, execute the script.
  3. If a script exists for Database 2, execute the script.
  4. If a script exists for Database 3, execute the script.
  5. Rename the change scripts to fit the pattern [DatabaseName][yyyyMMddHHmm].sql and move them to the Previous folder.

Configuring the Tasks

The first thing we must do is set the agent job to run using the agent pool containing the self-hosted agent. There is another important setting under Additional options for Allow scripts to access the OAuth token. This allows that service principal to authenticate and execute the SQL scripts, so make sure that is checked.

My first task, called Check for SQL Scripts is a PowerShell task. I used an inline script, but you can also reference a file path that is accessible by the agent. My script looks for the three files and then sets a pipeline variable for as true or false to indicate the presence each file. This variable is used later in the process.

My task is configured with the working directory set to $(System.DefaultWorkingDirectory)/_Databases/ChangeScripts/Current. There are no environment variables or output variables.

Here’s my PowerShell script:

$DB1Script = "Database1.sql"
$DB2Script = "Database2.sql"
$DB3Script = "Database3.sql"

$IsDB1Script = Test-Path -Path $DB1Script -PathType Leaf
$IsDB2Script =  Test-Path -Path $DB2Script -PathType Leaf
$IsDB3Script = Test-Path -Path $DB3Script -PathType Leaf

Write-Host $IsDB1Script  $IsDB2Script $IsDB3Script

Write-Output ("##vso[task.setvariable variable=DB1Script;]$IsDB1Script")

Write-Output ("##vso[task.setvariable variable=DB2Script;]$IsDB2Script")

Write-Output ("##vso[task.setvariable variable=DB3Script;]$IsDB3Script")

We got stuck on the syntax for setting variables for a while. Most of the things we found in Microsoft Docs and on blogs were close, but didn’t work. Finally, we found this blog, which showed us the syntax that worked – be careful with the quotes, semi-colon, and parentheses.

My second task executes the script for Database1. It uses the Azure SQL Database deployment task. When you use this task, you can specify whether it deploys a dacpac or executes a SQL script. We chose to execute a SQL script that we stored in the same repo as the database projects. For this task, you must specify the Azure Subscription, the target server, and the target database. I set my authentication type to Service Principal, since I had granted access to the Azure DevOps service principal during setup. My Deploy type was set to SQL Script File. And my SQL Script location was set to $(System.DefaultWorkingDirectory)/_Databases/ChangeScripts/Current/Database1.sql. An important configuration on this task is to set Run this task to Custom Conditions. I specified my custom condition as eq(variables['DB1Script'], 'True'). This tells the task to execute only if the variable we set in the previous task is equal to True. It seems data types don’t really matter here – I set the variable in the PowerShell script as a boolean, but it was converted to a string in Azure DevOps.

My third and fourth tasks do the same activities for Database2 and Database3.

My last task, called Move Change Scripts to Previous Folder, is another inline PowerShell script. The script uses the variables set in the first task to determine if the file exists and needs to be moved. Because the script is working on a local copy on the VM, it must commit the changes back to the git repo after renaming and moving the file(s). The Working Directory for this task is set to $(System.DefaultWorkingDirectory)/_Databases/ChangeScripts.

Here’s the code:

# Write your PowerShell commands here.

Write-Host "Getting Files"
$path = Get-Location
Write-Host $path


$File1 = "Current/Database1.sql"
$File2 = "Current/Database2.sql"
$File3 = "Current/Database3.sql"

$IsDB1Script = Test-Path -Path $File1 -PathType Leaf
$IsDB2Script = Test-Path -Path $File2 -PathType Leaf
$IsDB3Script = Test-Path -Path $File3 -PathType Leaf

$DateString = Get-Date -Format "yyyyMMddHHmm"

$NewFile1 = "Previous/Database1$($DateString).sql"
$NewFile2 = "Previous/Database2$($DateString).sql"
$NewFile3 = "Previous/Database3$($DateString).sql"


 Write-Host "Changing file locations"
    git config --global user.email "myemail@domain.com"
    git config --global user.name "My Release Pipeline"

   
if($IsDB1Script) 
{
    $File1 |  git mv ($File1) ($NewFile1) 
    Write-Host "Moved Database1 script"
}

if($IsDB2Script)
{
    $File2 |  git mv ($File2) ($NewFile2) 
  Write-Host "Moved Database2 script"
}

if($IsDB3Script)
{
  $File3 |  git mv ($File3) ($NewFile3) 
  Write-Host "Moved Database3 script"
}

git commit -a -m 'Move executed change scripts to Previous Folder'
git push origin HEAD:master

Anything you output from your PowerShell script using Write-Host is saved in the release logs. This is helpful for troubleshooting if something goes wrong.

As you can see above, you can execute git commands from PowerShell. This Stack Overflow question helped me understand the syntax.

And with those five steps, I’m able to execute change scripts for any or all of my three databases. We don’t anticipate adding many more databases, so this configuration is easily manageable for the situation. I wouldn’t do this (one step per database) for a situation in which I have dozens or hundreds of databases that may be created or deleted at any time.

Azure, Azure SQL DB, Microsoft Technologies, SQL Server

New Centralized View of SQL Resources in Azure

Yesterday some new views were made available in the Azure portal that will be helpful to those of us who create or manage Azure SQL resources.

First, a new guided approach to creating resources has been added to the Azure portal. We now have a unified experience to create Azure SQL resources that offers guidance as to the type of Azure SQL resource you need for your use case: SQL database, managed instance, or SQL Server virtual machine. This new Azure SQL blade under Marketplace offers a high-level description of each offering and the scenario that it best serves. If you already know what you want but are having trouble remembering exactly what the resource is called in the marketplace, this can also alleviate that issue.

New guided experience for creating SQL Azure resources in the Azure Portal
New guided experience for creating SQL Azure resources in the Azure Portal

Notice that SQL virtual machine images are a listed offering in the new experience. As Microsoft phrased it, “SQL Server on Azure VMs is now a first-class member of the Azure SQL family.” This blade gives you an easily accessible place to see all the SQL Server VM images without having to search through lots of other unrelated VM images.

A drop-down box listing all of the available SQL Server VM images
A drop-down box listing all of the available SQL Server VM images

Once your Azure SQL resources are created, you can use the new centralized management hub to administer them. Locate the Azure SQL resources blade to see a list of all of your single databases, database servers, elastic pools, managed instances, and virtual machines running SQL.

Centralized management hub for Azure SQL resources
Centralized management hub for Azure SQL resources

This is the foundation for a unified database platform in Azure with more consistency across offerings and more manageability features to come in the future. For more information, read the announcement from Microsoft or watch the new video they posted on Channel 9.

Azure, Microsoft Technologies, PowerShell, SQL Server

Using Azure Automation to Shut Down a VM only if a SQL Agent Job is Not Running

I have a client who uses MDS (Master Data Services) and SSIS (Integration Services) in an Azure VM. Since we only need to execute the SQL Agent job that runs the SSIS packages infrequently, we shut down the VM when it is not in use in order to save costs. We wanted to make sure that the Azure VM did not shut down when a specific SQL Agent job was still running, so I tackled this with some PowerShell runbooks in Azure Automation.

I split the job into two parts. The first runbook simply checks if a specified SQL Agent job is running and returns a text value that indicates whether it is running. A parent runbook checks if the VM is started. If the VM is started, it calls the child runbook to check if the job is running, and then shuts down the VM if the job is not running.

It’s fairly easy and convenient to have nested PowerShell runbooks in Azure Automation. There are two main ways to call a child runbook.

  1. Inline execution
  2. Using the Start-AzureRmAutomationRunbook cmdlet

It was less obvious to me how to call a child runbook when the parent runs in Azure and the child runs on a hybrid worker, especially when you need to use the output from the child runbook in the parent. A hyrid runbook worker allows us to access resources that are behind a VNET or on premises.

Travis Roberts has a nice video on just this topic that gave me the answers I needed.

Below is my parent runbook.

# Ensures you do not inherit an AzureRMContext in your runbook
Disable-AzureRmContextAutosave –Scope Process

$connection = Get-AutomationConnection -Name AzureRunAsConnection
Connect-AzureRmAccount -ServicePrincipal -Tenant $connection.TenantID `
-ApplicationID $connection.ApplicationID -CertificateThumbprint $connection.CertificateThumbprint

$rgName='MyResourceGroup'
$vmName='MyVM'
$SubID = 'XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX'

$AzureContext = Select-AzureRmSubscription -SubscriptionId $SubID
'Check if VM is on'
$vm=((Get-AzureRmVM -ResourceGroupName $rgName -AzureRmContext $AzureContext -Name $vmName -Status).Statuses[1]).Code
 $vm 
 if ($vm -eq 'PowerState/running')
 {
    Do 
    {
        #if VM running call other runbook
        start-sleep -Seconds 60;
        'Check if job is running'
        $JobRunning = start-azureRMautomationrunbook -AutomationAccount 'ProgramsAutomation' -Name 'CheckRunningSQLJob' -ResourceGroupName $rgName -AzureRMContext $AzureContext -Runon 'Backups' -Wait;
        Write-Output $JobRunning
        
        
    } Until ($JobRunning -eq 'run0')
    
    'Stopping VM'
    stop-azurermvm -Name $VMname -ResourceGroupName $RgName -force
}

The runbook sets the Azure context to the appropriate subscription (especially important when you are a guest user in someone else’s tenant). Then it checks if the VM is started. If it is, it goes into a do-while loop. This task isn’t super time sensitive (it’s just to save money when the VM isn’t in use), so it’s waiting 60 seconds and then calling the child runbook to find out if my SQL Agent job is running. This makes sure that the child runbook is called at least once. If the result is that the job is not running, it stops the VM. If the job is running, the loop starts over, waiting 60 seconds before checking again. This loop is essentially polling the job status until it sees that the job is completed. One thing to note is the -Wait parameter on the end of that Start-AzureRmAutomationRunbook command. If you don’t specify the -Wait parameter, the command will immediately return a job object. If you specify the -Wait parameter, it waits for that child job to complete and returns the results of that job.

And here is my child runbook.

[OutputType([string])]

$SQLJobName = 'MySQLAgentJobName'
$SQLInstanceName = 'MySQLServer

$cred=Get-AutomationPSCredential -Name 'mycredential'
 
$server = Connect-DbaInstance -SqlInstance $SQLInstanceName -SqlCredential $cred
 
Get-DbaRunningJob -SqlInstance $server | Get-DbaRunningJob

$JobStatus = (Get-DbaRunningJob -SqlInstance $server).Name -match $SQLJobName

If ($JobStatus -ne $false) 
{
#job is running. Passing back a string because bits and ints were causing issues.
    $JobRunning = 'run1'
    Write-Output $JobRunning 
}
else 
{ 
#job is idle
    $JobRunning = 'run0'
    Write-Output $JobRunning 
}

I’m using dbatools to check if the job is running on the server. That is the Get-DBARunning Job command. The important part to note is that you have to use the Write-Output command for this output to be available to the parent runbook. I got some weird results when I tried to return an int or a boolean (it was returning an object rather than a single value), so I just went with a string. The string, while not the most efficient, works just fine. If you understand why this is, feel free to leave me a comment.

These runbooks have been in place for a couple of months now, and they are working great to shut down the VM to save money while making sure not to disturb an important SQL Agent job that might occasionally run late. I didn’t find much documentation nor many examples of using output from a child job that runs on a hybrid worker, so I wanted to get this published to help others that go searching.

Data Warehousing, Microsoft Technologies, SQL Server, SSIS, T-SQL

Why We Don’t Truncate Dimensions and Facts During a Data Load

Every once in a while, I come across a data warehouse where the data load uses a full truncate and reload pattern to populate a fact or dimension. While it may not be the end of the world for a small table, it does concern me and I usually recommend to redesign the load. My thoughts below on why this is an anti-pattern are true for using the actual TRUNCATE TABLE statement as well as executing a DELETE statement with no WHERE clause.

Surrogate Keys

Dimensional models use surrogate keys rather than natural keys. Surrogate keys are system-generated, meaningless values that are usually integers used to uniquely identify a record. They provide good performance for joins in queries, allow us to switch or use multiple source systems to feed the same tables, and facilitate the use of slowly changing dimensions. If you truncate a dimension table and then repopulate it, you will end up with different surrogate keys assigned to your dimension values. Let’s say we have a Geography dimension that looks like the table below.

GeographyKeyCityStateProvinceCountryPostalCode
1DenverColoradoUnited States80205
2DallasTexasUnited States75201
3ParisÎle-de-FranceFrance75001

If I truncate and reload this table, there is no guarantee that Denver will have a GeographyKey value of 1. I might reload the data and then have Paris as 1, Denver as 2, and Dallas as 3. And since we use surrogate keys as dimension lookups in fact tables and bridge tables (and snowflaked dimensions, if you use those), I would now need to update every table that references this dimension. That’s a lot of table updates without a good reason.

Error Handling

Another reason to avoid truncating the tables in your dimensional model is error handling. When you design your data load, you need to think about what should happen when it fails and where it is most likely to fail. Failures will often occur either in data transformation steps or upon inserting/updating values in the destination table due to data type/size conflicts.

Let’s say you truncate your fact table and then you encounter an ETL failure while performing the transformations to reload the table. Now you have an empty fact table. If you have error handling logic in your ETL, you may be able to redirect the error rows to another location to be handled later. But that still leaves you with an incomplete table at best. If instead of truncating and reloading, you were doing inserts and updates when you encountered an error, your table might reflect slightly stale data as of the previous load. You still might have a situation where your table is partially updated with the rows that were inserted before the error was encountered depending on how your ETL design. But having a fact table with old or partially updated data is usually (but not always) more preferable than having no data when a data load process fails.

Performance and Data Availability

When you truncate and reload a table, you are assuming you have access to all of the source data needed to fully repopulate the table. This may be true today, but what about next year when you switch CRM systems? Or what if your organization makes the decision to archive data in the source system that is older than 5 years? If any data needed for your fact or dimension becomes unavailable, your truncate and reload pattern will fail to serve your needs.

Even if you are sure your data will be available, you may want to consider your data load times. If you only have a handful of tables with a few hundred thousand rows max, you may load your data warehouse in a few minutes. But what happens when you have a few hundred million rows with some complicated transformation logic in your pipelines? You might be adding minutes to hours to your load times.

Why Do ETL Developers End Up Truncating Fact Tables?

Sometimes developers just don’t know better. But often there were understandable reasons for using the truncate and reload pattern. While I have never built a system where I truncated a dimension, I have had a couple where we at least started out with truncating the fact table. We usually built it to make data initially available with correct values, and then worked with stakeholders to find a different way to access the data or to have them generate the data differently to alleviate the problems that caused us to want to truncate and reload.

Often the cause is data quality issues. On one project I worked, we had dimension values that defined the granularity of the fact table that would change in the source systems, and the process to try to update them in the fact table was too complicated and took too long. So we made the decision to truncate and reload, understanding the risks of doing so. Having the data available and usable for analysis gave us more information on how useful the data was to users and helped us work to understand why and how our keys were changing. Over time, we were able to influence the way data was entered into source systems so that we didn’t have to go through the truncate and reload process and could perform incremental loads on our fact table.

Sometimes source systems allow hard deletion of data and don’t maintain a list of what was deleted. And depending on access methods and the size of the data, it can be difficult to compare the rows in a fact table with the source data to identify the deleted rows. I would argue this is a poor design of the source system, but we often can’t change that, and we still need to include the data in our data warehouse. So it’s understandable why someone in this situation would want to truncate and reload a fact table.

If your stakeholders are ok with the potential of long load times, empty tables when errors occur, and the assumption that source data will always be available, then there isn’t much problem with truncating and reloading a fact table. But that is often not the case. And that decision should be made explicitly rather than implicitly.

What To Do Instead of Truncating and Reloading a Fact Table

If you’ve been truncating and reloading a fact table because that just seemed like the simplest thing to do, you can change your load pattern.

My general load pattern is:

  1. Truncate the update table
  2. Insert new rows into the fact table and changed rows into the update table
  3. Perform set-based updates on the fact table based upon the data in the update table

I’m able to determine what rows are new, changed, and unchanged by using hash values. I concatenate and hash the values that define the level of uniqueness of the row into one value I call HistoricalHash and the values from the remaining columns into a value I call ChangeHash. In other words, I load a transactional or periodic snapshot fact table in a manner similar to a Type 1 slowly changing dimension.

If you have data quality, data deletion, or other issues that prevent you from using a change detection pattern like the above, consider using a staging table and swapping it out with the fact table. Create a staging table that looks exactly like your fact table.

  1. Truncate the staging table
  2. Populate the staging table with all rows that should be in the fact table
  3. Swap the staging table and the fact table (this usually involves renaming the tables)

Understand the Consequences of Your Design

If you are choosing to truncate a dimension or fact table, be sure that you understand the trade-offs. You may be unwittingly opting for simplicity of ETL over data availability and efficient data loads.

Microsoft Technologies, SQL Server

Insufficient Disk Space (T-SQL Tuesday #88)

TSQLTuesdayThis month’s T-SQL Tuesday – hosted by Kennie T Pontoppidan(@KennieNP) – is called “The daily (database-related) WTF“. He asked us to be inspired by the IT horror stories from http://thedailywtf.com, and tell our own daily WTF story.

Years ago in a previous job, I worked at a company that had no DBAs. I am/was a BI developer, so I know my way around a database, but I wasn’t dedicated to keeping all databases in good health. There were several application developers at this company (mostly focused on .NET and Javascript) who built applications with SQL Server databases as the back end. And there was a guy who acted as a system admin among his many other duties. The application developers had built a web app that was to be used by users around the world. The application had been launched and things were fine for several weeks. I wasn’t involved with the project, but I was aware of it.

One day, a manager asked me if I could help on an urgent matter: the application suddenly could no longer execute transactions on the production database and the database connection was intermittently failing. The system admin was busy with other duties, so I was the closest thing they had to a DBA.  All they could tell me was the production database had crashed and they got an error message about insufficient disk space.

I logged on to the server that housed the database to see what was going on. The server itself had been set up appropriately and seemed to have sufficient memory and CPU to support the load of this application. I saw 3 volumes on the server: a C volume for application and system files, a large F volume for data, and a large G volume for logs.

I connected to the database with Management Studio to do some more digging. The first thing I noticed is that the dev, test, and prod databases for this application were all on the same SQL Server instance. The dev and test databases weren’t very large, so while that wasn’t what I would have recommended, that didn’t seem to be the main problem. As I looked at the prod database, I noticed that the MDF and LDF files were sitting on the C volume rather than the spacious F volume that was made for them! The person who configured the server hadn’t made the C volume very large since user databases weren’t supposed to be there.

Then I looked at the size of the log file. It was huge! A bit more digging revealed that they had left all the defaults on the database for full recovery and autogrowth of the log file, but they had never done a transaction log backup. (Sidenote: You can check the Log_Reuse_Wait_Desc column in sys.databases to verify the database is waiting on a transaction log backup.) The developers had worked long and hard to get the application up and running and hadn’t quite finished up the maintenance and disaster recovery tasks.

Once I knew what I was dealing with, I was able to fix the problem. A full backup and a log backup later we were back in business. I went ahead and shrunk the log file back to a reasonable size (please remember this is reserved for special occasions). I took the database offline (which was acceptable since the application was currently unusable anyway), moved the MDF and LDF files to their rightful home, and brought it back online. A lesson on recovery models and setting up SQL Agent jobs that scheduled such backups ensured this didn’t happen again anytime soon.

This should be a good reminder to have a healthy respect and understanding for your database settings and to make sure you have (and test) your backups (both full and transaction logs) for your production databases.

Azure, Microsoft Technologies, SQL Server

PolyBase Is A Picky Eater – Remove Carriage Returns Before Ingesting Text

Update: As Gerhard points out in the comments, switching to ORC files solves this issue nicely. It’s not human readable, but it is much less error-prone when reading in data. 

I’ve spent the last few weeks working on a project that used PolyBase to load data from Azure Blob Storage into Azure SQL Data Warehouse. While it’s been a great experience, I must note that PolyBase is a picky eater.

4184268 - finicky or picky bulldog pouting beside full bowl of dog food
PolyBase is a picky eater, just like this bulldog.

Creating the external tables is fairly simple if you are already familiar with T-SQL. The syntax for creating external data sources, file formats, and tables is fairly well documented on MSDN. It’s important to be aware of features and data types that aren’t supported, but I found it to be pretty smooth sailing beyond that.

But just because you have successfully created the external tables does not mean you are finished. That is when the “fun” begins. If you would like more information on why “fun” is in quotes, read Grant Fritchey’s blog post on Loading Data into Azure SQL Data Warehouse.

You should test after populating any table in SQL Server, but I think this is especially true with external tables. More than likely you will find that you must resolve several issues with source file contents and external table definitions.

First let me say that PolyBase is cool. I can query data in text files and join to tables in my database. Next let me say PolyBase is a fairly young technology and has some limitations that I imagine will be improved in later versions.

One of those limitations (as of July 30, 2016) is that while you can declare your field delimiter and a string delimiter in external file formats, the row delimiter is not user configurable and there is no way to escape or ignore the row delimiter characters (\r, \n, or \r\n) inside of a string. So if you have a string that contains the row delimiter, PolyBase will interpret it as the end of the row even if it is placed inside of the string delimiters.

To elaborate further, I had several fields that originally came from a SQL Server table and were of type text. Some of the values in these fields contained newlines (\n) as users had typed paragraphs and addresses into the fields in the source application. The data from the source tables was exported from SQL Server to Azure Blob Storage using Azure Data Factory with a simple copy pipeline with no modifications to the data. The problem is that Hive, PolyBase, and several other tools have issues reading strings with newlines the value. They immediately interpret it as the end of the row. There is no escape character or setting you can use to allow newlines in the values.

If you find yourself in a similar situation, trying to load data from delimited files into Azure SQL DW and realizing you have newlines inside of string fields, there are two things you can do.

  1. Fix the data in the flat files so it doesn’t contain new lines in string fields.
  2. Switch to a different tool to load data to Azure DW. Azure Data Factory can take the data from blob storage and import it into a normal table in Azure DW.

In most circumstances, I would go for option #1. Option #2 only fixes things in Azure DW, leaving other tools in the environment to deal with the issue separately, and it requires storing a copy of the data in the DW.

In my project, I changed the ADF pipelines to replace newlines with an acceptable character/set of characters that doesn’t often appear in my data set and doesn’t obscure the values. We chose to replace them with 4 spaces. It’s important to understand that this means that your data in your blob storage will no longer exactly match its source. This is something you will want to document because it will surely pop up somewhere in the future.

Updating the ADF pipelines is not much effort. If my table definition is

CREATE TABLE TableA (
    Column1 int,
    Column2 varchar(10),
    Column3 text
)

and my original pipeline sqlReaderQuery was SELECT * FROM TableA, I just need to enumerate my fields, convert the text fields to varchar, and replace the new lines.

SELECT
    [Column1],
    [Column2],
    [Column3] = REPLACE(
                  REPLACE(
                    CAST([Column3] AS varchar(8000)), 
                      CHAR(13), '    '), 
                      CHAR(10), '    ')
FROM TableA

If you are using time slices in ADF and you have your query inside of the Text.Format() function, you will find that ADF doesn’t allow the single quotes around the four spaces in your JSON. You can instead use CHAR(32) instead of a space. If you have a better way of accomplishing this, please leave me a note in the comments.

SELECT
    [Column1],
    [Column2],
    [Column3] = REPLACE(
                  REPLACE(
                    CAST([Column3] AS varchar(8000)), 
                      CHAR(13), CHAR(32) + CHAR(32) + CHAR(32) + CHAR(32)), 
                      CHAR(10), CHAR(32) + CHAR(32) + CHAR(32) + CHAR(32))
FROM TableA

In addition to updating the ADF pipelines, I also had to replace the newlines in my existing files in blob storage. Since there weren’t many of them, I just opened them up in Notepad++ and did a find & replace. If there had been more files, I would have looked into a more automated solution.

If the ability to allow field/row terminators within string fields is something you would like to see in the PolyBase, please voice your opinion by casting a vote on the feedback site.

BIDS Helper, Biml, Data Warehousing, Microsoft Technologies, SQL Server, SSIS

Type 6 or Hybrid Type 2 Slowly Changing Dimension with Biml

In my previous post, I provided the design pattern and Biml for a pure Type 2 Slowly Changing Dimension (SCD). When I say “pure Type 2 SCD”, I mean an ETL process that adds a new row for a change in any field in the dimension and never updates a dimension attribute without creating a new row.  In practice, I tend to create more hybrid Type 2 SCDs where updates to some attributes require a new row and others update the value on the existing rows. A similar pattern that I find I implement more often than a pure Type 2 is a Type 6 SCD. A Type 6 SCD builds on the Type 2 technique by adding current attributes alongside the historical attributes so related measures can be grouped by the historical or current dimension attribute values. The only difference between what I call a hybrid Type 2 and a Type 6 is that in the Type 6, there are no Type 1 attributes in the dimension that do not also have a Type 2 version in the dimension to capture the historical values.

Design Pattern

Control Flow

If you are comfortable with my design pattern for a pure Type 2 SCD in which a change of value in any column causes a new row, this pattern is quite similar. And the control flow is exactly the same. This pattern, as with my pure Type 2, assumes that rows are not deleted from the source system. You could easily modify this to check for deleted rows if needed.

Control Flow for a Hybrid Type 2 or Type 6 Dimension
Control Flow for a Hybrid Type 2 or Type 6 Dimension

The steps in the control flow are:

  1. Log package start.
  2. Truncate the update table.
  3. Execute the data flow.
  4. Execute the update statements to update columns and insert new rows.
  5. Log package end.

The update statements are different in this pattern, and I’ll explain those in detail below.

Data Flow

The data flow looks like a pure Type 2 SCD, with the exception of an added derived column transformation and minor changes to the lookup and conditional split. Again, I use views to integrate the data, apply business logic, and add hashkeys for change detection. Then I use SSIS to perform the mechanics of loading the data.

The steps in this data flow task are:

  1. Retrieve data from my source view.
  2. Count the rows for package logging purposes.
  3. Perform a lookup to see if the entity already exists in the dimension table.
  4. If the entity doesn’t exist at all in the dimension table, it goes into the left path where I count the number of rows, add a derived column that sets the row start date to “01/01/1900 00:00:00”, and then insert the row into the dimension table.
  5. If the entity does exist in the table, I check it for changes.
  6. If there are changes to the entity, I count the number of rows, us a derived column to flag the type(s) of changes to make, and then insert the row into an update table.
  7. Entities with no changes are simply counted for audit purposes.

The Source View

This SSIS pattern requires 3 hashed values for for change detection:

  • HistoricalHashKey: the unique identifier of the entity, the natural key that ties the historical rows together
  • ChangeHashKey: the columns on the dimension that cause a new row to be created and the current row to be expired
  • UpdateHashKey: the columns on the dimension that should be updated in place

In my example view below, the Route ID and Warehouse identify a unique route.  The supervisor, route description and route type are all important attributes of the route.  The route area identifies the metro area in which a route is located. If this should change, there is no need to see previous values; we just want to see the current value on every row.

 CREATE View [dbo].[StgDimRoute] as   
    with [Routebase] as   
    (   SELECT [RouteID]  
    , [RouteDescription]  
    , [Supervisor]  
    , [RouteType]  
    , [Warehouse]  
    , [RouteArea]  
     , 1 as RowIsCurrent     
  FROM Stg.Routes R  
  Left JOIN dbo.StgWarehouse W ON W.WarehouseID = R.WarehouseID  
  UNION   
  select -1 as RouteID, 'Unknown' as RouteDescription, 'Unknown' as Supervisor,   
  'Unknown' as RouteType, 'Unknown' as Warehouse, 'Unknown' as RouteArea, 
  1 as RowIsCurrent 
      ) ,  
 Routedata as   
 (  
 Select RouteID, RouteDescription, Supervisor, RouteType, Warehouse, 
   RouteArea, RowIsCurrent 
  , CONVERT(VARBINARY(20), HASHBYTES('MD5', CONCAT([RouteID], ' ',Warehouse ))) 
      AS HistoricalHashKey  
  , CONVERT(VARBINARY(20), HASHBYTES('MD5', CONCAT(Supervisor, [RouteDescription], 
      RouteType))) AS ChangeHashKey  
  , Convert(VARBINARY(20), HASHBYTES('MD5', CONCAT(Area,' '))) AS UpdateHashKey  
  from Routebase sb  
  )   
 Select RouteID, RouteDescription, Supervisor, RouteType, Warehouse,  
  RouteArea, RowIsCurrent  
  , HistoricalHashKey, ChangeHashKey, UpdateHashKey  
  , CONVERT(VARCHAR(34), HistoricalHashKey, 1) AS HistoricalHashKeyASCII  
  , CONVERT(VARCHAR(34), ChangeHashKey, 1) AS ChangeHashKeyASCII  
  , CONVERT(VARCHAR(34), UpdateHashKey, 1) AS UpdateHashKeyASCII  
  , dateadd(MINUTE, -1, current_timestamp) as RowEndDate
  , CURRENT_TIMESTAMP as RowStartDate   
  from Routedata ds  

The RowEndDate value in this view will be used for routes that require a current row to be expired since my pattern is the leave the end date of the current row null.

The Change Detection Lookup

The lookup in my DFT compares the HistoricalHashKeyASCII column from the source view with the varchar version of the HistoricalHashKey from the dimension table and adds two lookup columns: lkp_ChangeHashKeyASCII and lkp_UpdateHashKeyASCII to the data set.

Type 6 historical hash lookup

Rows that do not match are new rows; i.e., that route has never been in the dimension table before.  Rows that do match have a row in the dimension table and will then be evaluated to see if there are any changes in the values for that route.

Derived Column for New Rows

The no match output of the lookup are new rows for routes that are not in the dimension table. Since this is the first row in the table for that route, we want this row to be effective from the beginning of time until the end of time. The beginning of time in this data mart is 01/01/1900. Since the data is loaded multiple times per day, I use a date/time field rather than a date. If you only need the precision of a day, you can cut your row end date/time back to just a date. In my pattern, the current row has a null row end date, but you could easily add a derived column to set the end date to 12/31/9999 if you prefer.

Derived Column for Start Date

Conditional Split for Change Detection

This time, I have to check to see if both the ChangeHashKeyASCII and the UpdateHashKeyASCII match in my conditional split.

Type 6 CSPL

If both hashed values from the source view match the hashed values from the dimension table, then no changes are required and the row is simply counted for audit purposes.

If either hashed value doesn’t match, there is an update to be made.

Derived Column to Identify Change Types

We again compare the UpdateHashKeyASCII value from the source view with that of the dimension. If they don’t match, we set the UpdateInPlace flag to true. If the ChangeHashKeyASCII values don’t match, we set the UpdateNewRow flag to true. If a row has both types of changes, both types of updates will be made.

My update table contains the UpdateInPlace and UpdateNewRow columns, so I can reference these flags in my update statements.

The Update Statements

The update statements in the control flow take the changes from the update table and apply them to the dimension table. Three statements are executed in the Execute SQL Statement labeled SQL Update DimRoute.

 UPDATE A   
 SET  A.RouteArea = UA.RouteArea,   
      A.UpdateHashKey = UA.UpdateHashKey,   
      A.UpdtDtTm = CURRENT_TIMESTAMP   
 FROM  Updt.UpdtRoute AS UA   
     JOIN Dbo.DimRoute AS A   
      ON UA.HistoricalHashKey = A.HistoricalHashKey   
 WHERE UpdateInPlace = 1; 
  
UPDATE A   
 SET  RowIsCurrent = 0,   
     A.RowEndDate = UA.RowEndDate,   
     A.UpdtDtTm = CURRENT_TIMESTAMP   
 FROM  Updt.UpdtRoute UA   
     JOIN Dbo.DimRoute AS A   
      ON UA.HistoricalHashKey = A.HistoricalHashKey   
 WHERE A.RowIsCurrent = 1   
     AND UA.UpdateNewRow = 1;   

 INSERT INTO Dbo.DimRoute   
       (RouteID,   
        RouteDescription,   
        Supervisor,   
        RouteType,   
        Warehouse,   
        RouteArea,   
        RowIsCurrent,   
        RowStartDate,   
        RowEndDate,   
        HistoricalHashKey,   
        ChangeHashKey,   
        UpdateHashKey)   
 SELECT RouteID,   
        RouteDescription,   
        Supervisor,   
        RouteType,   
        Warehouse,   
        RouteArea,   
        1,   
        RowStartDate,   
        null,   
        HistoricalHashKey,   
        ChangeHashKey,   
        UpdateHashKey
FROM Updt.UpdtRoute WHERE UpdateNewRow = 1; 

The first statement updates the values for the columns designated to be updated in place by joining the update table to the dimension table based on the HistoricalHashKey column. This is the same as performing updates in a Type 1 SCD.

The second statement expires all the rows for which a new row will be added. The third statement inserts the new rows with the RowIsCurrent value set to 1 and the RowEndDate set to null.

The Biml

If you are using Biml, you know that you can create a design pattern for this type of dimension load and reuse it across multiple projects. This speeds up development and ensures that your Type 2 Hybrid or Type 6 dimensions are implemented consistently.

As usual, I have 3 Biml files that are used to create the SSIS package:

  • ProjectConnections.biml – contains all the project-level connections for the project
  • Dim2Hybrid.biml – contains the SSIS design pattern with code nuggets that parameterize it to make it reusable
  • CreateDim2HybridPackages.biml – calls Dim2Hybrid.biml and passes along the values to be used for each package

I’ve pasted Dim2Hybrid and CreateDim2Hybrid below.

Dim2Hybrid.Biml

 <#@ template language="C#" tier="2" #>  
 <#@ property name="PackageName" type="String" #>  
 <#@ property name="DstSchemaName" type="String" #>  
 <#@ property name="DstTableName" type="String" #>  
 <#@ property name="DstConnection" type="String" #>  
 <#@ property name="DataFlowSourceName" type="String" #>  
 <#@ property name="SrcConnection" type="String" #>  
 <#@ property name="SourceQuery" type="String" #>     
 <#@ property name="UpdateSchemaName" type="String" #>     
 <#@ property name="UpdateTableName" type="String" #>     
 <#@ property name="UpdateConnection" type="String" #>        
 <#@ property name="UpdateSQLStatement" type="String" #>      
     <Package Name="<#=PackageName#>" Language="None">  
   <Parameters>  
     <Parameter DataType="String" Name="ParentPackageID">00000000-0000-0000-0000-000000000000</Parameter>  
   </Parameters>  
   <Variables>  
     <Variable EvaluateAsExpression="true" DataType="String" IncludeInDebugDump="Exclude" Name="QualifiedTableSchema">"[" + @[User::SchemaName] + "].[" + @[User::TableName] + "]"</Variable>  
     <Variable DataType="String" IncludeInDebugDump="Exclude" Name="QueryAuditStart">EXECUTE [dbo].[PackageControlStart] @PackageName = ?, @PackageId = ?, @ParentPackageId = ?, @ExecutionId = ?, @StartTime = ?;</Variable>  
     <Variable DataType="String" IncludeInDebugDump="Exclude" Name="QueryAuditUpdate">EXECUTE [dbo].[PackageControlStop] @PackageId = ?, @ExecutionId = ?, @InsertRowQuantity = ?, @UpdateRowQuantity = ?, @UnchangedRowQuantity=?;</Variable>  
     <Variable DataType="Int32" IncludeInDebugDump="Include" Name="RowCountChanged">0</Variable>  
     <Variable DataType="Int32" IncludeInDebugDump="Include" Name="RowCountNew">0</Variable>  
     <Variable DataType="Int32" IncludeInDebugDump="Include" Name="RowCountSource">0</Variable>  
     <Variable DataType="Int32" IncludeInDebugDump="Include" Name="RowCountUnchanged">0</Variable>   
     <Variable DataType="String" IncludeInDebugDump="Exclude" Name="SchemaName"><#=DstSchemaName#></Variable>  
     <Variable DataType="String" IncludeInDebugDump="Exclude" Name="TableName"><#=DstTableName#></Variable>  
   </Variables>  
   <Tasks>    
     <Dataflow Name="DFT_Insert<#=DstTableName#>">  
       <Transformations>  
         <RowCount Name="CNT_Changed_Rows" VariableName="User.RowCountChanged">  
           <InputPath OutputPathName="CSPL Check For Changes.ChangedRows" />  
     </RowCount>  
         <ConditionalSplit Name="CSPL Check For Changes">  
           <InputPath OutputPathName="LKP Historical Key.Match" />  
           <OutputPaths>  
             <OutputPath Name="ChangedRows">  
               <Expression>(ChangeHashKeyASCII != lkp_ChangeHashKeyASCII) || (UpdateHashKeyASCII != lkp_UpdateHashKeyASCII)</Expression>  
     </OutputPath>  
   </OutputPaths>  
     </ConditionalSplit>  
         <RowCount Name="CNT_New_Rows" VariableName="User.RowCountNew">  
           <InputPath OutputPathName="LKP Historical Key.NoMatch" />  
     </RowCount>  
       <DerivedColumns Name="Identify Change Types">  
           <InputPath SsisName="Derived Column Input" OutputPathName="CNT_Changed_Rows.Output" />  
           <Columns>  
             <Column Name="UpdateInPlace" DataType="Boolean">UpdateHashKeyASCII != lkp_UpdateHashKeyASCII ? (DT_BOOL)1 : (DT_BOOL)0</Column>  
             <Column Name="UpdateNewRow" DataType="Boolean">ChangeHashKeyASCII != lkp_ChangeHashKeyASCII ? (DT_BOOL)1 : (DT_BOOL)0</Column>  
   </Columns>  
     </DerivedColumns>  
         <Lookup Name="LKP Historical Key" NoMatchBehavior="RedirectRowsToNoMatchOutput" OleDbConnectionName="<#=DstConnection#>">  
             <DirectInput>SELECT  
   CONVERT(VARCHAR(34), ChangeHashKey, 1) AS ChangeHashKeyASCII  
 , CONVERT(VARCHAR(34), HistoricalHashKey, 1) AS HistoricalHashKeyASCII  
  ,CONVERT(VARCHAR(34), UpdateHashKey, 1) as UpdateHashkeyAscii  
 FROM  
  <#=DstSchemaName#>.<#=DstTableName#>  
     where rowiscurrent = 1  
  </DirectInput> 
           <Parameters>  
             <Parameter SourceColumn="HistoricalHashKeyASCII" />  
   </Parameters>  
            <ParameterizedQuery>select * from (SELECT  
  CONVERT(VARCHAR(34), ChangeHashKey, 1) AS ChangeHashKeyASCII  
 , CONVERT(VARCHAR(34), HistoricalHashKey, 1) AS HistoricalHashKeyASCII  
  ,CONVERT(VARCHAR(34), UpdateHashKey, 1) as UpdateHashkeyAscii  
 FROM  
  <#=DstSchemaName#>.<#=DstTableName#>  
     where rowiscurrent = 1) [refTable]  
 where [refTable].[HistoricalHashKeyASCII] = ?</ParameterizedQuery>  
            <InputPath OutputPathName="CNT_Source_Rows.Output" />  
           <Inputs>  
             <Column SourceColumn="HistoricalHashKeyASCII" TargetColumn="HistoricalHashKeyASCII" />  
   </Inputs>  
           <Outputs>  
             <Column SourceColumn="ChangeHashKeyASCII" TargetColumn="lkp_ChangeHashKeyASCII" />  
                   <Column SourceColumn="UpdateHashkeyAscii" TargetColumn="lkp_UpdateHashKeyASCII" />  
   </Outputs>  
     </Lookup>  
         <DerivedColumns Name="NewItemRowStartDate">  
        <InputPath OutputPathName="CNT_New_Rows.Output" />  
          <Columns>  
         <Column ReplaceExisting="true" Name="RowStartDate" DataType="DateTime">  
          (DT_DBTIMESTAMP)"01/01/1900 00:00:00"  
         </Column>  
        </Columns>  
       </DerivedColumns>  
            <OleDbDestination Name="OLE_DST_New_Rows" ConnectionName="<#=DstConnection#>">  
           <InputPath OutputPathName="NewItemRowStartDate.Output" />  
            <ExternalTableOutput Table="<#=DstSchemaName#>.<#=DstTableName#>" />   
                <Columns>  
                   <Column SourceColumn="RowEndDate" IsUsed="false"/>  
                </Columns>  
     </OleDbDestination>  
         <RowCount Name="CNT_Source_Rows" VariableName="User.RowCountSource">  
           <InputPath OutputPathName="<#=DataFlowSourceName#>.Output" />  
     </RowCount>  
         <OleDbSource Name="<#=DataFlowSourceName#>" ConnectionName="<#=SrcConnection#>"> 
                <DirectInput><#=SourceQuery#></DirectInput>  
     </OleDbSource>  
         <RowCount Name="CNT_Unchanged_Rows" VariableName="User.RowCountUnchanged">  
           <InputPath OutputPathName="CSPL Check For Changes.Default" />  
     </RowCount>  
         <OleDbDestination Name="OLE_DST Update Table" ConnectionName="<#=DstConnection#>">   
           <InputPath OutputPathName="Identify Change Types.Output" />  
            <ExternalTableOutput Table="[<#=UpdateSchemaName#>].[<#=UpdateTableName#>]" />  
     </OleDbDestination>  
   </Transformations>  
        <PrecedenceConstraints>  
           <Inputs>  
              <Input OutputPathName="SQL Truncate <#=UpdateTableName#>.Output" />   
           </Inputs>  
        </PrecedenceConstraints>  
     </Dataflow>  
     <ExecuteSQL Name="SQL Begin Audit" ConnectionName="Audit">  
       <VariableInput VariableName="User.QueryAuditStart" />  
       <Parameters>  
         <Parameter Name="0" VariableName="System.PackageName" DataType="String" Length="-1" />  
         <Parameter Name="1" VariableName="System.PackageID" DataType="Guid" Length="-1" />  
         <Parameter Name="2" VariableName="ParentPackageID" DataType="Guid" Length="-1" />  
         <Parameter Name="3" VariableName="System.ServerExecutionID" DataType="Int64" Length="-1" />  
         <Parameter Name="4" VariableName="System.StartTime" DataType="Date" Length="-1" />  
   </Parameters>  
     </ExecuteSQL>  
     <ExecuteSQL Name="SQL Close Audit" ConnectionName="Audit">  
       <VariableInput VariableName="User.QueryAuditUpdate" />  
       <Parameters>  
         <Parameter Name="0" VariableName="System.PackageID" DataType="Guid" Length="-1" />  
         <Parameter Name="1" VariableName="System.ServerExecutionID" DataType="Int64" Length="-1" />  
         <Parameter Name="2" VariableName="User.RowCountNew" DataType="Int32" Length="-1" />  
         <Parameter Name="3" VariableName="User.RowCountChanged" DataType="Int32" Length="-1" />  
         <Parameter Name="4" VariableName="User.RowCountUnchanged" DataType="Int32" Length="-1" />  
   </Parameters>  
        <PrecedenceConstraints>  
           <Inputs>  
              <Input OutputPathName="SQL Update <#=DstTableName#>.Output" />  
           </Inputs>  
        </PrecedenceConstraints>  
     </ExecuteSQL>    
     <ExecuteSQL Name="SQL Truncate <#=UpdateTableName#>" ConnectionName="<#=UpdateConnection#>">  
       <DirectInput>Truncate Table [<#=UpdateSchemaName#>].[<#=UpdateTableName#>]</DirectInput>  
        <PrecedenceConstraints>  
           <Inputs>  
              <Input OutputPathName="SQL Begin Audit.Output" />  
           </Inputs>  
        </PrecedenceConstraints>  
     </ExecuteSQL>  
     <ExecuteSQL Name="SQL Update <#=DstTableName#>" ConnectionName="<#=DstConnection#>">  
          <DirectInput><#=UpdateSQLStatement#></DirectInput>  
        <PrecedenceConstraints>  
           <Inputs>  
              <Input OutputPathName="DFT_Insert<#=DstTableName#>.Output" /> 
           </Inputs>  
        </PrecedenceConstraints>  
     </ExecuteSQL>  
   </Tasks>  
 </Package>  

Create Dim2Hybrid.Biml

 

 <#@ template language="C#" hostspecific="true" #>  
 <Biml xmlns="http://schemas.varigence.com/biml.xsd">  
 <#@ include file="ProjectConnection.biml" #>  
 <!--  
    <#  
    string PackageName = "LoadDimRoute";  
    string DstSchemaName = "dbo";  
   string DstTableName = "DimRoute";  
    string DstConnection = "AWBIML";     
    string DataFlowSourceName = "OLE_SRC_StgDimRoute";  
    string SrcConnection = "AWBIML";  
    string SourceQuery = @"Select RouteID, RouteDescription, Supervisor, RouteType, Warehouse,   
   RouteArea, RowIsCurrent   
  , HistoricalHashKey, ChangeHashKey, UpdateHashKey   
  , HistoricalHashKeyASCII   
  , ChangeHashKeyASCII   
  , UpdateHashKeyASCII   
  , RowEndDate  
  , RowStartDate    
  FROM [dbo].[StgDimRoute];" ;  
    string UpdateSchemaName = "Updt";     
    string UpdateTableName = "UpdtRoute";  
    string UpdateConnection = "AWBIML";     
 string UpdateSQLStatement = @" UPDATE A    
  SET A.RouteArea = UA.RouteArea,    
    A.UpdateHashKey = UA.UpdateHashKey,    
    A.UpdtDtTm = CURRENT_TIMESTAMP    
  FROM Updt.UpdtRoute AS UA    
    JOIN Dbo.DimRoute AS A    
    ON UA.HistoricalHashKey = A.HistoricalHashKey    
  WHERE UpdateInPlace = 1;   
  UPDATE A    
  SET RowIsCurrent = 0,    
    A.RowEndDate = UA.RowEndDate,    
    A.UpdtDtTm = CURRENT_TIMESTAMP    
  FROM Updt.UpdtRoute UA    
    JOIN Dbo.DimRoute AS A    
    ON UA.HistoricalHashKey = A.HistoricalHashKey    
  WHERE A.RowIsCurrent = 1    
    AND UA.UpdateNewRow = 1;    
  INSERT INTO Dbo.DimRoute    
     (RouteID,    
     RouteDescription,    
     Supervisor,    
     RouteType,    
     Warehouse,    
     RouteArea,    
     RowIsCurrent,    
     RowStartDate,    
     RowEndDate,    
     HistoricalHashKey,    
     ChangeHashKey,    
     UpdateHashKey)    
  SELECT RouteID,    
     RouteDescription,    
     Supervisor,    
     RouteType,    
     Warehouse,    
     RouteArea,    
     1,    
     RowStartDate,    
     null,    
     HistoricalHashKey,    
     ChangeHashKey,    
     UpdateHashKey  
  FROM Updt.UpdtRoute WHERE UpdateNewRow = 1;" ;     
 #>  
 -->  
   <Packages>  
     <#=CallBimlScript("Dim2Hybrid.biml", PackageName, DstSchemaName, DstTableName, DstConnection, DataFlowSourceName, SrcConnection, SourceQuery, UpdateSchemaName, UpdateTableName, UpdateConnection, UpdateSQLStatement)#>  
   </Packages>  
 </Biml>  

Once I have my source view, dimension table, and update table in the database, the 3 Biml files added to my project, and BIDSHelper installed, all I have to do is right click on the CreateDim2Hybrid.Biml file and choose Generate SSIS packages to create my package.

Datazen, Microsoft Technologies, SQL Server, SSRS

Datazen Lives On in SQL Server 2016

Microsoft acquired Datazen back in April 2015, and I explored it and wrote about it a couple months later. To date, Microsoft has mostly left the product as is, although a new version containing bug fixes and a few enhancements was released in September.

While I was at PASS Summit I learned that there is a bright future for Datazen as a significant part of the MSBI reporting offerings. Microsoft has provided a reporting roadmap that looks very promising. They are working to align cloud and on-premises solutions and to harmonize report types. In the MSBI reporting world, there will be 4 report types:

  • Paginated reports (SSRS)
  • Interactive reports (Power BI)
  • Mobile reports (Datazen)
  • Analytical reports (Excel)

For on-premises solutions, you will have one unified SSRS Report manager that supports mobile reports and interactive reports as well as paginated reports.  And it’s much prettier than the old SSRS report manager!

ezgif.com-gif-maker
New vs old SSRS Report Manager, courtesy of @idigdata

There will also be a unified mobile app so there is no need to switch between apps to get Datazen reports and Power BI reports.

You can get the ebook here for free! We plan on updating it as more information comes out on Datazen and SQL Server 2016.

Microsoft Technologies, SQL Server, T-SQL

Notes and Tips on SQL Server Spatial Data Types

I’ve been working on a project that includes geographical data representing stops on a delivery route. I’ve just completed loading this data into a data mart. The source data contains longitude and latitude in millionths of a degree with 9 digits of data. We haven’t decided what tool we will use to visualize this data yet, but we know Power View and Power Map both accept latitude and longitude values. I decided to store my longitude and latitude data in decimal (9,6) fields. There is a good possibility that we may be computing distances between points in the future, so I thought it would be good to store the data as a spatial data type as well. I thought I would share a few things that I learned along the way.

There are two spatial data types in SQL Server: geometry and geography. Geometry represents the flat-earth system where units are all equally spaced apart. Geography represents the round-earth system measured in latitude and longitude. Since I had longitude and latitude in my data, I used the geography data type. The geography spatial data type is implemented as a .NET common language runtime (CLR) data type in SQL Server.

I populated my table using a query of which I’ve included a snippet below. You can see the use of the Point function to create my geography values.

SELECT 
     ...
     [Latitude] = [Latitude] / 1000000.0
    ,[Longitude] = [Longitude] / 1000000.0 
    ,[GeographyPoint] = geography::Point([Latitude]/1000000.0, [Longitude]/1000000.0, 4326)
FROM [MySourceTable]

The Point function accepts a a latitude, longitude, and SRID, and returns a geography value. An SRID is a unique identifier associated with a coordinate system, tolerance, and resolution. SRIDs are not specific to SQL Server. They are maintained by the International Association of Oil & Gas Producers (OGP) Surveying & Positioning Committee. Here’s a blog post that I think does a good job explaining many of the terms associated with spatial data in SQL Server.

Tip #1: You can see a list of SRIDs available in SQL Server by running the following query. SQL Server uses the default SRID of 4326, which is the WGS 84 spatial reference system.

SELECT * FROM sys.spatial_reference_systems

My source database has planned delivery stops and times and actual delivery stops and times stored in separate columns in a very wide table. I decided to pivot that data and create a table with a scenario key that refers to either plan or actual data. To do this, I wrote 2 queries and attempted to union them together to produce my final data set. That’s when I learned:

geography union error

Tip #2: When SQL Server performs a UNION it must compare values to remove duplicate rows. CLR user-defined type columns like geography are not comparable. As long as there is no risk of duplicate data between the two sets, you can use UNION ALL.

The query below works just fine as long as you use UNION ALL.

SELECT 
     [Latitude]
    ,[Longitude] 
    ,[GeographyPoint]
FROM [MySourceTable]
WHERE [RouteID] = 1

UNION ALL

SELECT 
     [Latitude]
    ,[Longitude] 
    ,[GeographyPoint]
FROM [MySourceTable]
WHERE [RouteID] = 5

As I finalized my table design I considered using a computed column to store my geography data. But I encountered an issue when I went to add a spatial index.  Spatial indexes are built on top of B+ trees. They decompose space into 4 levels of grids. I think spatial indexes are interesting, but they have some restrictions of which you should be aware. They require the table to have a clustered primary key. They cannot be specified on indexed views. And…

Tip #3: You can create a computed column to store the geography point based upon the latitude and longitude. But you cannot create a spatial index on a computed column.

If you try to create a spatial index on a computed column you will get SQL Server error message 6342.

You don’t have to use spatial data types just because you have spatial data. Many data viz tools have built-in geocoding that will accept longitude and latitude or an address. But spatial data types can be useful when calculating distances between two points and planning and measuring routes.