Data Warehousing, Microsoft Technologies, SQL Server, SSIS, T-SQL

Why We Don’t Truncate Dimensions and Facts During a Data Load

Every once in a while, I come across a data warehouse where the data load uses a full truncate and reload pattern to populate a fact or dimension. While it may not be the end of the world for a small table, it does concern me, and I usually recommend redesigning the load. My thoughts below on why this is an anti-pattern apply to the actual TRUNCATE TABLE statement as well as to a DELETE statement with no WHERE clause.

Surrogate Keys

Dimensional models use surrogate keys rather than natural keys. Surrogate keys are system-generated, meaningless values that are usually integers used to uniquely identify a record. They provide good performance for joins in queries, allow us to switch or use multiple source systems to feed the same tables, and facilitate the use of slowly changing dimensions. If you truncate a dimension table and then repopulate it, you will end up with different surrogate keys assigned to your dimension values. Let’s say we have a Geography dimension that looks like the table below.

GeographyKey  City    State     Country        PostalCode
1             Denver  Colorado  United States  80205
2             Dallas  Texas     United States  75201

If I truncate and reload this table, there is no guarantee that Denver will have a GeographyKey value of 1. I might reload the data and then have Paris as 1, Denver as 2, and Dallas as 3. And since we use surrogate keys as dimension lookups in fact tables and bridge tables (and snowflaked dimensions, if you use those), I would now need to update every table that references this dimension. That’s a lot of table updates without a good reason.

Error Handling

Another reason to avoid truncating the tables in your dimensional model is error handling. When you design your data load, you need to think about what should happen when it fails and where it is most likely to fail. Failures will often occur either in data transformation steps or upon inserting/updating values in the destination table due to data type/size conflicts.

Let’s say you truncate your fact table and then you encounter an ETL failure while performing the transformations to reload the table. Now you have an empty fact table. If you have error handling logic in your ETL, you may be able to redirect the error rows to another location to be handled later. But that still leaves you with an incomplete table at best. If, instead of truncating and reloading, you were doing inserts and updates when you encountered the error, your table might reflect slightly stale data as of the previous load. Depending on your ETL design, you still might end up with a table that is partially updated with the rows that were inserted before the error was encountered. But having a fact table with old or partially updated data is usually (but not always) preferable to having no data when a data load process fails.

Performance and Data Availability

When you truncate and reload a table, you are assuming you have access to all of the source data needed to fully repopulate the table. This may be true today, but what about next year when you switch CRM systems? Or what if your organization makes the decision to archive data in the source system that is older than 5 years? If any data needed for your fact or dimension becomes unavailable, your truncate and reload pattern will fail to serve your needs.

Even if you are sure your data will be available, you may want to consider your data load times. If you only have a handful of tables with a few hundred thousand rows max, you may load your data warehouse in a few minutes. But what happens when you have a few hundred million rows with some complicated transformation logic in your pipelines? You might be adding minutes to hours to your load times.

Why Do ETL Developers End Up Truncating Fact Tables?

Sometimes developers just don’t know better. But often there were understandable reasons for using the truncate and reload pattern. While I have never built a system where I truncated a dimension, I have had a couple where we at least started out with truncating the fact table. We usually built it to make data initially available with correct values, and then worked with stakeholders to find a different way to access the data or to have them generate the data differently to alleviate the problems that caused us to want to truncate and reload.

Often the cause is data quality issues. On one project I worked on, we had dimension values that defined the granularity of the fact table that would change in the source systems, and the process to try to update them in the fact table was too complicated and took too long. So we made the decision to truncate and reload, understanding the risks of doing so. Having the data available and usable for analysis gave us more information on how useful the data was to users and helped us work to understand why and how our keys were changing. Over time, we were able to influence the way data was entered into source systems so that we didn’t have to go through the truncate and reload process and could perform incremental loads on our fact table.

Sometimes source systems allow hard deletion of data and don’t maintain a list of what was deleted. And depending on access methods and the size of the data, it can be difficult to compare the rows in a fact table with the source data to identify the deleted rows. I would argue this is a poor design of the source system, but we often can’t change that, and we still need to include the data in our data warehouse. So it’s understandable why someone in this situation would want to truncate and reload a fact table.

If your stakeholders are ok with the potential of long load times, empty tables when errors occur, and the assumption that source data will always be available, then there isn’t much problem with truncating and reloading a fact table. But that is often not the case. And that decision should be made explicitly rather than implicitly.

What To Do Instead of Truncating and Reloading a Fact Table

If you’ve been truncating and reloading a fact table because that just seemed like the simplest thing to do, you can change your load pattern.

My general load pattern is:

  1. Truncate the update table
  2. Insert new rows into the fact table and changed rows into the update table
  3. Perform set-based updates on the fact table based upon the data in the update table

I’m able to determine what rows are new, changed, and unchanged by using hash values. I concatenate and hash the values that define the level of uniqueness of the row into one value I call HistoricalHash and the values from the remaining columns into a value I call ChangeHash. In other words, I load a transactional or periodic snapshot fact table in a manner similar to a Type 1 slowly changing dimension.
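
As a rough T-SQL sketch of the three steps above: the table names (Staging.FactSales, dbo.FactSales, Updt.FactSales) and the single Amount measure are hypothetical, and the sketch assumes the staging rows already carry both hash values. A real load may do the same work in an SSIS data flow instead.

```sql
-- Hypothetical tables and columns, for illustration only.

-- 1. Truncate the update table.
TRUNCATE TABLE Updt.FactSales;

-- 2a. New rows: the HistoricalHash is not yet in the fact table.
INSERT INTO dbo.FactSales (HistoricalHash, ChangeHash, Amount)
SELECT s.HistoricalHash, s.ChangeHash, s.Amount
FROM Staging.FactSales AS s
WHERE NOT EXISTS (SELECT 1
                  FROM dbo.FactSales AS f
                  WHERE f.HistoricalHash = s.HistoricalHash);

-- 2b. Changed rows: same HistoricalHash, different ChangeHash.
INSERT INTO Updt.FactSales (HistoricalHash, ChangeHash, Amount)
SELECT s.HistoricalHash, s.ChangeHash, s.Amount
FROM Staging.FactSales AS s
INNER JOIN dbo.FactSales AS f
    ON f.HistoricalHash = s.HistoricalHash
WHERE f.ChangeHash <> s.ChangeHash;

-- 3. One set-based update from the update table.
UPDATE f
SET  f.ChangeHash = u.ChangeHash
    ,f.Amount     = u.Amount
FROM dbo.FactSales AS f
INNER JOIN Updt.FactSales AS u
    ON u.HistoricalHash = f.HistoricalHash;
```

Rows whose HistoricalHash and ChangeHash both match are unchanged and are never touched.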

If you have data quality, data deletion, or other issues that prevent you from using a change detection pattern like the above, consider using a staging table and swapping it out with the fact table. Create a staging table that looks exactly like your fact table.

  1. Truncate the staging table
  2. Populate the staging table with all rows that should be in the fact table
  3. Swap the staging table and the fact table (this usually involves renaming the tables)
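
The swap could look something like the sketch below in SQL Server, where sp_rename renames a table. The table names, columns, and the Staging.vFactSales view are hypothetical.

```sql
-- Hypothetical names, for illustration only.

-- 1. Truncate the staging table.
TRUNCATE TABLE dbo.FactSales_Staging;

-- 2. Populate it with every row that should be in the fact table.
INSERT INTO dbo.FactSales_Staging (DateKey, CustomerKey, Amount)
SELECT DateKey, CustomerKey, Amount
FROM Staging.vFactSales;

-- 3. Swap the two tables by renaming them inside one transaction.
BEGIN TRANSACTION;
    EXEC sp_rename 'dbo.FactSales', 'FactSales_Old';
    EXEC sp_rename 'dbo.FactSales_Staging', 'FactSales';
    EXEC sp_rename 'dbo.FactSales_Old', 'FactSales_Staging';
COMMIT TRANSACTION;
```

If the load in step 2 fails, the swap never runs and the fact table keeps the data from the previous load.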

Understand the Consequences of Your Design

If you are choosing to truncate a dimension or fact table, be sure that you understand the trade-offs. You may be unwittingly opting for simplicity of ETL over data availability and efficient data loads.

Azure, Azure SQL DW, Data Warehousing, Microsoft Technologies, T-SQL

What You Need to Know About Data Classifications in Azure SQL Data Warehouse

Data classifications in Azure SQL DW entered public preview in March 2019. They allow you to label columns in your data warehouse with their information type and sensitivity level. There are built-in classifications, but you can also add custom classifications. This could be an important feature for auditing your storage and use of sensitive data as well as compliance with data regulations such as GDPR. You can export a report of all labeled columns, and you can see who is querying sensitive columns in your audit logs. The Azure Portal will even recommend classifications based upon your column names and data types. You can add the recommended classifications with a simple click of a button.

You can add data classifications in the Azure Portal or via T-SQL or PowerShell. Data classifications are database objects.

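    ADD SENSITIVITY CLASSIFICATION TO [SchemaName].[TableName].[ColumnName] -- placeholder: the column to classify
    WITH (LABEL='Confidential', INFORMATION_TYPE='Contact Info')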

To view existing data classifications, you can query the sys.sensitivity_classifications view or look in the Azure Portal.

SELECT sys.all_objects.name as [TableName], sys.all_columns.name as [ColumnName],
    [Label], [Information_Type]
FROM sys.sensitivity_classifications
left join sys.all_objects on sys.sensitivity_classifications.major_id = sys.all_objects.object_id
left join sys.all_columns on sys.sensitivity_classifications.major_id = sys.all_columns.object_id
    and sys.sensitivity_classifications.minor_id = sys.all_columns.column_id

Be Careful When Loading With CTAS and Rename

One issue that is specific to using data classifications in Azure SQL DW is that it is possible to inadvertently drop your classifications when you are loading your tables using the recommended T-SQL load pattern. Typically, when using T-SQL to load a dimensional model in Azure SQL DW, we perform the following steps:

  1. Create an upsert table via CTAS with the results of a union of new data from a staging table with existing data from the dimension table
  2. Rename the dimension table to something like Dimension_OLD
  3. Rename the upsert table to Dimension
  4. Drop the Dimension_OLD table

[Animation: table load process in Azure SQL DW]

In the animation above, you’ll first see the load process as described, and then it will replay with sensitivity labels added to the dimension table. You’ll see that they are dropped when we drop the old dimension table. This makes sense because sensitivity classifications are objects related to that table. We would expect an index to be dropped when we drop the related table. This works the same way.
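
For readers who can’t view the animation, the four load steps look roughly like this in T-SQL. The table and column names (dbo.DimCustomer, Staging.Customer, and so on) are hypothetical, and surrogate key handling is left out to keep the sketch short.

```sql
-- Hypothetical names, for illustration only.

-- 1. Build the upsert table with CTAS: staged rows plus
--    existing dimension rows that are not in the staging table.
CREATE TABLE dbo.DimCustomer_Upsert
WITH (DISTRIBUTION = ROUND_ROBIN, CLUSTERED COLUMNSTORE INDEX)
AS
SELECT s.CustomerID, s.CustomerName, s.Phone
FROM Staging.Customer AS s
UNION ALL
SELECT d.CustomerID, d.CustomerName, d.Phone
FROM dbo.DimCustomer AS d
WHERE NOT EXISTS (SELECT 1
                  FROM Staging.Customer AS s
                  WHERE s.CustomerID = d.CustomerID);

-- 2. Rename the dimension table out of the way.
RENAME OBJECT dbo.DimCustomer TO DimCustomer_OLD;

-- 3. Rename the upsert table to take its place.
RENAME OBJECT dbo.DimCustomer_Upsert TO DimCustomer;

-- 4. Drop the old table. Any sensitivity classifications on
--    its columns are dropped along with it.
DROP TABLE dbo.DimCustomer_OLD;
```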

Check out my SQL notebook for a demonstration of the issue as well as my workaround that I describe below. If you spin up an Azure SQL Data Warehouse with the sample database, you can run this notebook from Azure Data Studio and see the results for yourself.

There are a few complicating factors:

  • There are currently no visual indicators of sensitivity classifications in SSMS or Azure Data Studio.
  • ETL developers may not have access to the data warehouse in the Azure Portal to see the sensitivity classifications there.
  • The entire process of adding and managing sensitivity classifications may be invisible to an ETL developer. A data modeler or business analyst might be the person adding and managing the sensitivity classifications. If the ETL developer isn’t aware classifications have been added, they won’t know to go and look for them in the sys.sensitivity_classifications view.
  • SSDT does not yet support sensitivity classifications. The only way I have found to add them into the database project is as a post-deployment script with the build property set to none.

The good news is that you can add the sensitivity classifications back to your dimension table using T-SQL. The bad news is still that the ETL developer must remember to do it. My workaround for now is a stored procedure that will do the rename and drop of the tables plus copy the sensitivity classifications over. My hope is that it’s easier to remember to use it since it will do the rename and drop for you as well.

Update: Someone asked about the name SwapWithMetadata and why it doesn’t specifically mention sensitivity classifications. I didn’t mention classifications because there are other things that need this same treatment. Dynamic data masking will also need to be reapplied. With dynamic data masking, it will be even more important to add it back immediately after swapping the tables rather than waiting for a full data load of all selected tables to finish and adding all classifications back. If your load takes a long time or the process fails on another table, you don’t want your data exposed without a mask to users who shouldn’t see the full information.

CREATE PROC SwapWithMetadata
    @SrcSchema NVARCHAR(128),
    @SrcTable NVARCHAR(128),
    @DestSchema NVARCHAR(128),
    @DestTable NVARCHAR(128),
    @TransferMetadata BIT,
    @DropOldTable BIT
AS
BEGIN
    SET NOCOUNT ON;
    BEGIN TRY
        --Check if destination table exists
        DECLARE @DestSchemaQualifiedTableName NVARCHAR(257)
        SET @DestSchemaQualifiedTableName = @DestSchema + '.' + @DestTable
        IF OBJECT_ID(@DestSchemaQualifiedTableName) IS NULL
        BEGIN
            DECLARE @DestErr NVARCHAR(MAX)
            SET @DestErr = 'Table ' + @DestSchemaQualifiedTableName + ' not found'
            RAISERROR(@DestErr, 15, 1)
        END

        --Check if source table exists
        DECLARE @SrcSchemaQualifiedTableName NVARCHAR(257)
        SET @SrcSchemaQualifiedTableName = @SrcSchema + '.' + @SrcTable
        IF OBJECT_ID(@SrcSchemaQualifiedTableName) IS NULL
        BEGIN
            DECLARE @SrcErr NVARCHAR(MAX)
            SET @SrcErr = 'Table ' + @SrcSchemaQualifiedTableName + ' not found'
            RAISERROR(@SrcErr, 15, 1)
        END

        --Move destination table to destination_old. Move source table to destination
        DECLARE @RenameSql NVARCHAR(MAX)
        SET @RenameSql = 'RENAME OBJECT ' + @DestSchemaQualifiedTableName + ' TO ' + @DestTable + '_old; '
        SET @RenameSql = @RenameSql + ' RENAME OBJECT ' + @SrcSchemaQualifiedTableName + ' TO ' + @DestTable
        PRINT 'Executing ' + @RenameSql + ' ...'
        EXEC sp_executesql @RenameSql;

        --drop temp table if it exists
        IF OBJECT_ID('tempDB..#tempApplySensitivityClassificationsToTable') IS NOT NULL
            DROP TABLE #tempApplySensitivityClassificationsToTable;

        --check if we should transfer data classifications from old to new table
        IF ISNULL(@TransferMetadata,0) = 1
        BEGIN
            --put current classifications in a temp table
            DECLARE @OldTable NVARCHAR(128) = @DestTable + '_old';
            WITH CurrentClassifications as (
                SELECT
                    --note: schema is hardcoded to dbo; adjust if your tables live in other schemas
                    CAST('dbo' as NVARCHAR(128)) [Schema],
                    CAST(sys.all_objects.name as NVARCHAR(128)) [Table],
                    CAST(sys.all_columns.name as NVARCHAR(128)) [Column],
                    CAST([Information_Type] as NVARCHAR(128)) [InformationType],
                    CAST([Label] as NVARCHAR(128)) [Label]
                FROM sys.sensitivity_classifications
                LEFT OUTER JOIN sys.all_objects on sys.sensitivity_classifications.major_id = sys.all_objects.object_id
                LEFT OUTER JOIN sys.all_columns on sys.sensitivity_classifications.major_id = sys.all_columns.object_id
                    and sys.sensitivity_classifications.minor_id = sys.all_columns.column_id
            )
            SELECT ROW_NUMBER() OVER (ORDER BY [Schema], [Table], [Column]) [ID],
                [Schema], [Table], [Column], [InformationType], [Label]
            INTO #tempApplySensitivityClassificationsToTable
            FROM CurrentClassifications
            WHERE [Schema] = @DestSchema AND [Table] = @OldTable;

            DECLARE @i INT, @Max INT, @Col NVARCHAR(128), @InfoType NVARCHAR(128),
                    @Label NVARCHAR(128), @Sql NVARCHAR(MAX)
            SET @i = 1
            SELECT @Max = COUNT(*)
            FROM #tempApplySensitivityClassificationsToTable;
            PRINT 'Transferring ' + CAST(@Max as VARCHAR(4)) + ' classifications'

            --drop and recreate sensitivity classifications
            WHILE @i <= @Max
            BEGIN
                SELECT @Col = [Column], @InfoType = [InformationType], @Label = [Label]
                FROM #tempApplySensitivityClassificationsToTable
                WHERE [ID] = @i

                SET @Sql = 'DROP SENSITIVITY CLASSIFICATION FROM ' + @DestSchemaQualifiedTableName + '.' + @Col
                PRINT 'Executing ' + @Sql + '...'
                EXEC sp_executesql @Sql

                --a classification can have a label, an information type, or both
                SET @Sql = 'ADD SENSITIVITY CLASSIFICATION TO ' + @DestSchemaQualifiedTableName + '.' + @Col
                IF @Label IS NOT NULL AND @InfoType IS NOT NULL
                    SET @Sql = @Sql + ' WITH (LABEL=''' + @Label + ''', INFORMATION_TYPE=''' + @InfoType + ''')'
                ELSE IF @Label IS NULL
                    SET @Sql = @Sql + ' WITH (INFORMATION_TYPE=''' + @InfoType + ''')'
                ELSE
                    SET @Sql = @Sql + ' WITH (LABEL=''' + @Label + ''')'
                PRINT 'Executing ' + @Sql + '...'
                EXEC sp_executesql @Sql

                SET @i = @i + 1
            END
        END

        --drop the old table if requested
        IF ISNULL(@DropOldTable,0) = 1
        BEGIN
            DECLARE @DropSql NVARCHAR(MAX)
            SET @DropSql = 'DROP TABLE ' + @DestSchemaQualifiedTableName + '_old;'
            PRINT 'Executing ' + @DropSql + '...'
            EXEC sp_executesql @DropSql;
        END
    END TRY
    BEGIN CATCH
        PRINT 'ERROR... Procedure: ' + ISNULL(ERROR_PROCEDURE(), '') + ' Message: ' + ERROR_MESSAGE()
    END CATCH
END
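
A call to the procedure might look like the following. The parameter names come from the procedure definition above. The table names are hypothetical and assume the upsert table was already created with CTAS.

```sql
-- Hypothetical table names: dbo.DimCustomer_Upsert was just built by CTAS
-- and should replace dbo.DimCustomer.
EXEC SwapWithMetadata
    @SrcSchema = 'dbo',
    @SrcTable = 'DimCustomer_Upsert',
    @DestSchema = 'dbo',
    @DestTable = 'DimCustomer',
    @TransferMetadata = 1,
    @DropOldTable = 1;

-- Quick check that the classifications made it to the new table.
SELECT COUNT(*) AS ClassificationCount
FROM sys.sensitivity_classifications
WHERE major_id = OBJECT_ID('dbo.DimCustomer');
```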

Eventually, the tools will be updated to provide more visibility to data sensitivity classifications, but we still need to make sure they don’t get dropped.

For now, my recommendation is if you are going to go in and add a lot of sensitivity classifications, that you create a user defined restore point immediately after so that you know you have them in a backup somewhere. Azure SQL DW doesn’t do point-in-time restores the way Azure SQL DB does. It takes automatic restore points every 8 hours or so. So if someone went through the trouble of adding the sensitivity classifications and they were dropped through the data load process, there is no guarantee that you could use a backup to get them back.

Vote for My Enhancement Idea

If you would like Microsoft to add something to the product to keep sensitivity classifications from being dropped, or at least make it easier to add them back, please vote for my idea.

Not an Issue with Other Data Load Methods

Please note that if you are using other tools or methods to load your tables where you don’t swap them out, you won’t have the issue of dropping your sensitivity classifications. But I wanted to bring up this issue because I can see people spending a lot of time adding them and then suddenly losing them, and I want everyone to avoid that frustration.

Give Data Classifications a Try

I think data classifications are a good addition to SQL DW. Anything that helps us efficiently catalog and manage our sensitive data is good. I have added them in my demo environment and hope to use them in a client environment soon.

Have you tried out data classifications in SQL DW or DB? What do you think so far? If not, what is keeping you from using them?

Azure, Data Warehousing, Microsoft Technologies, T-SQL

Create a Date Dimension in Azure SQL Data Warehouse

Most data warehouses and data marts require a date dimension or calendar table. Those of us who have been building data warehouses in SQL Server for a while have collected our favorite scripts to build out a date dimension. For a standard date dimension, I am a fan of Aaron Bertrand’s script. But the current version (as of Aug 8, 2016) of Azure SQL Data Warehouse doesn’t support computed columns, which are used in Aaron’s script.

I adapted Aaron’s script to work in Azure SQL Data Warehouse and am sharing it with you below, so you don’t have to do the same. I did leave out holidays because I didn’t need them in my calendar table. To add them back in, just add the column back to the table and use Aaron’s update statements. I also changed the MonthYear field to include a space between the month and year. Otherwise, my script should produce the same results as Aaron’s in an Azure SQL DW-friendly way. Notice that I chose to use Round Robin distribution for my date table. Also, tables are now created with clustered columnstore indexes by default in Azure SQL DW.

-- Creates a table called RPT.Calendar. To change the table name, edit the
-- CREATE TABLE ... AS SELECT statement. To change the date range, edit the
-- DECLARE statement below.
DECLARE @StartDate DATE = '20100101', @NumberOfYears INT = 30;

-- prevent set or regional settings from interfering with
-- interpretation of dates / literals
SET DATEFIRST 7;
SET DATEFORMAT mdy;
SET LANGUAGE US_ENGLISH;

-- this is just a holding table for intermediate calculations:
CREATE TABLE #dimdate
(
  [date] DATE,
  [day] tinyint,
  [month] tinyint,
  FirstOfMonth date,
  [MonthName] varchar(12),
  [week] tinyint,
  [ISOweek] tinyint,
  [DayOfWeek] tinyint,
  [quarter] tinyint,
  [year] smallint,
  FirstOfYear date,
  Style112 char(8),
  Style101 char(10)
);

DECLARE @CutoffDate DATE = DATEADD(YEAR, @NumberOfYears, @StartDate);

-- use the catalog views to generate as many rows as we need
INSERT #dimdate([date])
SELECT d
FROM
(
  SELECT d = DATEADD(DAY, rn - 1, @StartDate)
  FROM
  (
    SELECT TOP (DATEDIFF(DAY, @StartDate, @CutoffDate))
      rn = ROW_NUMBER() OVER (ORDER BY s1.[object_id])
    FROM sys.all_objects AS s1
    CROSS JOIN sys.all_objects AS s2
    -- on my system this would support > 5 million days
    ORDER BY s1.[object_id]
  ) AS x
) AS y;

-- populate the remaining columns (this replaces the computed columns)
UPDATE #dimdate
SET
  [day] = DATEPART(DAY, [date]),
  [month] = DATEPART(MONTH, [date]),
  FirstOfMonth = CONVERT(DATE, DATEADD(MONTH, DATEDIFF(MONTH, 0, [date]), 0)),
  [MonthName] = DATENAME(MONTH, [date]),
  [week] = DATEPART(WEEK, [date]),
  [ISOweek] = DATEPART(ISO_WEEK, [date]),
  [DayOfWeek] = DATEPART(WEEKDAY, [date]),
  [quarter] = DATEPART(QUARTER, [date]),
  [year] = DATEPART(YEAR, [date]),
  FirstOfYear = CONVERT(DATE, DATEADD(YEAR, DATEDIFF(YEAR, 0, [date]), 0)),
  Style112 = CONVERT(CHAR(8), [date], 112),
  Style101 = CONVERT(CHAR(10), [date], 101);

-- create the final table with CTAS
CREATE TABLE RPT.Calendar
WITH
(
  DISTRIBUTION = ROUND_ROBIN
)
AS
SELECT
  DateKey = CONVERT(INT, Style112),
  [Date] = [date],
  [Day] = CONVERT(TINYINT, [day]),
  DaySuffix = CONVERT(CHAR(2), CASE WHEN [day] / 10 = 1 THEN 'th' ELSE
    CASE RIGHT([day], 1) WHEN '1' THEN 'st' WHEN '2' THEN 'nd'
    WHEN '3' THEN 'rd' ELSE 'th' END END),
  [Weekday] = CONVERT(TINYINT, [DayOfWeek]),
  [WeekDayName] = CONVERT(VARCHAR(10), DATENAME(WEEKDAY, [date])),
  DOWInMonth = CONVERT(TINYINT, ROW_NUMBER() OVER
    (PARTITION BY FirstOfMonth, [DayOfWeek] ORDER BY [date])),
  WeekOfMonth = CONVERT(TINYINT, DENSE_RANK() OVER
    (PARTITION BY [year], [month] ORDER BY [week])),
  WeekOfYear = CONVERT(TINYINT, [week]),
  [Month] = CONVERT(TINYINT, [month]),
  [MonthName] = CONVERT(VARCHAR(10), [MonthName]),
  [Quarter] = CONVERT(TINYINT, [quarter]),
  QuarterName = CONVERT(VARCHAR(6), CASE [quarter] WHEN 1 THEN 'First'
    WHEN 2 THEN 'Second' WHEN 3 THEN 'Third' WHEN 4 THEN 'Fourth' END),
  [Year] = [year],
  MMYYYY = CONVERT(CHAR(6), LEFT(Style101, 2) + LEFT(Style112, 4)),
  MonthYear = CONVERT(CHAR(8), LEFT([MonthName], 3) + ' ' + LEFT(Style112, 4)),
  FirstDayOfMonth = FirstOfMonth,
  LastDayOfMonth = MAX([date]) OVER (PARTITION BY [year], [month]),
  FirstDayOfQuarter = MIN([date]) OVER (PARTITION BY [year], [quarter]),
  LastDayOfQuarter = MAX([date]) OVER (PARTITION BY [year], [quarter]),
  FirstDayOfYear = FirstOfYear,
  LastDayOfYear = MAX([date]) OVER (PARTITION BY [year]),
  FirstDayOfNextMonth = DATEADD(MONTH, 1, FirstOfMonth),
  FirstDayOfNextYear = DATEADD(YEAR, 1, FirstOfYear)
FROM #dimdate;

DROP TABLE #dimdate;

Instead of the computed columns, I inserted the dates, then updated the other columns in the temporary table. Then I used the Create Table As Select syntax to create my final RPT.Calendar table.

Happy time trending!

Microsoft Technologies, SQL Server, T-SQL

Notes and Tips on SQL Server Spatial Data Types

I’ve been working on a project that includes geographical data representing stops on a delivery route. I’ve just completed loading this data into a data mart. The source data contains longitude and latitude in millionths of a degree with 9 digits of data. We haven’t decided what tool we will use to visualize this data yet, but we know Power View and Power Map both accept latitude and longitude values. I decided to store my longitude and latitude data in decimal (9,6) fields. There is a good possibility that we may be computing distances between points in the future, so I thought it would be good to store the data as a spatial data type as well. I thought I would share a few things that I learned along the way.

There are two spatial data types in SQL Server: geometry and geography. Geometry represents the flat-earth system where units are all equally spaced apart. Geography represents the round-earth system measured in latitude and longitude. Since I had longitude and latitude in my data, I used the geography data type. The geography spatial data type is implemented as a .NET common language runtime (CLR) data type in SQL Server.

I populated my table using a query of which I’ve included a snippet below. You can see the use of the Point function to create my geography values.

SELECT
     [Latitude] = [Latitude] / 1000000.0
    ,[Longitude] = [Longitude] / 1000000.0
    ,[GeographyPoint] = geography::Point([Latitude]/1000000.0, [Longitude]/1000000.0, 4326)
FROM [MySourceTable]

The Point function accepts a latitude, longitude, and SRID, and returns a geography value. An SRID is a unique identifier associated with a coordinate system, tolerance, and resolution. SRIDs are not specific to SQL Server. They are maintained by the International Association of Oil & Gas Producers (OGP) Surveying & Positioning Committee. Here’s a blog post that I think does a good job explaining many of the terms associated with spatial data in SQL Server.

Tip #1: You can see a list of SRIDs available in SQL Server by running the following query. SQL Server uses the default SRID of 4326, which is the WGS 84 spatial reference system.

SELECT * FROM sys.spatial_reference_systems

My source database has planned delivery stops and times and actual delivery stops and times stored in separate columns in a very wide table. I decided to pivot that data and create a table with a scenario key that refers to either plan or actual data. To do this, I wrote 2 queries and attempted to union them together to produce my final data set. That’s when I learned:

[Image: geography union error]

Tip #2: When SQL Server performs a UNION it must compare values to remove duplicate rows. CLR user-defined type columns like geography are not comparable. As long as there is no risk of duplicate data between the two sets, you can use UNION ALL.

The query below works just fine as long as you use UNION ALL.

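SELECT [GeographyPoint] = geography::Point([Latitude]/1000000.0, [Longitude]/1000000.0, 4326)
FROM [MySourceTable]
WHERE [RouteID] = 1

UNION ALL

SELECT [GeographyPoint] = geography::Point([Latitude]/1000000.0, [Longitude]/1000000.0, 4326)
FROM [MySourceTable]
WHERE [RouteID] = 5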

As I finalized my table design I considered using a computed column to store my geography data. But I encountered an issue when I went to add a spatial index.  Spatial indexes are built on top of B+ trees. They decompose space into 4 levels of grids. I think spatial indexes are interesting, but they have some restrictions of which you should be aware. They require the table to have a clustered primary key. They cannot be specified on indexed views. And…

Tip #3: You can create a computed column to store the geography point based upon the latitude and longitude. But you cannot create a spatial index on a computed column.

If you try to create a spatial index on a computed column you will get SQL Server error message 6342.

You don’t have to use spatial data types just because you have spatial data. Many data viz tools have built-in geocoding that will accept longitude and latitude or an address. But spatial data types can be useful when calculating distances between two points and planning and measuring routes.
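
As a small illustration of the distance use case, here is a sketch using the STDistance method on two geography points. The coordinates are made-up example values (roughly Denver and Dallas), not data from my project. With SRID 4326, STDistance returns the distance in meters.

```sql
-- Example coordinates only; Point takes latitude, longitude, SRID.
DECLARE @StopA geography = geography::Point(39.7392, -104.9903, 4326);
DECLARE @StopB geography = geography::Point(32.7767, -96.7970, 4326);

-- STDistance returns meters for SRID 4326, so divide by 1000 for kilometers.
SELECT @StopA.STDistance(@StopB) / 1000.0 AS DistanceKm;
```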

BIDS Helper, Biml, Data Warehousing, Microsoft Technologies, SSIS, T-SQL

Biml for a Type 1 Slowly Changing Dimension

I’ve been working on building my Biml library over the last few months. One of the first design patterns I created was a Type 1 Slowly Changing Dimension where all fields except the key fields that define the level of granularity are overwritten with updated values.  It assumes I have a staging table, but it could probably be easily modified to pull directly from the source table if needed.

For simplicity, I’m going to create DimSalesReason from the AdventureWorks database as my example. Please pardon the large amount of code in this post. I wanted to provide a fairly complete picture so you can see how the pieces fit together so I’m providing all of the SQL and Biml needed to generate the package.

Required Database Objects

My staging table is a copy of the source table and looks like this:

CREATE TABLE [Staging].[SalesReason](
    [SalesReasonID] [int] NOT NULL,
    [Name] [nvarchar](50) NOT NULL,
    [ReasonType] [nvarchar](50) NOT NULL,
    [ModifiedDate] [datetime] NOT NULL
)

Here is the DDL for my dimension table:

CREATE TABLE [dbo].[DimSalesReason](
    [SalesReasonKey] int IDENTITY(1,1) NOT NULL Primary Key
    ,[SalesReasonID] int not null
    ,[SalesReasonName] nvarchar(50) not null
    ,[SalesReasonType] nvarchar(50) not null
    ,[HistoricalHashKey] varbinary(20) NOT NULL
    ,[ChangeHashKey] varbinary(20) NOT NULL
    ,[InsertDttm] datetime not null
    ,[UpdateDttm] datetime NULL
)

You’ll notice I have some audit fields in my table. UpdateDttm is the datetime when the row was last updated. InsertDttm is the datetime when the row was initially inserted. The HistoricalHashKey and ChangeHashKey are used for change detection. I may not need them in a dimension this simple, but I use them in larger/wider dimensions and I like my dimensions to be built consistently (unless they need to be tweaked for performance). The HistoricalHashKey represents the business key of the table that defines the level of uniqueness for the row in the dimension table. The ChangeHashKey represents all the other fields that are used in the dimension row. This makes it easier to determine if values have changed since I can compare just the ChangeHashKey instead of each individual field.

I also have an update table. This allows me to update in bulk rather than updating in place row by row. With a very small table, you might not notice a performance difference, but this pattern becomes much more efficient as the table grows.

CREATE TABLE [Updt].[UpdtSalesReason](
	 [SalesReasonID] int not null
	,[SalesReasonName] nvarchar(50) not null
	,[SalesReasonType] nvarchar(50) not null
	,[HistoricalHashKey] varbinary(20) NOT NULL
	,[ChangeHashKey] varbinary(20) NOT NULL
	,[InsertDttm] datetime not null
)
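
Once the changed rows have landed in the update table, the bulk update can be a single statement. This is a sketch of what that statement might look like, using the table and column names defined in this section. The exact statement in the generated package may differ.

```sql
-- Set-based update of the dimension from the update table,
-- matching rows on the hashed business key.
UPDATE d
SET  d.SalesReasonName = u.SalesReasonName
    ,d.SalesReasonType = u.SalesReasonType
    ,d.ChangeHashKey   = u.ChangeHashKey
    ,d.UpdateDttm      = CURRENT_TIMESTAMP
FROM [dbo].[DimSalesReason] AS d
INNER JOIN [Updt].[UpdtSalesReason] AS u
    ON u.HistoricalHashKey = d.HistoricalHashKey;
```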

I use a view to do most of the transformation work (joins, business logic, etc.) for my Type 1 SCD package. Here’s my view for the SalesReasonDimension:

Create View [Staging].[DimSalesReason] as
With SalesReasonData as (
	Select	 [SalesReasonID]
		,[Name] as [SalesReasonName]
		,[ReasonType] as [SalesReasonType]
		,CONVERT(VARBINARY(20), HASHBYTES('MD5', CONCAT([SalesReasonID], ' ')))
                                                               AS [HistoricalHashKey]
		,CONVERT(VARBINARY(20), HASHBYTES('MD5', CONCAT(Name, ' ', ReasonType)))
                                                               AS [ChangeHashKey]
		,CURRENT_TIMESTAMP as InsertDttm
		,CURRENT_TIMESTAMP as UpdtDttm
	From [Staging].[SalesReason]
	UNION ALL
	-- unknown member row
	Select	-1 as [SalesReasonID]
	   , 'Unknown' as [SalesReasonName]
	   , 'Unknown' as [SalesReasonType]
	   ,CONVERT(VARBINARY(20), HASHBYTES('MD5', CONCAT(-1, ' ')))
                                                               AS [HistoricalHashKey]
	   ,CONVERT(VARBINARY(20), HASHBYTES('MD5', CONCAT('Unknown', ' ', 'Unknown')))
                                                               AS [ChangeHashKey]
	   ,CURRENT_TIMESTAMP as InsertDttm
	   ,CURRENT_TIMESTAMP as UpdtDttm
)
Select SalesReasonID
	   , SalesReasonName
	   , SalesReasonType
	   , HistoricalHashKey
	   , ChangeHashKey
	   , CONVERT(VARCHAR(34), HistoricalHashKey, 1) AS HistoricalHashKeyASCII
	   , CONVERT(VARCHAR(34), ChangeHashKey, 1) AS ChangeHashKeyASCII
	   , InsertDttm
	   , UpdtDttm
from SalesReasonData

Using a view to do the transformation work allows me to use the SQL Server engine to do the things it is good at (joins, case statements, conversions) and SSIS to do the things it is good at (controlling the flow of data). It also makes it easy to create a design pattern in Biml that is abstract enough that it easily fits most SCD 1 scenarios. The hashkey fields allow me to do a lookup on one field to determine if the row already exists in the dimension table and a quick comparison to determine if any of the values for that row have changed. I chose to use an MD5 hash because I think it is a good balance of speed/size and collision risk. You may feel differently (others prefer SHA1 due to lower collision risk). On a dimension this size, the difference probably isn’t noticeable.

You can also see that I add my unknown member row in my view. Some people prefer to set the surrogate key of the unknown value to -1. I prefer to set the business key to -1 and let the surrogate key be set to any value in the load process. I’m not a fan of having to turn on IDENTITY_INSERT to add the unknown row. When I do the dimension key lookup for fact tables, I just look for the row where the business key = -1. This also means I don’t have to check my dimension tables to see if someone remembered to insert the unknown rows after deployment to a new environment because I know the unknown rows will be inserted when the package is run.
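
To illustrate the lookup, here is a sketch of a fact load query that falls back to the unknown member when no dimension row matches. Staging.SalesOrderReason and its columns are hypothetical, and the query assumes the unknown row (business key -1) has already been loaded into the dimension.

```sql
-- Staging.SalesOrderReason is a hypothetical staging table.
SELECT  s.SalesOrderID
       ,COALESCE(d.SalesReasonKey, unk.SalesReasonKey) AS SalesReasonKey
FROM Staging.SalesOrderReason AS s
LEFT JOIN dbo.DimSalesReason AS d
    ON d.SalesReasonID = s.SalesReasonID
-- Find the unknown member by its business key, whatever surrogate key it received.
CROSS JOIN (SELECT SalesReasonKey
            FROM dbo.DimSalesReason
            WHERE SalesReasonID = -1) AS unk;
```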

I have audit steps in my package that write to the Audit.Package_Control table using stored procedures.

CREATE TABLE [Audit].[Package_Control](
	[Package_NM] [varchar](100) NOT NULL,
	[Package_ID] [uniqueidentifier] NOT NULL,
	[Parent_Package_ID] [uniqueidentifier] NULL,
	[Execution_ID] [bigint] NULL,
	[Start_TS] [datetime] NOT NULL,
	[Stop_TS] [datetime] NULL,
	[Insert_Row_QT] [int] NULL,
	[Update_Row_QT] [int] NULL,
	[Unchanged_Row_QT] [int] NULL,
	[Deleted_Row_QT] [int] NULL,
	[Duration_s]  AS (datediff(second,[Start_TS],[Stop_TS])),
	[PackageLogID] [int] IDENTITY(1,1) NOT NULL
)

CREATE PROCEDURE [Audit].[PackageControlStart]
(
    @PackageName varchar(100)
,   @PackageId uniqueidentifier
,   @ParentPackageId uniqueidentifier = NULL
,   @ExecutionId bigint
,   @StartTime DATETIME
,   @StopTime datetime = NULL
,   @InsertRowQuantity int = NULL
,   @UpdateRowQuantity int = NULL
,   @UnchangedRowQuantity int = NULL
)
AS
BEGIN
    SET NOCOUNT ON;
    DECLARE @PackageLogId int

    -- Log the start of the package execution.
    INSERT INTO [Audit].[Package_Control]
    (
        [Package_NM]
    ,   [Package_ID]
    ,   [Parent_Package_ID]
    ,   [Execution_ID]
    ,   [Start_TS]
    ,   [Stop_TS]
    ,   [Insert_Row_QT]
    ,   [Update_Row_QT]
    ,   [Unchanged_Row_QT]
    )
    VALUES
    (
        @PackageName
    ,   @PackageId
    ,   @ParentPackageId
    ,   @ExecutionId
    ,   @StartTime
    ,   @StopTime
    ,   @InsertRowQuantity
    ,   @UpdateRowQuantity
    ,   @UnchangedRowQuantity
    );

    -- Return the identity value of the new log row.
    SELECT @PackageLogId = SCOPE_IDENTITY();
    SELECT @PackageLogId as PackageLogID
END

CREATE PROCEDURE [Audit].[PackageControlStop]
    @PackageId uniqueidentifier
,   @ExecutionId bigint
,   @InsertRowQuantity int = NULL
,   @UpdateRowQuantity int = NULL
,   @UnchangedRowQuantity int = NULL
AS
BEGIN
    SET NOCOUNT ON;

    -- Close out the execution.
    -- (Assumes Stop_TS is stamped with the current time.)
    UPDATE PC
    SET
        [Stop_TS] = CURRENT_TIMESTAMP
    ,   [Insert_Row_QT] = @InsertRowQuantity
    ,   [Update_Row_QT] = @UpdateRowQuantity
    ,   [Unchanged_Row_QT] = @UnchangedRowQuantity
    FROM  [Audit].[Package_Control] AS PC
    WHERE PC.Package_ID = @PackageId
        AND PC.Execution_ID = @ExecutionId
        AND PC.[Stop_TS] IS NULL;
END
GO

And Now for the Biml

My Biml library usually contains 3 files for each package type:

  • ProjectConnections – I like to keep my connection managers in a separate file so I only have to update one place if I need to add or change a connection.
  • Dim1 – This contains my actual design pattern.
  • CreateDim1 – This is the Biml file I run to generate the package. It gets separated so I can pull values from databases and pass in variables to my design pattern using BimlScript. For this example I have hardcoded my variables into this file rather than pulling from a database.


ProjectConnections.biml:

<#@ template language="C#" tier="1" #>
<Connections>
    <OleDbConnection Name="AWBIML" ConnectionString="Data Source=localhost\SQL2014;Initial Catalog=AWBIML;Integrated Security=SSPI;Provider=SQLNCLI11.1;" CreateInProject="true"/>
    <OleDbConnection Name="Audit" ConnectionString="Data Source=localhost\SQL2014;Initial Catalog=AWBIML;Integrated Security=SSPI;Provider=SQLNCLI11.1;" CreateInProject="true"/>
</Connections>

The ProjectConnections file doesn’t begin with Biml tags because it gets included in the middle of the CreateDim1.biml file. AWBIML is a local database I created to contain my data mart.


Dim1.biml:

<#@ template language="C#" tier="2" #>

<#@ property name="PackageName" type="String" #>
<#@ property name="DstSchemaName" type="String" #>
<#@ property name="DstTableName" type="String" #>
<#@ property name="DstConnection" type="String" #>
<#@ property name="DataFlowSourceName" type="String" #>
<#@ property name="SrcConnection" type="String" #>
<#@ property name="SourceQuery" type="String" #>
<#@ property name="UpdateSchemaName" type="String" #>
<#@ property name="UpdateTableName" type="String" #>
<#@ property name="UpdateConnection" type="String" #>
<#@ property name="UpdateSQLStatement" type="String" #>
<Package Name="<#=PackageName#>" Language="None">
    <Parameters>
        <Parameter DataType="String" Name="ParentPackageID">00000000-0000-0000-0000-000000000000</Parameter>
    </Parameters>
    <Variables>
        <!-- An expression-based String variable (EvaluateAsExpression="true") is declared here; its name and expression are cut off in this listing. -->
        <Variable DataType="String" Name="QueryAuditStart">
          EXECUTE [Audit].[PackageControlStart] @PackageName=?, @PackageId=?, 
          @ParentPackageId=?,@ExecutionId=?, @StartTime=?;</Variable>
        <Variable DataType="String" Name="QueryAuditUpdate">
          EXECUTE [Audit].[PackageControlStop] @PackageId=?, @ExecutionId=?, 
          @InsertRowQuantity=?, @UpdateRowQuantity=?, @UnchangedRowQuantity=?;</Variable>
        <Variable DataType="Int32"  Name="RowCountChanged">0</Variable>
        <Variable DataType="Int32"  Name="RowCountNew">0</Variable>
        <Variable DataType="Int32"  Name="RowCountSource">0</Variable>
        <Variable DataType="Int32"  Name="RowCountUnchanged">0</Variable>
        <Variable DataType="String" Name="SchemaName"><#=DstSchemaName#></Variable>
        <Variable DataType="String" Name="TableName"><#=DstTableName#></Variable>
    </Variables>
    <Tasks>
        <ExecuteSQL Name="SQL Begin Audit" ConnectionName="Audit">
            <VariableInput VariableName="User.QueryAuditStart" />
            <Parameters>
                <Parameter Name="0" VariableName="System.PackageName" DataType="String" Length="-1"/>
                <Parameter Name="1" VariableName="System.PackageID" DataType="Guid" Length="-1"/>
                <Parameter Name="2" VariableName="ParentPackageID" DataType="Guid" Length="-1"/>
                <Parameter Name="3" VariableName="System.ServerExecutionID" DataType="Int64" Length="-1"/>
                <Parameter Name="4" VariableName="System.StartTime" DataType="Date" Length="-1"/>
            </Parameters>
        </ExecuteSQL>

        <!-- ConnectionName is assumed here; the attribute is cut off in the listing. -->
        <ExecuteSQL Name="SQL Truncate <#=UpdateTableName#>" ConnectionName="<#=UpdateConnection#>">
            <DirectInput>Truncate Table [<#=UpdateSchemaName#>].[<#=UpdateTableName#>]</DirectInput>
            <PrecedenceConstraints>
                <Inputs>
                    <Input OutputPathName="SQL Begin Audit.Output" />
                </Inputs>
            </PrecedenceConstraints>
        </ExecuteSQL>

        <Dataflow Name="DFT Insert<#=DstTableName#>">
            <Transformations>
                <RowCount Name="CNT Changed_Rows" VariableName="User.RowCountChanged">
                    <InputPath OutputPathName="CSPL Check For Changes.ChangedRows" />
                </RowCount>
                <ConditionalSplit Name="CSPL Check For Changes">
                    <InputPath OutputPathName="LKP Historical Key.Match" />
                    <OutputPaths>
                        <OutputPath Name="ChangedRows">
                            <Expression>ChangeHashKeyASCII != lkp_ChangeHashKeyASCII</Expression>
                        </OutputPath>
                    </OutputPaths>
                </ConditionalSplit>
                <RowCount Name="CNT New_Rows" VariableName="User.RowCountNew">
                    <InputPath OutputPathName="LKP Historical Key.NoMatch" />
                </RowCount>
                <!-- OleDbConnectionName is assumed here; the attribute is cut off in the listing. -->
                <Lookup Name="LKP Historical Key" NoMatchBehavior="RedirectRowsToNoMatchOutput"
                        OleDbConnectionName="<#=DstConnection#>">
                    <DirectInput>SELECT
                      CONVERT(VARCHAR(34), ChangeHashKey, 1) AS ChangeHashKeyASCII
                      , CONVERT(VARCHAR(34), HistoricalHashKey, 1) AS HistoricalHashKeyASCII
                      FROM <#=DstSchemaName#>.<#=DstTableName#></DirectInput>
                    <Parameters>
                        <Parameter SourceColumn="HistoricalHashKeyASCII" />
                    </Parameters>
                    <ParameterizedQuery>select * from (SELECT
                      CONVERT(VARCHAR(34), ChangeHashKey, 1) AS ChangeHashKeyASCII
                      , CONVERT(VARCHAR(34), HistoricalHashKey, 1) AS HistoricalHashKeyASCII
                      FROM <#=DstSchemaName#>.<#=DstTableName#>) [refTable]
                      where [refTable].[HistoricalHashKeyASCII] = ?</ParameterizedQuery>
                    <InputPath OutputPathName="CNT Source_Rows.Output" />
                    <Inputs>
                        <Column SourceColumn="HistoricalHashKeyASCII"
                                TargetColumn="HistoricalHashKeyASCII" />
                    </Inputs>
                    <Outputs>
                        <Column SourceColumn="ChangeHashKeyASCII"
                                TargetColumn="lkp_ChangeHashKeyASCII" />
                    </Outputs>
                </Lookup>
                <OleDbDestination Name="OLE_DST New_Rows" ConnectionName="<#=DstConnection#>">
                    <InputPath OutputPathName="CNT New_Rows.Output" />
                    <ExternalTableOutput Table="<#=DstSchemaName#>.<#=DstTableName#>" />
                </OleDbDestination>
                <RowCount Name="CNT Source_Rows" VariableName="User.RowCountSource">
                    <InputPath OutputPathName="<#=DataFlowSourceName#>.Output" />
                </RowCount>
                <OleDbSource Name="<#=DataFlowSourceName#>" ConnectionName="<#=SrcConnection#>">
                    <DirectInput><#=SourceQuery#></DirectInput>
                </OleDbSource>
                <RowCount Name="CNT Unchanged_Rows" VariableName="User.RowCountUnchanged">
                    <InputPath OutputPathName="CSPL Check For Changes.Default" />
                </RowCount>
                <OleDbDestination Name="OLE_DST Update Table" ConnectionName="<#=DstConnection#>">
                    <InputPath OutputPathName="CNT Changed_Rows.Output" />
                    <ExternalTableOutput Table="[<#=UpdateSchemaName#>].[<#=UpdateTableName#>]" />
                </OleDbDestination>
            </Transformations>
            <PrecedenceConstraints>
                <Inputs>
                    <Input OutputPathName="SQL Truncate <#=UpdateTableName#>.Output" />
                </Inputs>
            </PrecedenceConstraints>
        </Dataflow>

        <!-- ConnectionName is assumed here; the attribute is cut off in the listing. -->
        <ExecuteSQL Name="SQL Update <#=DstTableName#>" ConnectionName="<#=UpdateConnection#>">
            <DirectInput><#=UpdateSQLStatement#></DirectInput>
            <PrecedenceConstraints>
                <Inputs>
                    <Input OutputPathName="DFT Insert<#=DstTableName#>.Output" />
                </Inputs>
            </PrecedenceConstraints>
        </ExecuteSQL>

        <ExecuteSQL Name="SQL Close Audit" ConnectionName="Audit">
            <VariableInput VariableName="User.QueryAuditUpdate" />
            <Parameters>
                <Parameter Name="0" VariableName="System.PackageID" DataType="Guid" Length="-1"/>
                <Parameter Name="1" VariableName="System.ServerExecutionID" DataType="Int64" Length="-1"/>
                <Parameter Name="2" VariableName="User.RowCountNew" DataType="Int32" Length="-1"/>
                <Parameter Name="3" VariableName="User.RowCountChanged" DataType="Int32" Length="-1"/>
                <Parameter Name="4" VariableName="User.RowCountUnchanged" DataType="Int32" Length="-1"/>
            </Parameters>
            <PrecedenceConstraints>
                <Inputs>
                    <Input OutputPathName="SQL Update <#=DstTableName#>.Output" />
                </Inputs>
            </PrecedenceConstraints>
        </ExecuteSQL>
    </Tasks>
</Package>



CreateDim1.biml:

<#@ template language="C#" hostspecific="true" #>
<Biml xmlns="http://schemas.varigence.com/biml.xsd">
<#@ include file="ProjectConnections.biml" #>

<#
	string PackageName  = "LoadDimSalesReason";
	string DstSchemaName = "dbo";
	string DstTableName = "DimSalesReason";
	string DstConnection = "AWBIML";
	string DataFlowSourceName = "OLE_SRC StgDimSalesReason";
	string SrcConnection = "AWBIML";
	// The remaining columns of this select list are not shown in the listing.
	string SourceQuery = @"SELECT [SalesReasonID]
				  FROM [AWBIML].[Staging].[DimSalesReason]";
	string UpdateSchemaName = "Updt";
	string UpdateTableName = "UpdtSalesReason";
	string UpdateConnection = "AWBIML";
	string UpdateSQLStatement = @"Update d 
					SET
					d.SalesReasonName = u.SalesReasonName,
					d.SalesReasonType = u.SalesReasonType,
					d.changehashkey = u.changehashkey 
					FROM Updt.UpdtSalesReason u 
					inner join dbo.DimSalesReason d
					on u.HistoricalHashKey = d.HistoricalHashKey;";
#>
<Packages>
        <#=CallBimlScript("Dim1.biml", PackageName, DstSchemaName, DstTableName, DstConnection,
        DataFlowSourceName, SrcConnection, SourceQuery, UpdateSchemaName, UpdateTableName, 
        UpdateConnection, UpdateSQLStatement)#>
</Packages>
</Biml>

Resulting SSIS Package

Executing the CreateDim1.Biml file generates a package called LoadDimSalesReason.dtsx. Here is the control flow:
[Image: BIML Dim1 ControlFlow]
I start the package by logging the package start in my Audit.Package_Control table. Then I truncate my update table to prepare for the new data load. Next I have a data flow task, which inserts data into either the dbo.DimSalesReason table or the Updt.UpdtSalesReason table (see below). The SQL Update task updates the DimSalesReason table with the rows that were inserted into UpdtSalesReason. And finally, I log my package completion.

Here’s my data flow:
[Image: BIML Dim1 DataFlow]
First, I retrieve my data from my staging view. Then I count the number of rows coming from my source so I can log it in my Audit.Package_Control table. The lookup on the HistoricalHashKey field sends rows with no match to be inserted into the dimension table. If the rows have a match, they are checked to see if their ChangeHashKey values match the value of the ChangeHashKey in the existing row in the dimension table. If the ChangeHashKey values match, the row is counted and nothing else is done. If the ChangeHashKey values don’t match, the row is counted and written to the update table.
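
The routing in this data flow can be sketched outside SSIS as follows. This is a simplified Python illustration, not the generated package. The function name route_rows, the dictionary layout, and the shortened sample hash values are hypothetical.

```python
def route_rows(source_rows, dimension_rows):
    """Split source rows into new, changed, and unchanged rows.

    Mirrors the lookup on HistoricalHashKey followed by the conditional
    split on ChangeHashKey.
    """
    # Reference set for the lookup: historical hash -> stored change hash.
    stored = {row["HistoricalHashKey"]: row["ChangeHashKey"] for row in dimension_rows}
    routed = {"new": [], "changed": [], "unchanged": []}
    for row in source_rows:
        if row["HistoricalHashKey"] not in stored:
            routed["new"].append(row)        # no match: insert into the dimension
        elif stored[row["HistoricalHashKey"]] != row["ChangeHashKey"]:
            routed["changed"].append(row)    # match, hash differs: write to update table
        else:
            routed["unchanged"].append(row)  # match, same hash: count only
    return routed


dimension = [
    {"HistoricalHashKey": "0xA1", "ChangeHashKey": "0x01"},
    {"HistoricalHashKey": "0xB2", "ChangeHashKey": "0x02"},
]
source = [
    {"HistoricalHashKey": "0xA1", "ChangeHashKey": "0x01"},  # unchanged
    {"HistoricalHashKey": "0xB2", "ChangeHashKey": "0xFF"},  # changed
    {"HistoricalHashKey": "0xC3", "ChangeHashKey": "0x03"},  # new
]

routed = route_rows(source, dimension)
# These counts correspond to the row count variables logged by the audit step.
row_counts = {name: len(rows) for name, rows in routed.items()}
row_counts["source"] = len(source)
```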

And that is all you need to create a Type 1 dimension load with Biml.

Excel, Microsoft Technologies, Power BI, Power Pivot, Power View, SSAS, T-SQL


I’ve had several occasions to use OPENROWSET recently in T-SQL.  Although you can use it as an ad hoc alternative to a linked server to query a relational database, I’m finding it useful to get data from other sources. I’ll go into details of how to use it, but first I would like to acknowledge: OPENROWSET is not the answer to everything.  Sometimes you should use a linked server or SSIS or some other method.  I had some specific cases where I felt OPENROWSET made more sense for my task.

Importing Data from an Excel Spreadsheet

Technical need: I had some data that needed to be validated before importing into SQL Server from Excel as part of an upload process in an application.  If it didn’t pass validation, I did not want to import the data. Rather than use SSIS to import the data into a table and then check it, I used OPENROWSET to pull it directly from the Excel file into a temp table, ran all of my validation checks, and then imported the data using SSIS all within a stored procedure that was called by an application.  Field and table names have been changed to protect the innocent.

The stored procedure:

-- The procedure name below is hypothetical; the original name is not shown.
CREATE PROCEDURE dbo.ImportExcelFile
    @datafile nvarchar(127)
AS
BEGIN
    -- SET NOCOUNT ON added to prevent extra result sets from
    -- interfering with SELECT statements.
    SET NOCOUNT ON;

    DECLARE
        @sql nvarchar(max)
    ,   @rowcount int;
    SET @rowcount = 0;

    DECLARE @FileExists int;

    BEGIN TRY
        EXECUTE master.dbo.xp_fileExist @datafile, @FileExists out;
        -- Value of 1 = file exists
        -- Value of 0 = file doesn't
        IF @FileExists = 0
        BEGIN
            SELECT
                CONCAT('The file ', @datafile, ' could not be found.')
                                                   AS reasontext
            ,   NULL AS field2
            ,   NULL AS field3;
            RETURN -1;
        END
    END TRY
    BEGIN CATCH
        SELECT
            ERROR_MESSAGE()  AS reasontext
        ,   NULL AS field2
        ,   NULL AS field3;
        RETURN -1;
    END CATCH

    TRUNCATE TABLE dbo.finaldest;
    IF OBJECT_ID('tempdb..##tempf') IS NOT NULL
        DROP TABLE ##tempf;

    --Create temp table to hold values from Excel file
    CREATE TABLE ##tempf (
        Field1 [nvarchar](255),
        Field2 [nvarchar](255),
        Field3 [nvarchar](255),
        Field4 [nvarchar](255),
        Field5 [nvarchar](255),
        Field6 [nvarchar](255)
    );

    --dynamic sql to populate temp table with values from Excel file
    SET @sql = N'INSERT INTO ##tempf
                SELECT * FROM OPENROWSET(
                ''Microsoft.ACE.OLEDB.12.0'',''Excel 12.0;Database='
                + @datafile
                + ';HDR=YES;IMEX=1'', 
                ''SELECT s.* FROM [Sheet1$] AS S 
                WHERE [Field2] IS NOT NULL AND [Field3] IS NOT NULL'')';

    BEGIN TRY
        EXEC sp_executesql @sql;
    END TRY
    BEGIN CATCH
        SELECT
            'File not found or invalid file type' AS reasontext
            --ERROR_MESSAGE()  AS reasontext
        ,   NULL AS Field2
        ,   NULL AS Field3;
        IF OBJECT_ID('tempdb..##tempf', 'U') IS NOT NULL
            DROP TABLE ##tempf;
        RETURN -1;
    END CATCH

    /*Do lots of data validation here
    */

    --if data validation tests are passed, write to final destination table
    --(the column list is assumed to mirror ##tempf)
    INSERT INTO dbo.[finaldest]
    SELECT [Field1], [Field2], [Field3], [Field4], [Field5], [Field6]
    FROM ##tempf;

    DROP TABLE ##tempf;
END

To make this work you need to install the Microsoft ACE OLE DB 12.0 provider on your server. You cannot use the Jet OLEDB Provider with 64-bit SQL Server and 64-bit Excel. Once you have this installed you need to change some settings:

EXEC master.dbo.sp_MSset_oledb_prop N'Microsoft.ACE.OLEDB.12.0', 
N'AllowInProcess', 1;
EXEC master.dbo.sp_MSset_oledb_prop N'Microsoft.ACE.OLEDB.12.0', 
N'DynamicParameters', 1;
GO
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
GO
EXEC sp_configure 'Ad Hoc Distributed Queries', 1;
RECONFIGURE;
GO

Here is a blog post that explains the errors you might get if you don’t have these settings correct. It also notes that you can check your settings in the registry keys. For SQL Server 2012 the path is HKEY_LOCAL_MACHINE -> SOFTWARE -> Microsoft -> Microsoft SQL Server -> MSSQL11.MSSQLSERVER -> Providers -> Microsoft.ACE.OLEDB.12.0.  You can then see that AllowInProcess and DynamicParameters are set to 1.


This provider seems to be a bit fussy and there are some things that took me a while to troubleshoot once I got started.  First, make sure that the location of the Excel file you are reading is accessible to the account that is running the SQL engine and doesn’t require admin access. The Linked Server errors that get returned with this really aren’t that helpful, so you have to use your Google-fu to work your way through it.  I used a global temporary table so it would be accessible across connections if it needed to be (I was having trouble when I used a local temp table).  The application that calls this stored procedure queues up the calls and executes them serially so one can’t override the other.

Returning results from an Analysis Services cube in SQL

Technical need: I had some data (several facts with common dimensions) with calculations in a multidimensional cube and an application that needed that data but couldn’t/wouldn’t consume the MDX result set.

The query: For demonstration purposes I’m using a small query that returns the values for a drop-down box for selecting a member of the date dimension.

select a."[Measures].[DateKeys]" as DateKey, 
a."[Measures].[DateLevel]" as DateLevel, 
a."[Measures].[DateValues]" as DateValue 
from openrowset
('MSOLAP','Datasource=localhost; Initial Catalog=MyCubeDB;',
'with member [Measures].[DateValues] as
    Iif([Date].[Fiscal Calendar].currentmember.level_number = 2, "    " 
    + [Date].[Fiscal Calendar].currentmember.member_value, 
    [Date].[Fiscal Calendar].currentmember.member_value)
  member [Measures].[DateKeys] as 
    [Date].[Fiscal Calendar].currentmember.uniquename
  member [Measures].[DateLevel] as 
    [Date].[Fiscal Calendar].currentmember.level_number

 select
 {[Measures].[DateValues], [Measures].[DateKeys], 
                           [Measures].[DateLevel]} on 0 ,
 {Exists(Descendants([Date].[Fiscal Calendar].[FY2014], 1, 
                         SELF_AND_BEFORE), , "MyMeasureGroup") } on 1
 from [MyCube]
') as a;

You can see that I explicitly named my fields, but you can also do a select * to get all the fields from the OPENROWSET. I had to put the fields from the cube in quotation marks to make it work because of the brackets in the field names. For Analysis Services, MSOLAP is the name of my provider.  You must have the correct version of this provider listed under the linked servers on the server where this is being executed.

[Image: MSOLAP Provider]

For SQL Server 2012 SP1, you can download the Microsoft Analysis Services OLE DB Provider for Microsoft SQL Server 2012 SP1 here. There is nothing special about the data source and initial catalog. They are exactly the same connection info you would use when connecting to SSAS through Management Studio.  The last part is simply the MDX query.  You should be able to copy that part out and run it in Management Studio connected to your SSAS database and get results.

Inserting Images Into a Database

Technical need: I was building a demo for Power BI and I needed to show some images in my Power View reports in Office 365.  In order for the images to show in Office 365, you must import the images into SQL Server and then import them into Power Pivot. First I downloaded images of flags for each country from the CIA World Factbook.  I took the folder of flags and put it in my pictures folder. I already had a table in SQL Server that contained country names and codes.  So I just needed to add a flag for each country.

The query:

  CREATE TABLE [dbo].[countryflagsdemo]
  (
  [id] int, 
  [country] varchar(100), 
  [code] varchar(10), -- assumed type; the column is required by the INSERT below
  flagimage [varbinary](max)
  );
 --get ID, country, and code from existing table
  INSERT INTO dbo.countryflagsdemo (id, country, code)
  SELECT id, country, code FROM dbo.Countryflag1;

  DECLARE @i INT;
  DECLARE @rows INT; 
  DECLARE @code VARCHAR(10);
  DECLARE @sql NVARCHAR(MAX);
  SET @i = 1;
--get row count for loop (assumes id values run from 1 to the row count)
  SET @rows = (SELECT COUNT(country) FROM dbo.countryflagsdemo);
  -- <= so the last row is processed as well
  WHILE (@i <= @rows)
  BEGIN
    SET @code = (SELECT code FROM dbo.countryflagsdemo WHERE [id] = @i);
    SELECT @sql = 'UPDATE dbo.countryflagsdemo 
      SET flagimage = 
      (SELECT BulkColumn FROM 
            OPENROWSET(BULK
            ''C:\Users\mlongoria\Pictures\flags\large\' + @code 
               + '-lgflag.gif'', Single_Blob) as flagimage)
            WHERE code = ''' + @code + ''';'

    EXEC (@sql);
    SET @i = @i + 1; 
    --PRINT @code;
  END


I figured out how to do this based upon this MSDN blog post. I inserted my other data first and then added the images.  I’m sure you could find a way to do this all in one select statement.  Since this was a one-time thing I didn’t see the need to find another way. It should be noted that the bulk insert must pull from a location on your computer; you can’t use links to the images on the internet. I downloaded the images in a folder. Each image name was the abbreviation for the country. If you are going to import this data into PowerPivot, be sure you change the table behavior properties so it will recognize the varbinary field as an image.


There may be other ways to accomplish my goals, but it’s nice to understand how to use OPENROWSET and have that in my toolbox.