Azure SQL DB, Azure SQL DW, Data Warehousing, Microsoft Technologies, Python, SQL Server

pandas.DataFrame.drop_duplicates and SQL Server unique constraints

I’ve now helped people with this issue a few times, so I thought I should blog it for anyone else that runs into the “mystery”.

Here’s the scenario: You are using Python, perhaps in Azure Databricks, to manipulate data before inserting it into a SQL Database. Your source data is a flattened data extract and you need to create a unique list of values for an entity found in the data. For example, you have a dataset containing sales for the last month and you want a list of the unique products that have been sold. Then you insert the unique product values into a SQL table with a unique constraint, but you encounter issues on the insert related to unique values.

Given a data frame with one column containing product names, you might write some Python like df.dropDuplicates(subset="ProductName").

This would give you a list of distinct product names. But Python is case sensitive. So if you have product “abc123” and product “ABC123” those are considered distinct.

Case sensitivity in SQL Server

SQL Server is case insensitive by default. While you can use a case sensitive collation, most databases don’t.

Let’s say I create a table to contain products and add a unique constraint on product name. Adding a unique constraint means that I cannot insert two values into that column that are the same.

CREATE TABLE [dbo].[Products] (
    [ProductName] nvarchar(100) NOT NULL
)
GO

ALTER TABLE [dbo].[Products] 
ADD CONSTRAINT UQ_Product_ProductName UNIQUE (ProductName)

I can insert my first row.

Insert into dbo.Products (ProductName)
values ('abc123')

Now I try to insert another row where the letters are capitalized.

Insert into dbo.Products (ProductName)
values ('ABC123')

This fails with the following error:

Violation of UNIQUE KEY constraint 'UQ_Product_ProductName'. Cannot insert duplicate key in object 'dbo.Products'. The duplicate key value is (ABC123).

This is because my database collation is set to SQL_Latin1_General_CP1_CI_AS (the CI means case insensitive).

Potential solution

If you need to get to a list of case insensitive distinct values, you can do the following:

  1. Create a column that contains the values of the ProductName converted to lower case.
  2. Run the dropDuplicates on the lowercase column.
  3. Drop the lowercase column.

You can see code on this Stack Overflow solution as an example (as always, please read it and use at your own risk).

Of course, you need be sure that you don’t need to keep both (differently cased) versions of the value as distinct before you follow the above steps. That’s a business/data decision that must be considered. If you are fine to only have one value regardless of case, this should work. Be careful if you further use this data in Python. pandas.DataFrame.merge (similar to a SQL join) is case sensitive, as are most Python functions. Make sure you are handling your data correctly there, or just do your joins before you deduplicate.

Data Warehousing, Microsoft Technologies, SSIS

My Preferences for SSIS Design

Lately, I have been using SSIS execution frameworks and Biml created by other people to populate data marts and data warehouses. It has taught me a few things and helped me clarify what I like and dislike compared to my usual framework. I’ve got the beginning of my preferences list started below. There are probably situations where I would want to deviate from my preferences, but I think they make a good starting point.

Populating Data

  • For self-service BI environments, a date dimension that doesn’t go out much further than the greatest date in your data. This can be a view or stored procedure that limits and updates dates rather than a static date dimension that goes out until the end of time.
  • Unknown values are included in normal dimension loads, not in separate scripts that must be run on deployment. This way, if an unknown value is ever left out or deleted, it will be added in the next data load rather than requiring a special execution of a script.
  • Every table should have InsertDateTime and UpdateDateTime columns. The UpdateDateTime column should be populated with the same value as the InsertDateTime column upon creation of the row, rather than being left null.
  • Whatever you use to create tables, include primary keys, foreign keys, and indexes with your table definitions. Provide explicit constraint names to simplify database comparisons. You can disable your foreign keys, but they need to be there to provide that metadata.
  • Separate your final dimensional/reporting tables from audit tables and staging tables. This can be done with separate schemas or even separate databases.

Data Integration Process

  • There should be consistent error handling in each layer (staging, dims, facts, etc.). If you write errors to another location (flat file, database table), have a process that notifies the right people that errors occurred. The process of consuming corrected data must be built, tested, and integrated into the existing process.
  • Make your error handling process reflect what end users need to see when an error occurs. Does it make sense to have a partial load when there is an issue? Or should it be all or nothing?
  • Have smart master packages that determine which packages to run. Don’t check whether the package should run inside of the package itself – do that in the master package.
  • Master packages should execute child packages in parallel as much as possible rather than defaulting to sequential execution.
  • Have an audit log with one row per package. Include the SSIS ServerExecutionID in the audit log – not the package -specific ID but the execution ID for the entire run. If there are incremental loads, the where clause used to filter the load should be captured in the audit table. Include row counts as well as package start and stop time in your audit log.
  • Add an AuditLogID column on your dimension, fact, and staging tables so you can trace each row back to the process that populated it.
  • For dims and facts, perform change detection/deduplication of records, usually through hash values and either SSIS lookups or SQL queries with WHERE NOT EXISTS.
  • Avoid T-SQL MERGE statements. Write individual insert/update/delete statements. This avoid any bugs in MERGE and makes your SQL easier to understand and troubleshoot.
  • Use consistent naming of tasks, source, destinations, packages, connection managers, etc. Connection managers pointing to databases should have names that refer to the database rather than the server.
  • If you are downloading files, move the files to an archive folder once files are processed. You can have rules in place if you have retention limits. But you probably need to keep files from at least the last load for audit and troubleshooting purposes. This could change if you are importing very sensitive data.
  • Even if you need to copy all columns from a table, write a select statement for database sources that explicitly names fields rather than using SELECT *. or just selecting the table or view.
  • SSIS lookups should use an explicit query rather than referencing an entire table.
  • Implement restartability at the package level for most packages (you should have single-purpose packages executed by a master package). Checkpoints are ineffective within a package. If you build your audit log table correctly, you can get the list of packages that have not run in the last X minutes/hours and feed that to your master package.
  • Send email from your scheduling tool rather than within an SSIS package.
  • Track data lineage in your tables. This can be as simple as having a table that lists all of your data sources with an ID column and including that ID value in each row of your staging, fact, and dimension tables.
  • Dims and facts are not truncated. Data should be inserted and updated (and deleted, if necessary).
  • Connection strings used in multiple packages should be project-level connection strings.

Biml Specifics

  • Understand whether you need a flexible Biml Framework or just an accelerator for a current project. If you need flexibility, don’t hardcode connection strings and other things that change when you add/change sources and destinations. If you just need to accelerate development of a simple data mart, total flexibility may be overkill and actually cause more work.
  • Have a single place where you add synthetic metadata, as much as possible. BimlScript gets messy and difficult to understand when you have some extended properties that are read in, some annotations added directly, and some variables defined in your code. This is why I like synthetic metadata stored in a database. Also, extended properties don’t exist in Azure SQL Data Warehouse, so if you need your framework to work there you can’t go that route.
  • Don’t repeat your code in multiple files. If you have some logic that gets reused, move it to a separate file and reference it from other files.

What Do You Think?

What’s on your SSIS preferences list? Do you disagree with one of my preferences and want to share your knowledge? Let’s chat in the comments.

Data Warehousing, Microsoft Technologies, SQL Server, SSIS, T-SQL

Why We Don’t Truncate Dimensions and Facts During a Data Load

Every once in a while, I come across a data warehouse where the data load uses a full truncate and reload pattern to populate a fact or dimension. While it may not be the end of the world for a small table, it does concern me and I usually recommend to redesign the load. My thoughts below on why this is an anti-pattern are true for using the actual TRUNCATE TABLE statement as well as executing a DELETE statement with no WHERE clause.

Surrogate Keys

Dimensional models use surrogate keys rather than natural keys. Surrogate keys are system-generated, meaningless values that are usually integers used to uniquely identify a record. They provide good performance for joins in queries, allow us to switch or use multiple source systems to feed the same tables, and facilitate the use of slowly changing dimensions. If you truncate a dimension table and then repopulate it, you will end up with different surrogate keys assigned to your dimension values. Let’s say we have a Geography dimension that looks like the table below.

GeographyKeyCityStateProvinceCountryPostalCode
1DenverColoradoUnited States80205
2DallasTexasUnited States75201
3ParisÎle-de-FranceFrance75001

If I truncate and reload this table, there is no guarantee that Denver will have a GeographyKey value of 1. I might reload the data and then have Paris as 1, Denver as 2, and Dallas as 3. And since we use surrogate keys as dimension lookups in fact tables and bridge tables (and snowflaked dimensions, if you use those), I would now need to update every table that references this dimension. That’s a lot of table updates without a good reason.

Error Handling

Another reason to avoid truncating the tables in your dimensional model is error handling. When you design your data load, you need to think about what should happen when it fails and where it is most likely to fail. Failures will often occur either in data transformation steps or upon inserting/updating values in the destination table due to data type/size conflicts.

Let’s say you truncate your fact table and then you encounter an ETL failure while performing the transformations to reload the table. Now you have an empty fact table. If you have error handling logic in your ETL, you may be able to redirect the error rows to another location to be handled later. But that still leaves you with an incomplete table at best. If instead of truncating and reloading, you were doing inserts and updates when you encountered an error, your table might reflect slightly stale data as of the previous load. You still might have a situation where your table is partially updated with the rows that were inserted before the error was encountered depending on how your ETL design. But having a fact table with old or partially updated data is usually (but not always) more preferable than having no data when a data load process fails.

Performance and Data Availability

When you truncate and reload a table, you are assuming you have access to all of the source data needed to fully repopulate the table. This may be true today, but what about next year when you switch CRM systems? Or what if your organization makes the decision to archive data in the source system that is older than 5 years? If any data needed for your fact or dimension becomes unavailable, your truncate and reload pattern will fail to serve your needs.

Even if you are sure your data will be available, you may want to consider your data load times. If you only have a handful of tables with a few hundred thousand rows max, you may load your data warehouse in a few minutes. But what happens when you have a few hundred million rows with some complicated transformation logic in your pipelines? You might be adding minutes to hours to your load times.

Why Do ETL Developers End Up Truncating Fact Tables?

Sometimes developers just don’t know better. But often there were understandable reasons for using the truncate and reload pattern. While I have never built a system where I truncated a dimension, I have had a couple where we at least started out with truncating the fact table. We usually built it to make data initially available with correct values, and then worked with stakeholders to find a different way to access the data or to have them generate the data differently to alleviate the problems that caused us to want to truncate and reload.

Often the cause is data quality issues. On one project I worked, we had dimension values that defined the granularity of the fact table that would change in the source systems, and the process to try to update them in the fact table was too complicated and took too long. So we made the decision to truncate and reload, understanding the risks of doing so. Having the data available and usable for analysis gave us more information on how useful the data was to users and helped us work to understand why and how our keys were changing. Over time, we were able to influence the way data was entered into source systems so that we didn’t have to go through the truncate and reload process and could perform incremental loads on our fact table.

Sometimes source systems allow hard deletion of data and don’t maintain a list of what was deleted. And depending on access methods and the size of the data, it can be difficult to compare the rows in a fact table with the source data to identify the deleted rows. I would argue this is a poor design of the source system, but we often can’t change that, and we still need to include the data in our data warehouse. So it’s understandable why someone in this situation would want to truncate and reload a fact table.

If your stakeholders are ok with the potential of long load times, empty tables when errors occur, and the assumption that source data will always be available, then there isn’t much problem with truncating and reloading a fact table. But that is often not the case. And that decision should be made explicitly rather than implicitly.

What To Do Instead of Truncating and Reloading a Fact Table

If you’ve been truncating and reloading a fact table because that just seemed like the simplest thing to do, you can change your load pattern.

My general load pattern is:

  1. Truncate the update table
  2. Insert new rows into the fact table and changed rows into the update table
  3. Perform set-based updates on the fact table based upon the data in the update table

I’m able to determine what rows are new, changed, and unchanged by using hash values. I concatenate and hash the values that define the level of uniqueness of the row into one value I call HistoricalHash and the values from the remaining columns into a value I call ChangeHash. In other words, I load a transactional or periodic snapshot fact table in a manner similar to a Type 1 slowly changing dimension.

If you have data quality, data deletion, or other issues that prevent you from using a change detection pattern like the above, consider using a staging table and swapping it out with the fact table. Create a staging table that looks exactly like your fact table.

  1. Truncate the staging table
  2. Populate the staging table with all rows that should be in the fact table
  3. Swap the staging table and the fact table (this usually involves renaming the tables)

Understand the Consequences of Your Design

If you are choosing to truncate a dimension or fact table, be sure that you understand the trade-offs. You may be unwittingly opting for simplicity of ETL over data availability and efficient data loads.

Azure, Azure SQL DW, Data Warehousing, Microsoft Technologies, T-SQL

What You Need to Know About Data Classifications in Azure SQL Data Warehouse

Data classifications in Azure SQL DW entered public preview in March 2019. They allow you to label columns in your data warehouse with their information type and sensitivity level. There are built-in classifications, but you can also add custom classifications. This could be an important feature for auditing your storage and use of sensitive data as well as compliance with data regulations such as GDPR. You can export a report of all labeled columns, and you can see who is querying sensitive columns in your audit logs. The Azure Portal will even recommend classifications based upon your column names and data types. You can add the recommended classifications with a simple click of a button.

You can add data classifications in the Azure Portal or via T-SQL or PowerShell. Data classifications are database objects.

ADD SENSITIVITY CLASSIFICATION TO
    dbo.DimCustomer.Phone
    WITH (LABEL='Confidential', INFORMATION_TYPE='Contact Info')

To view existing data classifications, you can query the sys.sensitivity_classifications view or look in the Azure Portal.

SELECT
sys.all_objects.name as [TableName], 
sys.all_columns.name as [ColumnName],
[Label], 
[Information_Type], 
FROM sys.sensitivity_classifications
left join sys.all_objects on sys.sensitivity_classifications.major_id = sys.all_objects.object_id
left join sys.all_columns on sys.sensitivity_classifications.major_id = sys.all_columns.object_id
    and sys.sensitivity_classifications.minor_id = sys.all_columns.column_id

Be Careful When Loading With CTAS and Rename

One issue that is specific to using data classifications in Azure SQL DW is that it is possible to inadvertantly drop your classifications when you are loading your tables using the recommended T-SQL load pattern. Typically, when using T-SQL to load a dimensional model in Azure SQL DW, we perform the following steps:

  1. Create an upsert table via CTAS with the results of a union of new data from a staging table with existing data from the dimension table
  2. Rename the dimension table to something like Dimension_OLD
  3. Rename the upsert table to Dimension
  4. Drop the Dimension_OLD table
Animation of a table load process in Azure SQL DW


In the animation above, you’ll first see the load process as described, and then it will replay with sensitivity labels added to the dimension table. You’ll see that they are dropped when we drop the old dimension table. This makes sense because sensitivity classifications are objects related to that table. We would expect an index to be dropped when we drop the related table. This works the same way.

Check out my SQL notebook for a demonstration of the issue as well as my workaround that I describe below. If you spin up an Azure SQL Data Warehouse with the sample database, you can run this notebook from Azure Data Studio and see the results for yourself.

There are a few complicating factors:

  • There are currently no visual indicators of sensitivity classifications in SSMS or Azure Data Studio.
  • ETL developers may not have access to the data warehouse in the Azure Portal to see the sensitivity classifications there.
  • The entire process of adding and managing sensitivity classifications may be invisible to an ETL developer. A data modeler or business analyst might be the person adding and managing the sensitivity classifications. If the ETL developer isn’t aware classifications have been added, they won’t know to go and look for them in the sys.sensitivity_classifications view.
  • SSDT does not yet support sensitivity classifications. The only way I have found to add them into the database project is as a post-deployment script with the build property set to none.

The good news is that you can add the sensitivity classifications back to your dimension table using T-SQL. The bad news is still that the ETL developer must remember to do it. My workaround for now is a stored procedure that will do the the rename and drop of the tables plus copy the sensitivity classifications over. My hope is that it it’s easier to remember to use it since it will do the rename and drop for you as well.

Update: Someone asked about the name SwapWithMetadata and why it doesn’t specifically mention sensitivity classifications. I didn’t mention classifications because there are other things that need this same treatment. Dynamic data masking will also need to be reapplied. With dynamic data masking, it will be even more important to add it back immediately after swapping the tables rather than waiting for a full data load of all selected tables to finish and adding all classifications back. If your load takes a long time or the process fails on another table, you don’t want your data exposed without a mask to users who shouldn’t see the full information.

CREATE PROC SwapWithMetadata
@SrcSchema NVARCHAR(128),
@SrcTable NVARCHAR(128),
@DestSchema NVARCHAR(128),
@DestTable NVARCHAR(128),
@TransferMetadata BIT,
@DropOldTable BIT
AS
BEGIN
SET NOCOUNT ON
BEGIN TRY
–Check if destination table exists
DECLARE @DestSchemaQualifiedTableName NVARCHAR(257)
SET @DestSchemaQualifiedTableName = @DestSchema + '.' + @DestTable
IF OBJECT_ID(@DestSchemaQualifiedTableName) IS NULL
BEGIN
DECLARE @DestErr NVARCHAR(MAX)
SET @DestErr = 'Table ' + @DestSchemaQualifiedTableName + ' not found'
RAISERROR(@DestErr, 15, 1)
END
–Check if source table exists
DECLARE @SrcSchemaQualifiedTableName NVARCHAR(257)
SET @SrcSchemaQualifiedTableName = @SrcSchema + '.' + @SrcTable
IF OBJECT_ID(@SrcSchemaQualifiedTableName) IS NULL
BEGIN
DECLARE @SrcErr NVARCHAR(MAX)
SET @SrcErr = 'Table ' + @SrcSchemaQualifiedTableName + ' not found'
RAISERROR(@SrcErr, 15, 1)
END
–Move destination table to destination_old. Move source table to destination
DECLARE @RenameSql NVARCHAR(MAX)
SET @RenameSql = 'RENAME OBJECT ' + @DestSchemaQualifiedTableName + ' TO ' + @DestTable + '_old; '
Set @RenameSql = @RenameSql + ' RENAME OBJECT ' + @SrcSchemaQualifiedTableName + ' TO ' + @DestTable
PRINT 'Executing ' + @RenameSql + ' …'
EXEC sp_executesql @RenameSql;
–drop temp table if it exists
IF OBJECT_ID('tempDB..#tempApplySensitivityClassificationsToTable') IS NOT NULL
DROP TABLE #tempApplySensitivityClassificationsToTable;
–check if we should transfer data classifications from old to new table
IF ISNULL(@TransferMetadata,0) = 1
BEGIN
–put current classifications in a temp table
DECLARE @OldTable NVARCHAR(128) = @DestTable + '_old';
WITH CurrentClassifications as (
SELECT
CAST('dbo' as NVARCHAR(128)) [Schema],
CAST(sys.all_objects.name as NVARCHAR(128)) [Table],
CAST(sys.all_columns.name as NVARCHAR(128)) [Column],
CAST([Information_Type] as NVARCHAR(128)) [Informationtype],
CAST([Label] as NVARCHAR(128)) [Label]
FROM
sys.sensitivity_classifications
LEFT OUTER JOIN sys.all_objects on sys.sensitivity_classifications.major_id = sys.all_objects.object_id
LEFT OUTER JOIN sys.all_columns on sys.sensitivity_classifications.major_id = sys.all_columns.object_id
and sys.sensitivity_classifications.minor_id = sys.all_columns.column_id
)
SELECT ROW_NUMBER() OVER (ORDER BY [Schema], [Table], [Column]) [ID],
[Schema], [Table], [Column], [Informationtype], [Label]
INTO #tempApplySensitivityClassificationsToTable
FROM CurrentClassifications
WHERE [Schema] = @DestSchema AND [Table] = @OldTable;
DECLARE @i INT
SET @i = 1
DECLARE @Max INT
SELECT @Max = COUNT(*)
FROM #tempApplySensitivityClassificationsToTable;
PRINT 'Transferring ' + CAST(@Max as VARCHAR(4)) + ' classifications'
–drop and recreate sensitivity classifications
DECLARE @Sql NVARCHAR(MAX)
DECLARE @Col NVARCHAR(128)
DECLARE @InfoType NVARCHAR(128)
DECLARE @Label NVARCHAR(128)
WHILE @i <= @Max
BEGIN
SELECT @Col = [Column], @InfoType = [InformationType], @Label = [Label]
FROM #tempApplySensitivityClassificationsToTable
WHERE Id = @i
SET @Sql = 'DROP SENSITIVITY CLASSIFICATION FROM ' + @DestSchemaQualifiedTableName + '.' + @Col
PRINT 'Executing ' + @Sql + '…'
EXEC sp_executesql @Sql
SET @Sql = 'ADD SENSITIVITY CLASSIFICATION TO ' + @DestSchemaQualifiedTableName + '.' + @Col
IF (@InfoType IS NOT NULL AND @Label IS NOT NULL)
BEGIN
SET @Sql = @Sql + ' WITH (LABEL=''' + @Label + ''', INFORMATION_TYPE=''' + @InfoType + ''')'
END
ELSE IF (@InfoType IS NOT NULL)
BEGIN
SET @Sql = @Sql + ' WITH (INFORMATION_TYPE=''' + @InfoType + ''')'
END
ELSE IF (@Label IS NOT NULL)
BEGIN
SET @Sql = @Sql + ' WITH (LABEL=''' + @InfoType + ''')'
END
ELSE
BEGIN
SET @Sql = NULL
END
IF (@Sql IS NOT NULL)
BEGIN
PRINT 'Executing ' + @Sql + '…'
EXEC sp_executesql @Sql
END
SET @i = @i + 1
END
END
IF ISNULL(@DropOldTable,0) = 1
BEGIN
DECLARE @DropSql NVARCHAR(MAX)
SET @DropSql = 'DROP TABLE ' + @DestSchemaQualifiedTableName + '_old;'
PRINT 'Executing ' + @DropSql + '…'
EXEC sp_executesql @DropSql;
END
END TRY
BEGIN CATCH
Print 'ERROR… Procedure: ' + ERROR_PROCEDURE() + ' Message: ' + ERROR_MESSAGE()
END CATCH
END

Eventually, the tools will be updated to provide more visibility to data sensitivity classifications, but we still need to make sure they don’t get dropped.

For now, my recommendation is if you are going to go in and add a lot of sensitivity classifications, that you create a user defined restore point immediately after so that you know you have them in a backup somewhere. Azure SQL DW doesn’t do point-in-time restores the way Azure SQL DB does. It takes automatic restore points every 8 hours or so. So if someone went through the trouble of adding the sensitivity classifications and they were dropped through the data load process, there is no guarantee that you could use a backup to get them back.

Vote for My Enhancement Idea

If you would like Microsoft to add something to the product to keep sensitivity classifications from being dropped, or at least make it easier to add them back, please vote for my idea.

Not an Issue with Other Data Load Methods

Please note that if you are using other tools or methods to load your tables where you don’t swap them out, you won’t have the issue of dropping your sensitivity classifications. But I wanted to bring up this issue because I can see people spending a lot of time adding them and then suddenly losing them, and I want everyone to avoid that frustration.

Give Data Classifications a Try

I think data classifications are a good addition to SQL DW. Anything that helps us efficiently catalog and manage our sensitive data is good. I have added them in my demo environment and hope to use them in a client environment soon.

Have you tried out data classifications in SQL DW or DB? What do you think so far? If not, what is keeping you from using them?

Consulting, Data Warehousing, Microsoft Technologies

Ten Ways To Help Your BI Consultant Be Successful

I’ve been working in the field of business intelligence for over ten years, as a consultant for over five years. One thing I’ve learned from that time is that consultants need the client’s help to complete a project on time and on budget. Even if the consultants are doing the bulk of the work, project owners and stakeholders have a large impact on the project.

When you hire a business intelligence consultant, both you and the consultant want to see your project succeed. While a good consultant can guide you through important decisions and manage a BI project in addition to doing the technical development work, they need your help to get the project started off right and to continue to meet deadlines and requirements. A BI project requires collaboration between the consultant and the client. It’s usually not the type of thing you can throw over the fence and get back a satisfactory solution. We need to understand your business and how you think and talk about your data in order to give it meaning and make it accessible in a data model or report.

In my experience, we consultants write assumptions and project prerequisites into Statements of Work (SOWs) and mention them on planning calls, but we don’t always emphasize how important they are to project success. And we’ll often work around missing prerequisites to try to keep to the project schedule as best as possible, to varying degrees of success. As a client, your organization has allocated budget to your BI project that could have gone to other priorities. We understand this and are motivated the use that budget to accomplish your project goals, but we often spend a lot of project time overcoming obstacles related to lack of access to environments and technical assets and lack of client/stakeholder involvement. The problem/opportunity is already challenging or you wouldn’t need a BI consultant, so why not do what you can to remove barriers that are within your control and help steer the project toward success?

With the help of a few co-workers/work friends I’ve compiled a list of ten ways you can help your BI consultant (and therefore your BI project) be successful. Special thanks to Josh Roll, Melissa Coates, and Levi Syck for your input and feedback on this list.

  1. Have your data sources ready before you start. A good consultant can get started with design and stub things out or use fake data, but it will take us longer and quality will be questionable until we get our hands on real/realistic data.
  2. Work out data access (network/Azure/Power BI logins, VPN access, database access, etc.) for your consultants ahead of time, not on the day of project kick-off.  So many projects get stalled at the outset because the consultants don’t have access to the data and environments they need.
  3. Help your consultant understand any political/departmental boundaries too.  If you know that some department owns some needed data and that they are possessive about it, be up front and consider ways to get them involved, rather than leaving us to go and blunder through, possibly stepping on toes in the process. Provide context for the project. How does it help your organization achieve its larger goals? Who came up with the idea? Has something similar been tried before? Consultants get to do similar projects at different companies, so they bring good experience and ideas for overcoming technical and organizational challenges.
  4. Make sure you understand the time commitment of a BI project and make sure project owners, technical contacts, and subject matter experts can be available as needed. Be involved throughout the project, but especially during user acceptance testing to ensure our solution covers your use cases.
  5. Be able to define success criteria. You may not be able to dictate all the business and technical requirements, but you should be able to work out what success looks like on your project. Your consultant can help you define success, but things will go better if you have given this some thought beforehand.
  6. If you have existing database or ETL frameworks or naming conventions you would like to be used, make sure they are documented, or make someone available during the first few days of the project to explain them and answer questions. Don’t leave your consultant to guess.
  7. If your consultant sends you project planning and requirements documents up front, rather than after the fact, they are using these documents to establish understanding and agreement. Take the time to read them and ask questions. As consultants, we have a limited amount of time to become well acquainted with your data and use-cases, and we operate under the assumption that you will steer us in the right direction if you see us veering off the path.
  8. Be aware of your data hygiene. If your data is incomplete or dirty we’re going to need your help deciding how to handle it.
  9. Plan for an iterative development process. Know that everything probably won’t be perfect the first time. We probably can’t fit everything into the initial scope. Make sure there is room in the timeline for testing and rework. Generally, iterative projects have a higher chance of success than very large big bang projects. You can still get to the larger vision, just know that we will probably ask you to break it up into smaller, more manageable chunks. Also, be prepared to make decisions in the face of ambiguity.  Not all architecture and design decisions can be made with absolute certainty. But they often need to be decided to move forward and can be adjusted down the line, if necessary when priorities change or new information comes to light.
  10. Identify who will support the solution after the consultant is gone. Involve that person or team early. It’s better for the support team to learn about the solution over a period of weeks or months, rather than cramming everything into a knowledge transfer session and a document at the end of the engagement. If you don’t have anyone to support the solution, be honest with yourself and request a separate support contract up front and factor it into budget requests or allocations.

I hope you’ll find this list useful in planning your next engagement with a BI consultant. If you are a BI consultant or have worked with a BI consultant, please leave a comment about what you would add to this list.

Azure, Azure Data Factory, Biml, Data Warehousing, Microsoft Technologies

Copying data from On Prem SQL to ADLS with ADF and Biml – Part 1

Note: This post is about Azure Data Factory V1
Apologies for the overly acronym-laden title as I was trying to keep it concise but descriptive. And we all know that adding technologies to your repertoire means adding more acronyms.

My coworker Levi and I are working on a project where we copy data from an on-premises SQL Server 2014 database and land it in Azure Data Lake Store. Then we use Polybase to get the data into Azure SQL Data Warehouse and build a dimensional model. I’ve done a couple of small projects before with Azure Data Factory, but nothing as large as this one. We had 173 tables that we needed to copy to ADLS. Then we needed to set up incremental loads for 95 of those tables going forward.

My Azure Data Factory is made up of the following components:

  • Gateway – Allows ADF to retrieve data from an on premises data source
  • Linked Services – define the connection string and other connection properties for each source and destination
  • Datasets – Define a pointer to the data you want to process, sometimes defining the schema of the input and output data
  • Pipelines – combine the data sets and activities and define an execution schedule

Each of these objects is defined in a JSON file. Defining data sets and copy activities in JSON gets very tedious, especially when you need to do this for 100+ tables. Tedium usually indicates a repeatable pattern. If there is a repeatable pattern you can probably automate it. The gateway and linked services are one-time setup activities that weren’t worth automating for this project, but the datasets and pipelines definitely were.

In order to automate the generation of datasets and pipelines, we need a little help with some metadata. We had the client help us fill out an Excel spreadsheet that listed each table in our source database and the following characteristics relevant to the load to Azure:

  • Frequency (daily or hourly)
  • Changes Only (incremental or full load)
  • Changed Time Column (datetime column used for incremental loads)

That list plus the metadata we retrieved from SQL server for each table (column names and data types) were all we needed to automate the creation of the ADF datasets and pipelines with BimlScript.

This post will show how we built the data sets. The following post will show the pipelines with the copy activities.

First we need to generate the input datasets coming from SQL Server. We added some properties at the top and embedded some code nuggets to handle the values that are specific to each table.


<#@ property name="schema" type="string" #>
<#@ property name="table" type="string" #>
<#@ property name="frequency" type = "string" #>
{
"$schema": "http://datafactories.schema.management.azure.com/schemas/2015-09-01/Microsoft.DataFactory.Table.json&quot;,
"name": "DS_OnPremSQL_MySourceDB_<#=schema#>_<#=table#>",
"properties": {
"type": "SqlServerTable",
"linkedServiceName": "LS_OnPremSQL_MySourceDB",
"structure": [],
"external": true,
"typeProperties": {
"tableName": "<#=schema#>.<#=table#>"
},
"availability": {
<# if (frequency == "Daily") {#>
"frequency": "Day",
<#} else if (frequency == "Hourly") {#>
"frequency": "Hour",
"style": "EndOfInterval",
<# } #>
"interval": 1
}
}
}


<#@ property name="schema" type="string" #>
<#@ property name="table" type="string" #>
<#@ property name="frequency" type = "string" #>
<#@ property name="scope" type="string" #>
{
"$schema": "http://datafactories.schema.management.azure.com/schemas/2015-09-01/Microsoft.DataFactory.Table.json&quot;,
"name": "DS_DataLake_ClawfootSupply_<#=schema#>_<#=table#>",
"properties": {
"type": "AzureDataLakeStore",
"linkedServiceName": "LS_ADLS_SHDataLake",
"structure": [ ],
"typeProperties": {
"folderPath": "/raw/clawfootsupply/<#=schema#>_<#=table#>/<#=scope.ToLower()#>/{Year}/{Month}/{Day}/",
"fileName": "<#=schema#>_<#=table#>_{Hour}{Minute}{Second}.txt",
"format": {
"type": "TextFormat",
"columnDelimiter": "|",
"quoteChar": "\"",
"nullValue": ""
},
"partitionedBy": [
{
"name": "Year",
"value": {
"type": "DateTime",
"date": "SliceStart",
"format": "yyyy"
}
},
{
"name": "Month",
"value": {
"type": "DateTime",
"date": "SliceStart",
"format": "%M"
}
},
{
"name": "Day",
"value": {
"type": "DateTime",
"date": "SliceStart",
"format": "%d"
}
},
{
"name": "Hour",
"value": {
"type": "DateTime",
"date": "SliceStart",
"format": "hh"
}
},
{
"name": "Minute",
"value": {
"type": "DateTime",
"date": "SliceStart",
"format": "mm"
}
},
{
"name": "Second",
"value": {
"type": "DateTime",
"date": "SliceStart",
"format": "ss"
}
}
]
},
"availability": {
<# if (frequency == "Daily") {#>
"frequency": "Day",
<#} else if (frequency == "Hourly") {#>
"frequency": "Hour",
"style": "EndOfInterval",
<# } #>
"interval": 1
}
}
}

Next we need the output datasets for Azure Data Lake Store. We use the same three properties in generating each dataset- schema, table, frequency- and we add one more for scope.

Now we just need another BimlScript file that calls these two files. We broke our pipelines up into daily versus hourly and incremental versus full loads.

We used a helper code file and a separate environments file, which I’m glossing over so we can focus on the Biml for the ADF assets.  You can see that we read in the inputs from Excel and write some counts to a log file, just to make sure everything is working as intended. Starting on line 41 is where we generate the datasets. On lines 54 and 55, we use the CallBimlScript function to call the two files above. We end up generating datasets for the tables that are a full load each day and their counterpart datasets for the files we create in ADLS. The datasets for daily incremental loads are generated on lines 69 and 70. Then we do the hourly full loads and hourly incremental loads.  I’ll discuss lines 100 – 119 in my next post.


<#@ template tier="10" #>
<#@ import namespace="System.Data" #>
<#@ import namespace="System.Text" #>
<#@ code file="BGHelper.cs" #>
<#@ import namespace="BGHelper" #>
<Biml xmlns="http://schemas.varigence.com/biml.xsd"&gt;
</Biml>
<#
string mdFilePath = "C:\\Users\\admin\\Source\\Workspaces\\SH Data Warehouse\\metadata";
string mdFileName = "TargetTableMetadata.xlsx";
string mdWorkSheetName = "Metadata$";
bool mdHasHeader = true;
string logPath = "C:\\Users\\admin\\Source\\Workspaces\\Data Warehouse\\data_factory\\generate_data_factory_biml\\log.txt";
string adfProjPath = "C:\\Users\\admin\\Source\\Workspaces\\Data Warehouse\\data_factory\\data_factory\\";
DataSet ds = new DataSet();
ds = ExcelReader.ReadExcelQuery(mdFilePath, mdFileName, mdWorkSheetName, mdHasHeader);
System.IO.File.AppendAllText(@logPath, "MetaData File Path: " + System.IO.Path.Combine(mdFilePath, mdFileName).ToString() +Environment.NewLine );
System.IO.File.AppendAllText(@logPath, "MetaData File Path: " + System.IO.File.Exists(System.IO.Path.Combine(mdFilePath, mdFileName)).ToString() +Environment.NewLine );
System.IO.File.AppendAllText(@logPath, "Dataset table count: " + ds.Tables.Count.ToString() + Environment.NewLine);
DataView dailyFulls = new DataView(ds.Tables["Metadata"],"Frequency = 'Daily' and [Changes Only] = 'No'","", DataViewRowState.CurrentRows);
DataView dailyDeltas = new DataView(ds.Tables["Metadata"], "Frequency = 'Daily' and [Changes Only] = 'Yes'", "", DataViewRowState.CurrentRows);
DataView hourlyFulls = new DataView(ds.Tables["Metadata"], "Frequency = 'Hourly' and [Changes Only] = 'No'", "", DataViewRowState.CurrentRows);
DataView hourlyDeltas = new DataView(ds.Tables["Metadata"], "Frequency = 'Hourly' and [Changes Only] = 'Yes'", "", DataViewRowState.CurrentRows);
//log count of results for each filter
System.IO.File.AppendAllText(@logPath, "Daily Fulls Count: " + dailyFulls.Count.ToString() + Environment.NewLine);
System.IO.File.AppendAllText(@logPath, "Daily Deltas Count: " + dailyDeltas.Count.ToString() + Environment.NewLine);
System.IO.File.AppendAllText(@logPath, "Hourly Fulls Count: " + hourlyFulls.Count.ToString() + Environment.NewLine);
System.IO.File.AppendAllText(@logPath, "Hourly Deltas Count: " + hourlyDeltas.Count.ToString() + Environment.NewLine);
//Generate datasets
foreach (DataRowView rowView in dailyFulls)
{
DataRow row = rowView.Row;
string schemaName = row["SchemaName"].ToString();
string tableName = row["TableName"].ToString();
string frequency = row["Frequency"].ToString();
string scope = "full";
System.IO.File.AppendAllText(@logPath, "DailyFulls | " + row["SchemaName"].ToString() + "." + row["TableName"].ToString() + " | " + row["ColumnListForSelect"].ToString() + Environment.NewLine);
System.IO.File.WriteAllText(@adfProjPath + "DS_OnPremSQL_MySourceDB_" + schemaName + "_" + tableName + ".json", CallBimlScript("DS_OnPremSQL_MySourceDB_Schema_Table.biml", schemaName, tableName, frequency));
System.IO.File.WriteAllText(@adfProjPath + "DS_DataLake_MySourceDB_" + schemaName + "_" + tableName + ".json", CallBimlScript("DS_DataLake_MySourceDB_Schema_Table.biml", schemaName, tableName, frequency, scope));
}
foreach (DataRowView rowView in dailyDeltas)
{
DataRow row = rowView.Row;
string schemaName = row["SchemaName"].ToString();
string tableName = row["TableName"].ToString();
string frequency = row["Frequency"].ToString();
string scope = "deltas";
System.IO.File.AppendAllText(@logPath, "DailyFulls | " + row["SchemaName"].ToString() + "." + row["TableName"].ToString() + " | " + row["ColumnListForSelect"].ToString() + Environment.NewLine);
System.IO.File.WriteAllText(@adfProjPath + "DS_OnPremSQL_MySourceDB_" + schemaName + "_" + tableName + ".json", CallBimlScript("DS_OnPremSQL_MySourceDB_Schema_Table.biml", schemaName, tableName, frequency));
System.IO.File.WriteAllText(@adfProjPath + "DS_DataLake_MySourceDB_" + schemaName + "_" + tableName + ".json", CallBimlScript("DS_DataLake_MySourceDB_Schema_Table.biml", schemaName, tableName, frequency, scope));
}
foreach (DataRowView rowView in hourlyFulls)
{
DataRow row = rowView.Row;
string schemaName = row["SchemaName"].ToString();
string tableName = row["TableName"].ToString();
string frequency = row["Frequency"].ToString();
string scope = "full";
System.IO.File.AppendAllText(@logPath, "DailyFulls | " + row["SchemaName"].ToString() + "." + row["TableName"].ToString() + " | " + row["ColumnListForSelect"].ToString() + Environment.NewLine);
System.IO.File.WriteAllText(@adfProjPath + "DS_OnPremSQL_MySourceDB_" + schemaName + "_" + tableName + ".json", CallBimlScript("DS_OnPremSQL_MySourceDB_Schema_Table.biml", schemaName, tableName, frequency));
System.IO.File.WriteAllText(@adfProjPath + "DS_DataLake_MySourceDB_" + schemaName + "_" + tableName + ".json", CallBimlScript("DS_DataLake_MySourceDB_Schema_Table.biml", schemaName, tableName, frequency, scope));
}
foreach (DataRowView rowView in hourlyDeltas)
{
DataRow row = rowView.Row;
string schemaName = row["SchemaName"].ToString();
string tableName = row["TableName"].ToString();
string frequency = row["Frequency"].ToString();
string scope = "deltas";
System.IO.File.AppendAllText(@logPath, "DailyFulls | " + row["SchemaName"].ToString() + "." + row["TableName"].ToString() + " | " + row["ColumnListForSelect"].ToString() + Environment.NewLine);
System.IO.File.WriteAllText(@adfProjPath + "DS_OnPremSQL_MySourceDB_" + schemaName + "_" + tableName + ".json", CallBimlScript("DS_OnPremSQL_MySourceDB_Schema_Table.biml", schemaName, tableName, frequency));
System.IO.File.WriteAllText(@adfProjPath + "DS_DataLake_MySourceDB_" + schemaName + "_" + tableName + ".json", CallBimlScript("DS_DataLake_MySourceDB_Schema_Table.biml", schemaName, tableName, frequency, scope));
}
// Generate pipelines
System.IO.File.WriteAllText(@adfProjPath + "PL_Copy_MySourceDBToADLS_Daily_Full.json", CallBimlScript("PL_Copy_MySourceDBToADLS.biml", dailyFulls, "Daily", "Full"));
System.IO.File.WriteAllText(@adfProjPath + "PL_Copy_MySourceDBToADLS_Daily_Deltas.json", CallBimlScript("PL_Copy_MySourceDBToADLS.biml", dailyDeltas, "Daily", "Deltas"));
System.IO.File.WriteAllText(@adfProjPath + "PL_Copy_MySourceDBToADLS_Hourly_Full.json", CallBimlScript("PL_Copy_MySourceDBToADLS.biml", hourlyFulls, "Hourly", "Full"));
System.IO.File.WriteAllText(@adfProjPath + "PL_Copy_MySourceDBToADLS_Hourly_Deltas.json", CallBimlScript("PL_Copy_MySourceDBToADLS.biml", hourlyDeltas, "Hourly", "Deltas"));
//Generate One-Time Pipelines
System.IO.File.WriteAllText(@adfProjPath + "PL_Copy_MySourceDBToADLS_OneTime_DailyFulls.json", CallBimlScript("PL_Copy_MySourceDBToADLS_OneTime.biml", dailyFulls, "Daily", "Full"));
System.IO.File.WriteAllText(@adfProjPath + "PL_Copy_MySourceDBToADLS_OneTime_DailyDeltas.json", CallBimlScript("PL_Copy_MySourceDBToADLS_OneTime.biml", dailyDeltas, "Daily", "Deltas"));
System.IO.File.WriteAllText(@adfProjPath + "PL_Copy_MySourceDBToADLS_OneTime_HourlyFulls.json", CallBimlScript("PL_Copy_MySourceDBToADLS_OneTime.biml", hourlyFulls, "Hourly", "Full"));
System.IO.File.WriteAllText(@adfProjPath + "PL_Copy_MySourceDBToADLS_OneTime_HourlyDeltas.json", CallBimlScript("PL_Copy_MySourceDBToADLS_OneTime.biml", hourlyDeltas, "Hourly", "Deltas"));
#>

The Results

We were able to write the BimlScript and generate the datasets and pipelines in about 35 hours. A previous ADF project without automation took about 3 hours per source table. If we had gone that route, we could have been looking at 350 – 500 hours to complete this part of the project. Visual Studio with Biml Express took about 5 minutes to generate everything. Deploying to Azure took about an hour. We are now looking into ARM templates for future deployments.

Stay tuned for part 2 where I show how we generated the ADF pipelines.

https://gist.github.com/mlongoria/b1ce4334a793cca3d1b8c76c51c25dbe#file-adf_adls_dataset-biml

Azure, Data Warehousing, Microsoft Technologies, T-SQL

Create a Date Dimension in Azure SQL Data Warehouse

Most data warehouses and data marts require a date dimension or calendar table. Those of us that have been building data warehouses in SQL Server for a while have collected our favorite scripts to build out a date dimension. For a standard date dimension, I am a fan of Aaron  Bertrand’s script posted on MSSQLTips.com. But the current version (as of Aug 8, 2016) of Azure SQL Data Warehouse doesn’t support computed columns, which are used in Aaron’s script.

I adapted Aaron’s script to work in Azure SQL Data Warehouse and am sharing it with you below, so you don’t have to do the same. I did leave out holidays because I didn’t need them in my calendar table.  To add them back in, just add the column back to the table and use Aaron’s update statements. I also changed the MonthYear field to include a space between the month and year. Otherwise, my script should produce the same results as Aaron’s in a Azure SQL DW friendly way. Notice that I chose to use Round Robin distribution for my date table. Also, tables are now created with clustered columnstore indexes by default in Azure SQL DW.


Creates a table called RPT.Calendar. Change the table name on line 69. Change date range on line 2.
DECLARE @StartDate DATE = '20100101', @NumberOfYears INT = 30;
prevent set or regional settings from interfering with
interpretation of dates / literals
CREATE TABLE #dimdate
(
[date] DATE,
[day] tinyint,
[month] tinyint,
FirstOfMonth date,
[MonthName] varchar(12),
[week] tinyint,
[ISOweek] tinyint,
[DayOfWeek] tinyint,
[quarter] tinyint,
[year] smallint,
FirstOfYear date,
Style112 char(8),
Style101 char(10)
);
SET DATEFIRST 7;
SET DATEFORMAT mdy;
SET LANGUAGE US_ENGLISH;
DECLARE @CutoffDate DATE = DATEADD(YEAR, @NumberOfYears, @StartDate);
this is just a holding table for intermediate calculations:
use the catalog views to generate as many rows as we need
INSERT #dimdate([date])
SELECT d
FROM
(
SELECT d = DATEADD(DAY, rn 1, @StartDate)
FROM
(
SELECT TOP (DATEDIFF(DAY, @StartDate, @CutoffDate))
rn = ROW_NUMBER() OVER (ORDER BY s1.[object_id])
FROM sys.all_objects AS s1
CROSS JOIN sys.all_objects AS s2
on my system this would support > 5 million days
ORDER BY s1.[object_id]
) AS x
) AS y;
UPDATE #DimDate
set
[day] = DATEPART(DAY, [date]),
[month] = DATEPART(MONTH, [date]),
FirstOfMonth = CONVERT(DATE, DATEADD(MONTH, DATEDIFF(MONTH, 0, [date]), 0)),
[MonthName] = DATENAME(MONTH, [date]),
[week] = DATEPART(WEEK, [date]),
[ISOweek] = DATEPART(ISO_WEEK, [date]),
[DayOfWeek] = DATEPART(WEEKDAY, [date]),
[quarter] = DATEPART(QUARTER, [date]),
[year] = DATEPART(YEAR, [date]),
FirstOfYear = CONVERT(DATE, DATEADD(YEAR, DATEDIFF(YEAR, 0, [date]), 0)),
Style112 = CONVERT(CHAR(8), [date], 112),
Style101 = CONVERT(CHAR(10), [date], 101)
;
CREATE TABLE RPT.Calendar
WITH
(
DISTRIBUTION = ROUND_ROBIN
)
AS
SELECT
DateKey = CONVERT(INT, Style112),
[Date] = [date],
[Day] = CONVERT(TINYINT, [day]),
DaySuffix = CONVERT(CHAR(2), CASE WHEN [day] / 10 = 1 THEN 'th' ELSE
CASE RIGHT([day], 1) WHEN '1' THEN 'st' WHEN '2' THEN 'nd'
WHEN '3' THEN 'rd' ELSE 'th' END END),
[Weekday] = CONVERT(TINYINT, [DayOfWeek]),
[WeekDayName] = CONVERT(VARCHAR(10), DATENAME(WEEKDAY, [date])),
[DOWInMonth] = CONVERT(TINYINT, ROW_NUMBER() OVER
(PARTITION BY FirstOfMonth, [DayOfWeek] ORDER BY [date])),
[DayOfYear] = CONVERT(SMALLINT, DATEPART(DAYOFYEAR, [date])),
WeekOfMonth = CONVERT(TINYINT, DENSE_RANK() OVER
(PARTITION BY [year], [month] ORDER BY [week])),
WeekOfYear = CONVERT(TINYINT, [week]),
ISOWeekOfYear = CONVERT(TINYINT, ISOWeek),
[Month] = CONVERT(TINYINT, [month]),
[MonthName] = CONVERT(VARCHAR(10), [MonthName]),
[Quarter] = CONVERT(TINYINT, [quarter]),
QuarterName = CONVERT(VARCHAR(6), CASE [quarter] WHEN 1 THEN 'First'
WHEN 2 THEN 'Second' WHEN 3 THEN 'Third' WHEN 4 THEN 'Fourth' END),
[Year] = [year],
MMYYYY = CONVERT(CHAR(6), LEFT(Style101, 2) + LEFT(Style112, 4)),
MonthYear = CONVERT(CHAR(8), LEFT([MonthName], 3) + ' ' + LEFT(Style112, 4)),
FirstDayOfMonth = FirstOfMonth,
LastDayOfMonth = MAX([date]) OVER (PARTITION BY [year], [month]),
FirstDayOfQuarter = MIN([date]) OVER (PARTITION BY [year], [quarter]),
LastDayOfQuarter = MAX([date]) OVER (PARTITION BY [year], [quarter]),
FirstDayOfYear = FirstOfYear,
LastDayOfYear = MAX([date]) OVER (PARTITION BY [year]),
FirstDayOfNextMonth = DATEADD(MONTH, 1, FirstOfMonth),
FirstDayOfNextYear = DATEADD(YEAR, 1, FirstOfYear)
FROM #dimdate
;
DROP Table #dimdate;

Instead of the computed columns, I inserted the dates, then updated the other columns in the temporary table. Then I used the Create Table As Select syntax to create my final RPT.Calendar table.

Happy time trending!

BIDS Helper, Biml, Data Warehousing, Microsoft Technologies, SQL Server, SSIS

Type 6 or Hybrid Type 2 Slowly Changing Dimension with Biml

In my previous post, I provided the design pattern and Biml for a pure Type 2 Slowly Changing Dimension (SCD). When I say “pure Type 2 SCD”, I mean an ETL process that adds a new row for a change in any field in the dimension and never updates a dimension attribute without creating a new row.  In practice, I tend to create more hybrid Type 2 SCDs where updates to some attributes require a new row and others update the value on the existing rows. A similar pattern that I find I implement more often than a pure Type 2 is a Type 6 SCD. A Type 6 SCD builds on the Type 2 technique by adding current attributes alongside the historical attributes so related measures can be grouped by the historical or current dimension attribute values. The only difference between what I call a hybrid Type 2 and a Type 6 is that in the Type 6, there are no Type 1 attributes in the dimension that do not also have a Type 2 version in the dimension to capture the historical values.

Design Pattern

Control Flow

If you are comfortable with my design pattern for a pure Type 2 SCD in which a change of value in any column causes a new row, this pattern is quite similar. And the control flow is exactly the same. This pattern, as with my pure Type 2, assumes that rows are not deleted from the source system. You could easily modify this to check for deleted rows if needed.

Control Flow for a Hybrid Type 2 or Type 6 Dimension
Control Flow for a Hybrid Type 2 or Type 6 Dimension

The steps in the control flow are:

  1. Log package start.
  2. Truncate the update table.
  3. Execute the data flow.
  4. Execute the update statements to update columns and insert new rows.
  5. Log package end.

The update statements are different in this pattern, and I’ll explain those in detail below.

Data Flow

The data flow looks like a pure Type 2 SCD, with the exception of an added derived column transformation and minor changes to the lookup and conditional split. Again, I use views to integrate the data, apply business logic, and add hashkeys for change detection. Then I use SSIS to perform the mechanics of loading the data.

The steps in this data flow task are:

  1. Retrieve data from my source view.
  2. Count the rows for package logging purposes.
  3. Perform a lookup to see if the entity already exists in the dimension table.
  4. If the entity doesn’t exist at all in the dimension table, it goes into the left path where I count the number of rows, add a derived column that sets the row start date to “01/01/1900 00:00:00”, and then insert the row into the dimension table.
  5. If the entity does exist in the table, I check it for changes.
  6. If there are changes to the entity, I count the number of rows, us a derived column to flag the type(s) of changes to make, and then insert the row into an update table.
  7. Entities with no changes are simply counted for audit purposes.

The Source View

This SSIS pattern requires 3 hashed values for for change detection:

  • HistoricalHashKey: the unique identifier of the entity, the natural key that ties the historical rows together
  • ChangeHashKey: the columns on the dimension that cause a new row to be created and the current row to be expired
  • UpdateHashKey: the columns on the dimension that should be updated in place

In my example view below, the Route ID and Warehouse identify a unique route.  The supervisor, route description and route type are all important attributes of the route.  The route area identifies the metro area in which a route is located. If this should change, there is no need to see previous values; we just want to see the current value on every row.

 CREATE View [dbo].[StgDimRoute] as   
    with [Routebase] as   
    (   SELECT [RouteID]  
    , [RouteDescription]  
    , [Supervisor]  
    , [RouteType]  
    , [Warehouse]  
    , [RouteArea]  
     , 1 as RowIsCurrent     
  FROM Stg.Routes R  
  Left JOIN dbo.StgWarehouse W ON W.WarehouseID = R.WarehouseID  
  UNION   
  select -1 as RouteID, 'Unknown' as RouteDescription, 'Unknown' as Supervisor,   
  'Unknown' as RouteType, 'Unknown' as Warehouse, 'Unknown' as RouteArea, 
  1 as RowIsCurrent 
      ) ,  
 Routedata as   
 (  
 Select RouteID, RouteDescription, Supervisor, RouteType, Warehouse, 
   RouteArea, RowIsCurrent 
  , CONVERT(VARBINARY(20), HASHBYTES('MD5', CONCAT([RouteID], ' ',Warehouse ))) 
      AS HistoricalHashKey  
  , CONVERT(VARBINARY(20), HASHBYTES('MD5', CONCAT(Supervisor, [RouteDescription], 
      RouteType))) AS ChangeHashKey  
  , Convert(VARBINARY(20), HASHBYTES('MD5', CONCAT(Area,' '))) AS UpdateHashKey  
  from Routebase sb  
  )   
 Select RouteID, RouteDescription, Supervisor, RouteType, Warehouse,  
  RouteArea, RowIsCurrent  
  , HistoricalHashKey, ChangeHashKey, UpdateHashKey  
  , CONVERT(VARCHAR(34), HistoricalHashKey, 1) AS HistoricalHashKeyASCII  
  , CONVERT(VARCHAR(34), ChangeHashKey, 1) AS ChangeHashKeyASCII  
  , CONVERT(VARCHAR(34), UpdateHashKey, 1) AS UpdateHashKeyASCII  
  , dateadd(MINUTE, -1, current_timestamp) as RowEndDate
  , CURRENT_TIMESTAMP as RowStartDate   
  from Routedata ds  

The RowEndDate value in this view will be used for routes that require a current row to be expired since my pattern is the leave the end date of the current row null.

The Change Detection Lookup

The lookup in my DFT compares the HistoricalHashKeyASCII column from the source view with the varchar version of the HistoricalHashKey from the dimension table and adds two lookup columns: lkp_ChangeHashKeyASCII and lkp_UpdateHashKeyASCII to the data set.

Type 6 historical hash lookup

Rows that do not match are new rows; i.e., that route has never been in the dimension table before.  Rows that do match have a row in the dimension table and will then be evaluated to see if there are any changes in the values for that route.

Derived Column for New Rows

The no match output of the lookup are new rows for routes that are not in the dimension table. Since this is the first row in the table for that route, we want this row to be effective from the beginning of time until the end of time. The beginning of time in this data mart is 01/01/1900. Since the data is loaded multiple times per day, I use a date/time field rather than a date. If you only need the precision of a day, you can cut your row end date/time back to just a date. In my pattern, the current row has a null row end date, but you could easily add a derived column to set the end date to 12/31/9999 if you prefer.

Derived Column for Start Date

Conditional Split for Change Detection

This time, I have to check to see if both the ChangeHashKeyASCII and the UpdateHashKeyASCII match in my conditional split.

Type 6 CSPL

If both hashed values from the source view match the hashed values from the dimension table, then no changes are required and the row is simply counted for audit purposes.

If either hashed value doesn’t match, there is an update to be made.

Derived Column to Identify Change Types

We again compare the UpdateHashKeyASCII value from the source view with that of the dimension. If they don’t match, we set the UpdateInPlace flag to true. If the ChangeHashKeyASCII values don’t match, we set the UpdateNewRow flag to true. If a row has both types of changes, both types of updates will be made.

My update table contains the UpdateInPlace and UpdateNewRow columns, so I can reference these flags in my update statements.

The Update Statements

The update statements in the control flow take the changes from the update table and apply them to the dimension table. Three statements are executed in the Execute SQL Statement labeled SQL Update DimRoute.

 UPDATE A   
 SET  A.RouteArea = UA.RouteArea,   
      A.UpdateHashKey = UA.UpdateHashKey,   
      A.UpdtDtTm = CURRENT_TIMESTAMP   
 FROM  Updt.UpdtRoute AS UA   
     JOIN Dbo.DimRoute AS A   
      ON UA.HistoricalHashKey = A.HistoricalHashKey   
 WHERE UpdateInPlace = 1; 
  
UPDATE A   
 SET  RowIsCurrent = 0,   
     A.RowEndDate = UA.RowEndDate,   
     A.UpdtDtTm = CURRENT_TIMESTAMP   
 FROM  Updt.UpdtRoute UA   
     JOIN Dbo.DimRoute AS A   
      ON UA.HistoricalHashKey = A.HistoricalHashKey   
 WHERE A.RowIsCurrent = 1   
     AND UA.UpdateNewRow = 1;   

 INSERT INTO Dbo.DimRoute   
       (RouteID,   
        RouteDescription,   
        Supervisor,   
        RouteType,   
        Warehouse,   
        RouteArea,   
        RowIsCurrent,   
        RowStartDate,   
        RowEndDate,   
        HistoricalHashKey,   
        ChangeHashKey,   
        UpdateHashKey)   
 SELECT RouteID,   
        RouteDescription,   
        Supervisor,   
        RouteType,   
        Warehouse,   
        RouteArea,   
        1,   
        RowStartDate,   
        null,   
        HistoricalHashKey,   
        ChangeHashKey,   
        UpdateHashKey
FROM Updt.UpdtRoute WHERE UpdateNewRow = 1; 

The first statement updates the values for the columns designated to be updated in place by joining the update table to the dimension table based on the HistoricalHashKey column. This is the same as performing updates in a Type 1 SCD.

The second statement expires all the rows for which a new row will be added. The third statement inserts the new rows with the RowIsCurrent value set to 1 and the RowEndDate set to null.

The Biml

If you are using Biml, you know that you can create a design pattern for this type of dimension load and reuse it across multiple projects. This speeds up development and ensures that your Type 2 Hybrid or Type 6 dimensions are implemented consistently.

As usual, I have 3 Biml files that are used to create the SSIS package:

  • ProjectConnections.biml – contains all the project-level connections for the project
  • Dim2Hybrid.biml – contains the SSIS design pattern with code nuggets that parameterize it to make it reusable
  • CreateDim2HybridPackages.biml – calls Dim2Hybrid.biml and passes along the values to be used for each package

I’ve pasted Dim2Hybrid and CreateDim2Hybrid below.

Dim2Hybrid.Biml

 <#@ template language="C#" tier="2" #>  
 <#@ property name="PackageName" type="String" #>  
 <#@ property name="DstSchemaName" type="String" #>  
 <#@ property name="DstTableName" type="String" #>  
 <#@ property name="DstConnection" type="String" #>  
 <#@ property name="DataFlowSourceName" type="String" #>  
 <#@ property name="SrcConnection" type="String" #>  
 <#@ property name="SourceQuery" type="String" #>     
 <#@ property name="UpdateSchemaName" type="String" #>     
 <#@ property name="UpdateTableName" type="String" #>     
 <#@ property name="UpdateConnection" type="String" #>        
 <#@ property name="UpdateSQLStatement" type="String" #>      
     <Package Name="<#=PackageName#>" Language="None">  
   <Parameters>  
     <Parameter DataType="String" Name="ParentPackageID">00000000-0000-0000-0000-000000000000</Parameter>  
   </Parameters>  
   <Variables>  
     <Variable EvaluateAsExpression="true" DataType="String" IncludeInDebugDump="Exclude" Name="QualifiedTableSchema">"[" + @[User::SchemaName] + "].[" + @[User::TableName] + "]"</Variable>  
     <Variable DataType="String" IncludeInDebugDump="Exclude" Name="QueryAuditStart">EXECUTE [dbo].[PackageControlStart] @PackageName = ?, @PackageId = ?, @ParentPackageId = ?, @ExecutionId = ?, @StartTime = ?;</Variable>  
     <Variable DataType="String" IncludeInDebugDump="Exclude" Name="QueryAuditUpdate">EXECUTE [dbo].[PackageControlStop] @PackageId = ?, @ExecutionId = ?, @InsertRowQuantity = ?, @UpdateRowQuantity = ?, @UnchangedRowQuantity=?;</Variable>  
     <Variable DataType="Int32" IncludeInDebugDump="Include" Name="RowCountChanged">0</Variable>  
     <Variable DataType="Int32" IncludeInDebugDump="Include" Name="RowCountNew">0</Variable>  
     <Variable DataType="Int32" IncludeInDebugDump="Include" Name="RowCountSource">0</Variable>  
     <Variable DataType="Int32" IncludeInDebugDump="Include" Name="RowCountUnchanged">0</Variable>   
     <Variable DataType="String" IncludeInDebugDump="Exclude" Name="SchemaName"><#=DstSchemaName#></Variable>  
     <Variable DataType="String" IncludeInDebugDump="Exclude" Name="TableName"><#=DstTableName#></Variable>  
   </Variables>  
   <Tasks>    
     <Dataflow Name="DFT_Insert<#=DstTableName#>">  
       <Transformations>  
         <RowCount Name="CNT_Changed_Rows" VariableName="User.RowCountChanged">  
           <InputPath OutputPathName="CSPL Check For Changes.ChangedRows" />  
     </RowCount>  
         <ConditionalSplit Name="CSPL Check For Changes">  
           <InputPath OutputPathName="LKP Historical Key.Match" />  
           <OutputPaths>  
             <OutputPath Name="ChangedRows">  
               <Expression>(ChangeHashKeyASCII != lkp_ChangeHashKeyASCII) || (UpdateHashKeyASCII != lkp_UpdateHashKeyASCII)</Expression>  
     </OutputPath>  
   </OutputPaths>  
     </ConditionalSplit>  
         <RowCount Name="CNT_New_Rows" VariableName="User.RowCountNew">  
           <InputPath OutputPathName="LKP Historical Key.NoMatch" />  
     </RowCount>  
       <DerivedColumns Name="Identify Change Types">  
           <InputPath SsisName="Derived Column Input" OutputPathName="CNT_Changed_Rows.Output" />  
           <Columns>  
             <Column Name="UpdateInPlace" DataType="Boolean">UpdateHashKeyASCII != lkp_UpdateHashKeyASCII ? (DT_BOOL)1 : (DT_BOOL)0</Column>  
             <Column Name="UpdateNewRow" DataType="Boolean">ChangeHashKeyASCII != lkp_ChangeHashKeyASCII ? (DT_BOOL)1 : (DT_BOOL)0</Column>  
   </Columns>  
     </DerivedColumns>  
         <Lookup Name="LKP Historical Key" NoMatchBehavior="RedirectRowsToNoMatchOutput" OleDbConnectionName="<#=DstConnection#>">  
             <DirectInput>SELECT  
   CONVERT(VARCHAR(34), ChangeHashKey, 1) AS ChangeHashKeyASCII  
 , CONVERT(VARCHAR(34), HistoricalHashKey, 1) AS HistoricalHashKeyASCII  
  ,CONVERT(VARCHAR(34), UpdateHashKey, 1) as UpdateHashkeyAscii  
 FROM  
  <#=DstSchemaName#>.<#=DstTableName#>  
     where rowiscurrent = 1  
  </DirectInput> 
           <Parameters>  
             <Parameter SourceColumn="HistoricalHashKeyASCII" />  
   </Parameters>  
            <ParameterizedQuery>select * from (SELECT  
  CONVERT(VARCHAR(34), ChangeHashKey, 1) AS ChangeHashKeyASCII  
 , CONVERT(VARCHAR(34), HistoricalHashKey, 1) AS HistoricalHashKeyASCII  
  ,CONVERT(VARCHAR(34), UpdateHashKey, 1) as UpdateHashkeyAscii  
 FROM  
  <#=DstSchemaName#>.<#=DstTableName#>  
     where rowiscurrent = 1) [refTable]  
 where [refTable].[HistoricalHashKeyASCII] = ?</ParameterizedQuery>  
            <InputPath OutputPathName="CNT_Source_Rows.Output" />  
           <Inputs>  
             <Column SourceColumn="HistoricalHashKeyASCII" TargetColumn="HistoricalHashKeyASCII" />  
   </Inputs>  
           <Outputs>  
             <Column SourceColumn="ChangeHashKeyASCII" TargetColumn="lkp_ChangeHashKeyASCII" />  
                   <Column SourceColumn="UpdateHashkeyAscii" TargetColumn="lkp_UpdateHashKeyASCII" />  
   </Outputs>  
     </Lookup>  
         <DerivedColumns Name="NewItemRowStartDate">  
        <InputPath OutputPathName="CNT_New_Rows.Output" />  
          <Columns>  
         <Column ReplaceExisting="true" Name="RowStartDate" DataType="DateTime">  
          (DT_DBTIMESTAMP)"01/01/1900 00:00:00"  
         </Column>  
        </Columns>  
       </DerivedColumns>  
            <OleDbDestination Name="OLE_DST_New_Rows" ConnectionName="<#=DstConnection#>">  
           <InputPath OutputPathName="NewItemRowStartDate.Output" />  
            <ExternalTableOutput Table="<#=DstSchemaName#>.<#=DstTableName#>" />   
                <Columns>  
                   <Column SourceColumn="RowEndDate" IsUsed="false"/>  
                </Columns>  
     </OleDbDestination>  
         <RowCount Name="CNT_Source_Rows" VariableName="User.RowCountSource">  
           <InputPath OutputPathName="<#=DataFlowSourceName#>.Output" />  
     </RowCount>  
         <OleDbSource Name="<#=DataFlowSourceName#>" ConnectionName="<#=SrcConnection#>"> 
                <DirectInput><#=SourceQuery#></DirectInput>  
     </OleDbSource>  
         <RowCount Name="CNT_Unchanged_Rows" VariableName="User.RowCountUnchanged">  
           <InputPath OutputPathName="CSPL Check For Changes.Default" />  
     </RowCount>  
         <OleDbDestination Name="OLE_DST Update Table" ConnectionName="<#=DstConnection#>">   
           <InputPath OutputPathName="Identify Change Types.Output" />  
            <ExternalTableOutput Table="[<#=UpdateSchemaName#>].[<#=UpdateTableName#>]" />  
     </OleDbDestination>  
   </Transformations>  
        <PrecedenceConstraints>  
           <Inputs>  
              <Input OutputPathName="SQL Truncate <#=UpdateTableName#>.Output" />   
           </Inputs>  
        </PrecedenceConstraints>  
     </Dataflow>  
     <ExecuteSQL Name="SQL Begin Audit" ConnectionName="Audit">  
       <VariableInput VariableName="User.QueryAuditStart" />  
       <Parameters>  
         <Parameter Name="0" VariableName="System.PackageName" DataType="String" Length="-1" />  
         <Parameter Name="1" VariableName="System.PackageID" DataType="Guid" Length="-1" />  
         <Parameter Name="2" VariableName="ParentPackageID" DataType="Guid" Length="-1" />  
         <Parameter Name="3" VariableName="System.ServerExecutionID" DataType="Int64" Length="-1" />  
         <Parameter Name="4" VariableName="System.StartTime" DataType="Date" Length="-1" />  
   </Parameters>  
     </ExecuteSQL>  
     <ExecuteSQL Name="SQL Close Audit" ConnectionName="Audit">  
       <VariableInput VariableName="User.QueryAuditUpdate" />  
       <Parameters>  
         <Parameter Name="0" VariableName="System.PackageID" DataType="Guid" Length="-1" />  
         <Parameter Name="1" VariableName="System.ServerExecutionID" DataType="Int64" Length="-1" />  
         <Parameter Name="2" VariableName="User.RowCountNew" DataType="Int32" Length="-1" />  
         <Parameter Name="3" VariableName="User.RowCountChanged" DataType="Int32" Length="-1" />  
         <Parameter Name="4" VariableName="User.RowCountUnchanged" DataType="Int32" Length="-1" />  
   </Parameters>  
        <PrecedenceConstraints>  
           <Inputs>  
              <Input OutputPathName="SQL Update <#=DstTableName#>.Output" />  
           </Inputs>  
        </PrecedenceConstraints>  
     </ExecuteSQL>    
     <ExecuteSQL Name="SQL Truncate <#=UpdateTableName#>" ConnectionName="<#=UpdateConnection#>">  
       <DirectInput>Truncate Table [<#=UpdateSchemaName#>].[<#=UpdateTableName#>]</DirectInput>  
        <PrecedenceConstraints>  
           <Inputs>  
              <Input OutputPathName="SQL Begin Audit.Output" />  
           </Inputs>  
        </PrecedenceConstraints>  
     </ExecuteSQL>  
     <ExecuteSQL Name="SQL Update <#=DstTableName#>" ConnectionName="<#=DstConnection#>">  
          <DirectInput><#=UpdateSQLStatement#></DirectInput>  
        <PrecedenceConstraints>  
           <Inputs>  
              <Input OutputPathName="DFT_Insert<#=DstTableName#>.Output" /> 
           </Inputs>  
        </PrecedenceConstraints>  
     </ExecuteSQL>  
   </Tasks>  
 </Package>  

Create Dim2Hybrid.Biml

 

 <#@ template language="C#" hostspecific="true" #>  
 <Biml xmlns="http://schemas.varigence.com/biml.xsd">  
 <#@ include file="ProjectConnection.biml" #>  
 <!--  
    <#  
    string PackageName = "LoadDimRoute";  
    string DstSchemaName = "dbo";  
   string DstTableName = "DimRoute";  
    string DstConnection = "AWBIML";     
    string DataFlowSourceName = "OLE_SRC_StgDimRoute";  
    string SrcConnection = "AWBIML";  
    string SourceQuery = @"Select RouteID, RouteDescription, Supervisor, RouteType, Warehouse,   
   RouteArea, RowIsCurrent   
  , HistoricalHashKey, ChangeHashKey, UpdateHashKey   
  , HistoricalHashKeyASCII   
  , ChangeHashKeyASCII   
  , UpdateHashKeyASCII   
  , RowEndDate  
  , RowStartDate    
  FROM [dbo].[StgDimRoute];" ;  
    string UpdateSchemaName = "Updt";     
    string UpdateTableName = "UpdtRoute";  
    string UpdateConnection = "AWBIML";     
 string UpdateSQLStatement = @" UPDATE A    
  SET A.RouteArea = UA.RouteArea,    
    A.UpdateHashKey = UA.UpdateHashKey,    
    A.UpdtDtTm = CURRENT_TIMESTAMP    
  FROM Updt.UpdtRoute AS UA    
    JOIN Dbo.DimRoute AS A    
    ON UA.HistoricalHashKey = A.HistoricalHashKey    
  WHERE UpdateInPlace = 1;   
  UPDATE A    
  SET RowIsCurrent = 0,    
    A.RowEndDate = UA.RowEndDate,    
    A.UpdtDtTm = CURRENT_TIMESTAMP    
  FROM Updt.UpdtRoute UA    
    JOIN Dbo.DimRoute AS A    
    ON UA.HistoricalHashKey = A.HistoricalHashKey    
  WHERE A.RowIsCurrent = 1    
    AND UA.UpdateNewRow = 1;    
  INSERT INTO Dbo.DimRoute    
     (RouteID,    
     RouteDescription,    
     Supervisor,    
     RouteType,    
     Warehouse,    
     RouteArea,    
     RowIsCurrent,    
     RowStartDate,    
     RowEndDate,    
     HistoricalHashKey,    
     ChangeHashKey,    
     UpdateHashKey)    
  SELECT RouteID,    
     RouteDescription,    
     Supervisor,    
     RouteType,    
     Warehouse,    
     RouteArea,    
     1,    
     RowStartDate,    
     null,    
     HistoricalHashKey,    
     ChangeHashKey,    
     UpdateHashKey  
  FROM Updt.UpdtRoute WHERE UpdateNewRow = 1;" ;     
 #>  
 -->  
   <Packages>  
     <#=CallBimlScript("Dim2Hybrid.biml", PackageName, DstSchemaName, DstTableName, DstConnection, DataFlowSourceName, SrcConnection, SourceQuery, UpdateSchemaName, UpdateTableName, UpdateConnection, UpdateSQLStatement)#>  
   </Packages>  
 </Biml>  

Once I have my source view, dimension table, and update table in the database, the 3 Biml files added to my project, and BIDSHelper installed, all I have to do is right click on the CreateDim2Hybrid.Biml file and choose Generate SSIS packages to create my package.

BIDS Helper, Biml, Data Warehousing, Microsoft Technologies, SSIS, T-SQL

Biml for a Type 1 Slowly Changing Dimension

I’ve been working on building my Biml library over the last few months. One of the first design patterns I created was a Type 1 Slowly Changing Dimension where all fields except the key fields that define the level of granularity are overwritten with updated values.  It assumes I have a staging table, but it could probably be easily modified to pull directly from the source table if needed.

For simplicity, I’m going to create DimSalesReason from the AdventureWorks database as my example. Please pardon the large amount of code in this post. I wanted to provide a fairly complete picture so you can see how the pieces fit together so I’m providing all of the SQL and Biml needed to generate the package.

Required Database Objects

My staging table is a copy of source table and looks like this:

CREATE TABLE [Staging].[SalesReason](
    [SalesReasonID] [int] NOT NULL,
    [Name] [nvarchar](50) NOT NULL,
    [ReasonType] [nvarchar](50) NOT NULL,
    [ModifiedDate] [datetime] NOT NULL
)

Here is the DDL for my dimension table:

CREATE TABLE [dbo].[DimSalesReason](
    [SalesReasonKey] int IDENTITY(1,1) NOT NULL Primary Key
    ,[SalesReasonID] int not null
    ,[SalesReasonName] nvarchar(50) not null
    ,[SalesReasonType] nvarchar(50) not null
    ,[HistoricalHashKey] varbinary(20) NOT NULL
    ,[ChangeHashKey] varbinary(20) NOT NULL
    ,[InsertDttm] datetime not null
    ,[UpdateDttm] datetime NULL
)

You’ll notice I have some audit fields in my table. UpdateDttm is the datetime when the row was last updated. InsertDttm is the datetime when the row was initially inserted. The HistoricalHashKey and ChangeHashKey are used for change detection. I may not need them in a dimension this simple, but I use them in larger/wider dimensions and I like my dimensions to be built consistently (unless they need to be tweaked for performance). The HistoricalHashKey represents the business key of the table that defines the level of uniqueness for the row in the dimension table. The ChangeHashKey represents all the other fields that are used in the dimension row. This makes it easier to determine if values have changed since I can compare just the ChangeHashKey instead of each individual field.

I also have an update table. This allows me to update in bulk rather than updating in place row by row. With a very small table, you might not notice a performance difference, but this pattern becomes much more efficient as the table grows.

CREATE TABLE [Updt].[UpdtSalesReason](
	 [SalesReasonID] int not null
	,[SalesReasonName] nvarchar(50) not null
	,[SalesReasonType] nvarchar(50) not null
	,[HistoricalHashKey] varbinary(20) NOT NULL
	,[ChangeHashKey] varbinary(20) NOT NULL
	,[InsertDttm] datetime not null
)

I use a view to do most of the transformation work (joins, business logic, etc.) for my Type 1 SCD package. Here’s my view for the SalesReasonDimension:

Create View [Staging].[DimSalesReason] as
With SalesReasonData as (
	Select	 [SalesReasonID] 
		,[Name] as [SalesReasonName] 
		,[ReasonType] as [SalesReasonType] 
		,CONVERT(VARBINARY(20), HASHBYTES('MD5', CONCAT(SalesReasonID, ' '))) 
                                                               AS [HistoricalHashKey]
		,CONVERT(VARBINARY(20), HASHBYTES('MD5', CONCAT(Name, ' ', ReasonType))) 
                                                                   AS [ChangeHashKey]
		,CURRENT_TIMESTAMP as InsertDttm
		,CURRENT_TIMESTAMP as UpdtDttm
		from 
		[Staging].[SalesReason]
	UNION 
	Select	-1 as [SalesReasonID]
	   , 'Unknown' as [SalesReasonName]
	   , 'Unknown' as [SalesReasonType]
	   ,CONVERT(VARBINARY(20), HASHBYTES('MD5', CONCAT(-1, ' '))) 
                                                               AS [HistoricalHashKey]
	   ,CONVERT(VARBINARY(20), HASHBYTES('MD5', CONCAT('Unknown', ' ', 'Unknown')))
                                                               AS [ChangeHashKey]
	   ,CURRENT_TIMESTAMP as InsertDttm
           ,CURRENT_TIMESTAMP as UpdtDttm
)
Select SalesReasonID
	   , SalesReasonName
	   , SalesReasonType
	   , HistoricalHashKey
	   , ChangeHashKey
	   , CONVERT(VARCHAR(34), HistoricalHashKey, 1) AS HistoricalHashKeyASCII
	   , CONVERT(VARCHAR(34), ChangeHashKey, 1) AS ChangeHashKeyASCII 
	   , InsertDttm
	   , UpdtDttm
from SalesReasonData

Using a view to do the transformation work allows me to use the SQL Server engine to do the things it is good at (joins, case statements, conversions) and SSIS to do the things it is good at (controlling the flow of data). It also makes it easy to create a design pattern in Biml that is abstract enough that it easily fits most SCD 1 scenarios. The hashkey fields allow me to do a lookup on one field to determine if the row already exists in the dimension table and a quick comparison to determine if any of the values for that row have changed. I chose to use an MD5 hash because I think it is a good balance of speed/size and collision risk. You may feel differently (others prefer SHA1 due to lower collision risk). On a dimension this size, the difference probably isn’t noticeable.

You can also see that I add my unknown member row in my view. Some people prefer to set the surrogate key of the unknown value to -1. I prefer to set the business key to -1 and let the surrogate key be set to any value in the load process. I’m not a fan of having to turn off the identity insert to add the unknown row. When I do the dimension key lookup for fact tables, I just look for the row where the business key = -1. This also means I don’t have to check my dimension tables to see if someone remembered to insert the unknown rows after deployment to a new environment because I know the unknown rows will be inserted when the package is run.

I have audit steps in my package that write to the Audit.PackageControl table using stored procedures.

CREATE TABLE [Audit].[Package_Control](
	[Package_NM] [varchar](100) NOT NULL,
	[Package_ID] [uniqueidentifier] NOT NULL,
	[Parent_Package_ID] [uniqueidentifier] NULL,
	[Execution_ID] [bigint] NULL,
	[Start_TS] [datetime] NOT NULL,
	[Stop_TS] [datetime] NULL,
	[Insert_Row_QT] [int] NULL,
	[Update_Row_QT] [int] NULL,
	[Unchanged_Row_QT] [int] NULL,
	[Deleted_Row_QT] [int] NULL,
	[Duration_s]  AS (datediff(second,[Start_TS],[Stop_TS])),
	[PackageLogID] [int] IDENTITY(1,1) NOT NULL
)

CREATE PROCEDURE [Audit].[PackageControlStart]
(
    @PackageName varchar(100)
,   @PackageId uniqueidentifier
,   @ParentPackageId uniqueidentifier = NULL
,   @ExecutionId bigint
,   @StartTime DATETIME
,   @StopTime datetime = NULL
,   @InsertRowQuantity int = NULL
,   @UpdateRowQuantity int = NULL
,   @UnchangedRowQuantity int = NULL
)
AS
BEGIN
    SET NOCOUNT ON;
	DECLARE @PackageLogId int 
    INSERT INTO [Audit].[Package_Control]
    (
        [Package_NM]
    ,   [Package_ID]
    ,   [Parent_Package_ID]
    ,   [Execution_ID]
    ,   [Start_TS]
    ,   [Stop_TS]
    ,   [Insert_Row_QT]
    ,   [Update_Row_QT]
    ,	[Unchanged_Row_QT]
    )
    SELECT
        @PackageName 
    ,   @PackageId 
    ,   @ParentPackageId 
    ,   @ExecutionId 
    ,   CURRENT_TIMESTAMP
    ,   @StopTime 
    ,   @InsertRowQuantity 
    ,   @UpdateRowQuantity 
    ,	@UnchangedRowQuantity
  SELECT @PackageLogID = SCOPE_IDENTITY()
  SELECT  @PackageLogID as PackageLogID
END

CREATE PROCEDURE [Audit].[PackageControlStop]
(
    @PackageId uniqueidentifier
,   @ExecutionId bigint
,   @InsertRowQuantity int = NULL
,   @UpdateRowQuantity int = NULL
,	@UnchangedRowQuantity int = NULL
)
AS
BEGIN
    SET NOCOUNT ON;
    -- Close out the execution.
    UPDATE PC
    SET [Stop_TS] = CURRENT_TIMESTAMP  
    ,   [Insert_Row_QT] = @InsertRowQuantity
    ,   [Update_Row_QT] = @UpdateRowQuantity
	,	[Unchanged_Row_QT] = @UnchangedRowQuantity
    FROM  [Audit].[Package_Control] AS PC
    WHERE PC.Package_ID = @PackageId
        AND PC.Execution_ID = @ExecutionId
        AND PC.[Stop_TS] IS NULL;  
END

And Now for the Biml

My Biml library usually contains 3 files for each package type:

  • ProjectConnections – I like to keep my connection managers in a separate file so I only have to update one place if I need to add or change a connection.
  • Dim1 – This contains my actual design pattern.
  • CreateDim1 – This is the Biml file I run to generate the package. It gets separated so I can pull values from databases and pass in variables to my design pattern using BimlScript. For this example I have hardcoded my variables into this file rather than pulling from a database.

ProjectConnections.Biml

<#@ template language="C#" tier="1" #>	
<Connections>
    <OleDbConnection Name="AWBIML" ConnectionString ="Data Source=localhost\SQL2014;
        Initial Catalog=AWBIML;Integrated Security=SSPI;Provider=SQLNCLI11.1;" CreateInProject="true"/>
    <OleDbConnection Name="Audit" ConnectionString ="Data Source=localhost\SQL2014;
        Initial Catalog=AWBIML;Integrated Security=SSPI;Provider=SQLNCLI11.1;" CreateInProject="true"/>	
</Connections>

The ProjectConnections file doesn’t begin with Biml tags because it gets included in the middle of the CreateDim1.biml file. AWBIML is a local database I created to contain my data mart.

Dim1.Biml

<#@ template language="C#" tier="2" #>

<#@ property name="PackageName" type="String" #>
<#@ property name="DstSchemaName" type="String" #>
<#@ property name="DstTableName" type="String" #>
<#@ property name="DstConnection" type="String" #>
<#@ property name="DataFlowSourceName" type="String" #>
<#@ property name="SrcConnection" type="String" #>
<#@ property name="SourceQuery" type="String" #>	
<#@ property name="UpdateSchemaName" type="String" #>	
<#@ property name="UpdateTableName" type="String" #>	
<#@ property name="UpdateConnection" type="String" #>		
<#@ property name="UpdateSQLStatement" type="String" #>	
	
<Package Name="<#=PackageName#>" Language="None">
    <Parameters>
        <Parameter DataType="String" Name="ParentPackageID">00000000-0000-0000-0000-000000000000</Parameter>
    </Parameters>
    <Variables>
        <Variable EvaluateAsExpression="true" DataType="String"  
          Name="QualifiedTableSchema">"["+@[User::SchemaName]+"].["+@[User::TableName]+"]"
        </Variable>
        <Variable DataType="String" Name="QueryAuditStart">
          EXECUTE [Audit].[PackageControlStart] @PackageName=?, @PackageId=?, 
          @ParentPackageId=?,@ExecutionId=?, @StartTime=?;
        </Variable>
        <Variable DataType="String" Name="QueryAuditUpdate">
          EXECUTE [Audit].[PackageControlStop] @PackageId=?, @ExecutionId=?, 
          @InsertRowQuantity=?, @UpdateRowQuantity=?, @UnchangedRowQuantity=?;</Variable> 
        <Variable DataType="Int32"  Name="RowCountChanged">0</Variable>
        <Variable DataType="Int32"  Name="RowCountNew">0</Variable>
        <Variable DataType="Int32"  Name="RowCountSource">0</Variable>
        <Variable DataType="Int32"  Name="RowCountUnchanged">0</Variable>
        <Variable DataType="String" Name="SchemaName"><#=DstSchemaName#></Variable>
        <Variable DataType="String" Name="TableName"><#=DstTableName#></Variable>
    </Variables>
    
    <Tasks>
      <ExecuteSQL Name="SQL Begin Audit" ConnectionName="Audit">
       <VariableInput VariableName="User.QueryAuditStart" />
        <Parameters>
          <Parameter Name="0" VariableName="System.PackageName" DataType="String" Length="-1"/>
          <Parameter Name="1" VariableName="System.PackageID" DataType="Guid" Length="-1"/>
          <Parameter Name="2" VariableName="ParentPackageID" DataType="Guid" Length="-1"/>
          <Parameter Name="3" VariableName="System.ServerExecutionID" DataType="Int64" 
            Length="-1"/>
          <Parameter Name="4" VariableName="System.StartTime" DataType="Date" Length="-1"/>
        </Parameters>
      </ExecuteSQL>

        <ExecuteSQL Name="SQL Truncate <#=UpdateTableName#>" 
          ConnectionName="<#=UpdateConnection#>"> 
            <DirectInput>Truncate Table [<#=UpdateSchemaName#>].[<#=UpdateTableName#>]
            </DirectInput>
          <PrecedenceConstraints>
            <Inputs>
              <Input OutputPathName="SQL Begin Audit.Output" />
            </Inputs>
          </PrecedenceConstraints>
        </ExecuteSQL>

        <Dataflow Name="DFT Insert<#=DstTableName#>">
            
            <Transformations>
            
                <RowCount Name="CNT Changed_Rows" VariableName="User.RowCountChanged">
                    <InputPath OutputPathName="CSPL Check For Changes.ChangedRows" />
                </RowCount>
                
                <ConditionalSplit Name="CSPL Check For Changes">
                    <InputPath OutputPathName="LKP Historical Key.Match" />
                    <OutputPaths>
                        <OutputPath Name="ChangedRows">
                          <Expression>ChangeHashKeyASCII != lkp_ChangeHashKeyASCII</Expression>
                        </OutputPath>
                    </OutputPaths>
                </ConditionalSplit>
                
                <RowCount Name="CNT New_Rows" VariableName="User.RowCountNew">
                    <InputPath OutputPathName="LKP Historical Key.NoMatch" />
                </RowCount>
        
                <Lookup Name="LKP Historical Key" NoMatchBehavior="RedirectRowsToNoMatchOutput" 
                  OleDbConnectionName="<#=DstConnection#>">
                    <DirectInput>SELECT
                      CONVERT(VARCHAR(34), ChangeHashKey, 1) AS ChangeHashKeyASCII
                      , CONVERT(VARCHAR(34), HistoricalHashKey, 1) AS HistoricalHashKeyASCII
                      FROM
                      <#=DstSchemaName#>.<#=DstTableName#></DirectInput>
                   <Parameters>
                     <Parameter SourceColumn="HistoricalHashKeyASCII" />
                   </Parameters>
	           <ParameterizedQuery>select * from (SELECT
                     CONVERT(VARCHAR(34), ChangeHashKey, 1) AS ChangeHashKeyASCII
                     , CONVERT(VARCHAR(34), HistoricalHashKey, 1) AS HistoricalHashKeyASCII
                     FROM
                     <#=DstSchemaName#>.<#=DstTableName#>) [refTable]
                     where [refTable].[HistoricalHashKeyASCII] = ?</ParameterizedQuery>
                  <InputPath OutputPathName="CNT Source_Rows.Output" />
                  <Inputs>
                    <Column SourceColumn="HistoricalHashKeyASCII" 
                      TargetColumn="HistoricalHashKeyASCII" />
                  </Inputs>
                  <Outputs>
                    <Column SourceColumn="ChangeHashKeyASCII" 
                    TargetColumn="lkp_ChangeHashKeyASCII" />
                  </Outputs>
                </Lookup>
       
            <OleDbDestination Name="OLE_DST New_Rows" ConnectionName="<#=DstConnection#>">
              <InputPath OutputPathName="CNT New_Rows.Output" />
	      <ExternalTableOutput Table="<#=DstSchemaName#>.<#=DstTableName#>" /> 
            </OleDbDestination>
            
            <RowCount Name="CNT Source_Rows" VariableName="User.RowCountSource">
              <InputPath OutputPathName="<#=DataFlowSourceName#>.Output" />
            </RowCount>
            
            <OleDbSource Name="<#=DataFlowSourceName#>" ConnectionName="<#=SrcConnection#>">
	      <DirectInput><#=SourceQuery#></DirectInput>
            </OleDbSource>
               
            <RowCount Name="CNT Unchanged_Rows" VariableName="User.RowCountUnchanged">
              <InputPath OutputPathName="CSPL Check For Changes.Default" />
            </RowCount>
                
            <OleDbDestination Name="OLE_DST Update Table" ConnectionName="<#=DstConnection#>"> 
              <InputPath OutputPathName="CNT Changed_Rows.Output" />
	      <ExternalTableOutput Table="[<#=UpdateSchemaName#>].[<#=UpdateTableName#>]" />
            </OleDbDestination>
    
     </Transformations>
    
     <PrecedenceConstraints>
      <Inputs>
          <Input OutputPathName="SQL Truncate <#=UpdateTableName#>.Output" /> 
      </Inputs>
     </PrecedenceConstraints>
    </Dataflow>
            
    <ExecuteSQL Name="SQL Update <#=DstTableName#>" 
    ConnectionName="<#=DstConnection#>">
      <DirectInput><#=UpdateSQLStatement#></DirectInput>
        <PrecedenceConstraints>
          <Inputs>
            <Input OutputPathName="DFT Insert<#=DstTableName#>.Output" /> 
          </Inputs>
        </PrecedenceConstraints>
    </ExecuteSQL>
    
    <ExecuteSQL Name="SQL Close Audit" ConnectionName="Audit">
      <VariableInput VariableName="User.QueryAuditUpdate" />
        <Parameters>
          <Parameter Name="0" VariableName="System.PackageID" DataType="Guid" Length="-1"/>
          <Parameter Name="1" VariableName="System.ServerExecutionID" DataType="Int64"
            Length="-1"/>
          <Parameter Name="2" VariableName="User.RowCountNew" DataType="Int32" Length="-1"/>
          <Parameter Name="3" VariableName="User.RowCountChanged" DataType="Int32" 
            Length="-1"/>
          <Parameter Name="4" VariableName="User.RowCountUnchanged" DataType="Int32" 
            Length="-1"/>
        </Parameters>
        <PrecedenceConstraints>
          <Inputs>
            <Input OutputPathName="SQL Update <#=DstTableName#>.Output" /> 
          </Inputs>
        </PrecedenceConstraints>
    </ExecuteSQL>
				    
  </Tasks>

</Package>

CreateDim1.Biml

<#@ template language="C#" hostspecific="true" #>
<Biml xmlns="http://schemas.varigence.com/biml.xsd">
<#@ include file="ProjectConnection.biml" #>

	<!--
	<#
	string PackageName  = "LoadDimSalesReason";
	string DstSchemaName = "dbo";
	string DstTableName = "DimSalesReason";
	string DstConnection = "AWBIML";
	string DataFlowSourceName = "OLE_SRC StgDimSalesReason";
	string SrcConnection = "AWBIML";
	string SourceQuery = @"SELECT [SalesReasonID]
					  ,[SalesReasonName]
					  ,[SalesReasonType]
					  ,[HistoricalHashKey]
					  ,[ChangeHashKey]
					  ,[HistoricalHashKeyASCII]
					  ,[ChangeHashKeyASCII]
					  ,[InsertDttm]
					  ,[UpdtDttm]
				  FROM [AWBIML].[Staging].[DimSalesReason]";
	string UpdateSchemaName = "Updt";
	string UpdateTableName = "UpdtSalesReason";
	string UpdateConnection = "AWBIML";
	string UpdateSQLStatement = @"Update d 
					set 
					d.SalesReasonName = u.SalesReasonName,
					d.SalesReasonType = u.SalesReasonType,
					d.changehashkey = u.changehashkey 
					FROM Updt.UpdtSalesReason u 
					inner join dbo.DimSalesReason d
                                        on u.HistoricalHashKey = d.HistoricalHashKey;";
	#>
	-->
	
    <Packages>
        <#=CallBimlScript("Dim1.biml", PackageName, DstSchemaName, DstTableName, DstConnection,
        DataFlowSourceName, SrcConnection, SourceQuery, UpdateSchemaName, UpdateTableName, 
        UpdateConnection, UpdateSQLStatement)#>
    </Packages>
	
</Biml>

Resulting SSIS Package

Executing the CreateDim1.Biml file generates a package called LoadDimSalesReason.dtsx. Here is the control flow:
BIML Dim1 ControlFlow
I start the package by logging the package start in my PackageControl table. Then I truncate my update table to prepare for the new data load. Next I have a data flow task, which inserts data into either the dbo.DimSalesReason table or the Updt.UpdtSalesReason table (see below). The SQL Update task updates the DimSalesReason table with the rows that were inserted into UpdtSalesReason. And finally, I log my package completion.

Here’s my data flow:
BIML Dim1 DataFlow
First, I retrieve my data from my staging view. Then I count the number of rows coming from my source so I can log it in my PackageControl table. The lookup on the HistoricalHashKey field sends rows with no match to be inserted into the dimension table. If the rows have a match, they are checked to see if their ChangeHashKey values match the value of the ChangeHaskey in the existing row in the dimension table. If ChangeHashKey values match, the row is counted and nothing else is done. If the ChangeHashKey values don’t match, the row is counted and written to the update table.

And that is all you need to create a Type 1 dimension load with Biml.

Books, Data Visualization, Data Warehousing, Excel, PerformancePoint, SSAS, SSIS, Uncategorized

The One Book

I tend to get some variation of the following question as I present at SQL Saturdays and work with clients:

What is the one book I should read to gain a good understanding of this topic?

There are many great books out there on business intelligence, data warehousing, and the Microsoft BI stack.  Here is my list of those that had a significant impact on my learning and career or that I consider the current defining book on the subject.

Data Warehousing/Dimensional Modeling –  The Data Warehouse Toolkit by Ralph Kimball and Margy Ross (link is to the newest edition of the book) defines dimensional modeling, walks through many common data scenarios and explains recommended data modeling and ETL practices. If you are designing, building, or enhancing a dimensional data warehouse, please read this book.

Master Data ManagementMaster Data Management by David Loshin thoroughly explains the resources required, helps clarify goals, and discusses the challenges of implementing a master data management program in your organization. The book also discusses architecture options and provides a roadmap that you can adapt to your project. I actually did an independent study class based upon this book as part of my MBA. I would recommend it to anyone just getting into MDM as a great platform agnostic view of MDM.

Agile Data WarehousingAgile Data Warehousing Project Management: Business Intelligence Systems Using Scrum by Ralph Hughes is an enlightening read about applying the agile process to data warehousing by someone who understands both.  Use this book to help set expectations at the beginning of projects, estimate effort, decide what should be included in an iteration, and provide value early in your project.  It addresses DW enhancements as well as new DW projects.

SSISProfessional Microsoft SQL Server 2012 Integration Services contains great information whether you are just learning SSIS or you just need to understand the new features in 2012. It covers each task and data flow component, offers tips for performance tuning, and provides great examples.  If you haven’t moved up to SQL Server 2012 yet, you will want to learn about the project deployment model, environment variables, and project-level connection managers, which are all covered in this book.

SSAS TabularMicrosoft SQL Server 2012 Analysis Services: The BISM Tabular Model by Marco Russo, Alberto Ferrari, and Chris Webb
There aren’t that many books available on SSAS tabular models since it is fairly new. This one got me through a couple of projects as I learned tabular cubes and DAX.  You will read almost every page while building your first tabular cube.  Then you will return to find helpful tips on writing DAX calculations.

Data Visualization –  Information Dashboard Design by Stephen Few (link is to the newest edition of the book)
This book is what got me interested in data visualization.  The way I design reports and dashboards changed after I read it. I was torn whether to go with this book or Show Me the Numbers, but I think Information Dashboard Design covers the main points of Show Me the Numbers that I most want people to understand (it was also the first of Few’s books that I read). Today’s technology enables us to make reports that are high on shiny and low on actionable information. I see two troubling trends: reports that are just wide tables with several hundred lines that no one can read at one time, and really shiny dashboards with lots of bells and whistles that make them look really cool. This book will tell you there is an appropriate time and place for tables and dashboards.  And while visual appeal is good, it should not be at the expense of effectively communicating the intended message of the information.

Up Next

I just picked up some books that look very promising and may cause me to update my list above:

 

Happy reading!