BIDS Helper, Microsoft Technologies, SSIS

Using a Variable to Populate the Query in a Lookup in SSIS

I encountered a situation on my last SSIS project in which I needed to be able to populate the query in lookup with a where clause that referenced a project parameter. This wasn’t something I had ever needed to do in the past, so I had to do a bit of digging to figure it out. Luckily, I found this post by John Welch, which led me in the right direction. I’m posting my steps in detail here both to help anyone else trying to figure this out and also so I have it for reference should I need to do this again.

The Situation

I had warranty registration data that contained addresses. Earlier in the project, we used Melissa Data to do some address cleansing and to retrieve MSA and county FIPS codes for each address. Melissa Data returns results with a confidence score that reflects how well the address matched. We determined that we needed a confidence score of .8 in order to use the data. The client asked that we make an environment variable for the required confidence score so it could be easily changed without having to re-deploy the SSIS project. Later in the project, we had some warranty registration data for which we needed to look up the MSA. But we only wanted to return the MSA if the confidence score tied to the address was at or above the required confidence score we had set in the project parameter.

I already had a project parameter for the required confidence score.

projectparam

So I made a package-level variable to hold the query for the lookup that referenced the project parameter.  Then I set the lookup property to use that query.

The SSIS Setup

I created a variable in my package to hold the query for the SSIS lookup which referenced the project parameter in the where clause. (The scope was the package and the data type was String.)

SSIS Variable query

I already had my data flow populated with the lookup for MSA. I set it to full cache and entered a query in the connection to initially populate the fields that would be returned (simply my lookup query without the where clause).

Next, I opened the data flow properties, located Expressions and clicked on the ellipses to open the Property Expression  Editor. I found the SQLCommand property for my MSA lookup and set it to my package variable that contained my query.

ssis dft propertiesSince I had BIDS Helper installed, I could see that I had an expression set for the lookup as denoted by the pink glyph.

ssis data flow

 

And that’s all there is to it.

I could have avoided the extra variable and put the lookup query directly into the property expression, but I prefer having the variable to reference instead of digging through the properties.

BIDS Helper, Biml, Data Warehousing, Microsoft Technologies, SQL Server, SSIS

Type 6 or Hybrid Type 2 Slowly Changing Dimension with Biml

In my previous post, I provided the design pattern and Biml for a pure Type 2 Slowly Changing Dimension (SCD). When I say “pure Type 2 SCD”, I mean an ETL process that adds a new row for a change in any field in the dimension and never updates a dimension attribute without creating a new row.  In practice, I tend to create more hybrid Type 2 SCDs where updates to some attributes require a new row and others update the value on the existing rows. A similar pattern that I find I implement more often than a pure Type 2 is a Type 6 SCD. A Type 6 SCD builds on the Type 2 technique by adding current attributes alongside the historical attributes so related measures can be grouped by the historical or current dimension attribute values. The only difference between what I call a hybrid Type 2 and a Type 6 is that in the Type 6, there are no Type 1 attributes in the dimension that do not also have a Type 2 version in the dimension to capture the historical values.

Design Pattern

Control Flow

If you are comfortable with my design pattern for a pure Type 2 SCD in which a change of value in any column causes a new row, this pattern is quite similar. And the control flow is exactly the same. This pattern, as with my pure Type 2, assumes that rows are not deleted from the source system. You could easily modify this to check for deleted rows if needed.

Control Flow for a Hybrid Type 2 or Type 6 Dimension
Control Flow for a Hybrid Type 2 or Type 6 Dimension

The steps in the control flow are:

  1. Log package start.
  2. Truncate the update table.
  3. Execute the data flow.
  4. Execute the update statements to update columns and insert new rows.
  5. Log package end.

The update statements are different in this pattern, and I’ll explain those in detail below.

Data Flow

The data flow looks like a pure Type 2 SCD, with the exception of an added derived column transformation and minor changes to the lookup and conditional split. Again, I use views to integrate the data, apply business logic, and add hashkeys for change detection. Then I use SSIS to perform the mechanics of loading the data.

The steps in this data flow task are:

  1. Retrieve data from my source view.
  2. Count the rows for package logging purposes.
  3. Perform a lookup to see if the entity already exists in the dimension table.
  4. If the entity doesn’t exist at all in the dimension table, it goes into the left path where I count the number of rows, add a derived column that sets the row start date to “01/01/1900 00:00:00”, and then insert the row into the dimension table.
  5. If the entity does exist in the table, I check it for changes.
  6. If there are changes to the entity, I count the number of rows, us a derived column to flag the type(s) of changes to make, and then insert the row into an update table.
  7. Entities with no changes are simply counted for audit purposes.

The Source View

This SSIS pattern requires 3 hashed values for for change detection:

  • HistoricalHashKey: the unique identifier of the entity, the natural key that ties the historical rows together
  • ChangeHashKey: the columns on the dimension that cause a new row to be created and the current row to be expired
  • UpdateHashKey: the columns on the dimension that should be updated in place

In my example view below, the Route ID and Warehouse identify a unique route.  The supervisor, route description and route type are all important attributes of the route.  The route area identifies the metro area in which a route is located. If this should change, there is no need to see previous values; we just want to see the current value on every row.

 CREATE View [dbo].[StgDimRoute] as   
    with [Routebase] as   
    (   SELECT [RouteID]  
    , [RouteDescription]  
    , [Supervisor]  
    , [RouteType]  
    , [Warehouse]  
    , [RouteArea]  
     , 1 as RowIsCurrent     
  FROM Stg.Routes R  
  Left JOIN dbo.StgWarehouse W ON W.WarehouseID = R.WarehouseID  
  UNION   
  select -1 as RouteID, 'Unknown' as RouteDescription, 'Unknown' as Supervisor,   
  'Unknown' as RouteType, 'Unknown' as Warehouse, 'Unknown' as RouteArea, 
  1 as RowIsCurrent 
      ) ,  
 Routedata as   
 (  
 Select RouteID, RouteDescription, Supervisor, RouteType, Warehouse, 
   RouteArea, RowIsCurrent 
  , CONVERT(VARBINARY(20), HASHBYTES('MD5', CONCAT([RouteID], ' ',Warehouse ))) 
      AS HistoricalHashKey  
  , CONVERT(VARBINARY(20), HASHBYTES('MD5', CONCAT(Supervisor, [RouteDescription], 
      RouteType))) AS ChangeHashKey  
  , Convert(VARBINARY(20), HASHBYTES('MD5', CONCAT(Area,' '))) AS UpdateHashKey  
  from Routebase sb  
  )   
 Select RouteID, RouteDescription, Supervisor, RouteType, Warehouse,  
  RouteArea, RowIsCurrent  
  , HistoricalHashKey, ChangeHashKey, UpdateHashKey  
  , CONVERT(VARCHAR(34), HistoricalHashKey, 1) AS HistoricalHashKeyASCII  
  , CONVERT(VARCHAR(34), ChangeHashKey, 1) AS ChangeHashKeyASCII  
  , CONVERT(VARCHAR(34), UpdateHashKey, 1) AS UpdateHashKeyASCII  
  , dateadd(MINUTE, -1, current_timestamp) as RowEndDate
  , CURRENT_TIMESTAMP as RowStartDate   
  from Routedata ds  

The RowEndDate value in this view will be used for routes that require a current row to be expired since my pattern is the leave the end date of the current row null.

The Change Detection Lookup

The lookup in my DFT compares the HistoricalHashKeyASCII column from the source view with the varchar version of the HistoricalHashKey from the dimension table and adds two lookup columns: lkp_ChangeHashKeyASCII and lkp_UpdateHashKeyASCII to the data set.

Type 6 historical hash lookup

Rows that do not match are new rows; i.e., that route has never been in the dimension table before.  Rows that do match have a row in the dimension table and will then be evaluated to see if there are any changes in the values for that route.

Derived Column for New Rows

The no match output of the lookup are new rows for routes that are not in the dimension table. Since this is the first row in the table for that route, we want this row to be effective from the beginning of time until the end of time. The beginning of time in this data mart is 01/01/1900. Since the data is loaded multiple times per day, I use a date/time field rather than a date. If you only need the precision of a day, you can cut your row end date/time back to just a date. In my pattern, the current row has a null row end date, but you could easily add a derived column to set the end date to 12/31/9999 if you prefer.

Derived Column for Start Date

Conditional Split for Change Detection

This time, I have to check to see if both the ChangeHashKeyASCII and the UpdateHashKeyASCII match in my conditional split.

Type 6 CSPL

If both hashed values from the source view match the hashed values from the dimension table, then no changes are required and the row is simply counted for audit purposes.

If either hashed value doesn’t match, there is an update to be made.

Derived Column to Identify Change Types

We again compare the UpdateHashKeyASCII value from the source view with that of the dimension. If they don’t match, we set the UpdateInPlace flag to true. If the ChangeHashKeyASCII values don’t match, we set the UpdateNewRow flag to true. If a row has both types of changes, both types of updates will be made.

My update table contains the UpdateInPlace and UpdateNewRow columns, so I can reference these flags in my update statements.

The Update Statements

The update statements in the control flow take the changes from the update table and apply them to the dimension table. Three statements are executed in the Execute SQL Statement labeled SQL Update DimRoute.

 UPDATE A   
 SET  A.RouteArea = UA.RouteArea,   
      A.UpdateHashKey = UA.UpdateHashKey,   
      A.UpdtDtTm = CURRENT_TIMESTAMP   
 FROM  Updt.UpdtRoute AS UA   
     JOIN Dbo.DimRoute AS A   
      ON UA.HistoricalHashKey = A.HistoricalHashKey   
 WHERE UpdateInPlace = 1; 
  
UPDATE A   
 SET  RowIsCurrent = 0,   
     A.RowEndDate = UA.RowEndDate,   
     A.UpdtDtTm = CURRENT_TIMESTAMP   
 FROM  Updt.UpdtRoute UA   
     JOIN Dbo.DimRoute AS A   
      ON UA.HistoricalHashKey = A.HistoricalHashKey   
 WHERE A.RowIsCurrent = 1   
     AND UA.UpdateNewRow = 1;   

 INSERT INTO Dbo.DimRoute   
       (RouteID,   
        RouteDescription,   
        Supervisor,   
        RouteType,   
        Warehouse,   
        RouteArea,   
        RowIsCurrent,   
        RowStartDate,   
        RowEndDate,   
        HistoricalHashKey,   
        ChangeHashKey,   
        UpdateHashKey)   
 SELECT RouteID,   
        RouteDescription,   
        Supervisor,   
        RouteType,   
        Warehouse,   
        RouteArea,   
        1,   
        RowStartDate,   
        null,   
        HistoricalHashKey,   
        ChangeHashKey,   
        UpdateHashKey
FROM Updt.UpdtRoute WHERE UpdateNewRow = 1; 

The first statement updates the values for the columns designated to be updated in place by joining the update table to the dimension table based on the HistoricalHashKey column. This is the same as performing updates in a Type 1 SCD.

The second statement expires all the rows for which a new row will be added. The third statement inserts the new rows with the RowIsCurrent value set to 1 and the RowEndDate set to null.

The Biml

If you are using Biml, you know that you can create a design pattern for this type of dimension load and reuse it across multiple projects. This speeds up development and ensures that your Type 2 Hybrid or Type 6 dimensions are implemented consistently.

As usual, I have 3 Biml files that are used to create the SSIS package:

  • ProjectConnections.biml – contains all the project-level connections for the project
  • Dim2Hybrid.biml – contains the SSIS design pattern with code nuggets that parameterize it to make it reusable
  • CreateDim2HybridPackages.biml – calls Dim2Hybrid.biml and passes along the values to be used for each package

I’ve pasted Dim2Hybrid and CreateDim2Hybrid below.

Dim2Hybrid.Biml

 <#@ template language="C#" tier="2" #>  
 <#@ property name="PackageName" type="String" #>  
 <#@ property name="DstSchemaName" type="String" #>  
 <#@ property name="DstTableName" type="String" #>  
 <#@ property name="DstConnection" type="String" #>  
 <#@ property name="DataFlowSourceName" type="String" #>  
 <#@ property name="SrcConnection" type="String" #>  
 <#@ property name="SourceQuery" type="String" #>     
 <#@ property name="UpdateSchemaName" type="String" #>     
 <#@ property name="UpdateTableName" type="String" #>     
 <#@ property name="UpdateConnection" type="String" #>        
 <#@ property name="UpdateSQLStatement" type="String" #>      
     <Package Name="<#=PackageName#>" Language="None">  
   <Parameters>  
     <Parameter DataType="String" Name="ParentPackageID">00000000-0000-0000-0000-000000000000</Parameter>  
   </Parameters>  
   <Variables>  
     <Variable EvaluateAsExpression="true" DataType="String" IncludeInDebugDump="Exclude" Name="QualifiedTableSchema">"[" + @[User::SchemaName] + "].[" + @[User::TableName] + "]"</Variable>  
     <Variable DataType="String" IncludeInDebugDump="Exclude" Name="QueryAuditStart">EXECUTE [dbo].[PackageControlStart] @PackageName = ?, @PackageId = ?, @ParentPackageId = ?, @ExecutionId = ?, @StartTime = ?;</Variable>  
     <Variable DataType="String" IncludeInDebugDump="Exclude" Name="QueryAuditUpdate">EXECUTE [dbo].[PackageControlStop] @PackageId = ?, @ExecutionId = ?, @InsertRowQuantity = ?, @UpdateRowQuantity = ?, @UnchangedRowQuantity=?;</Variable>  
     <Variable DataType="Int32" IncludeInDebugDump="Include" Name="RowCountChanged">0</Variable>  
     <Variable DataType="Int32" IncludeInDebugDump="Include" Name="RowCountNew">0</Variable>  
     <Variable DataType="Int32" IncludeInDebugDump="Include" Name="RowCountSource">0</Variable>  
     <Variable DataType="Int32" IncludeInDebugDump="Include" Name="RowCountUnchanged">0</Variable>   
     <Variable DataType="String" IncludeInDebugDump="Exclude" Name="SchemaName"><#=DstSchemaName#></Variable>  
     <Variable DataType="String" IncludeInDebugDump="Exclude" Name="TableName"><#=DstTableName#></Variable>  
   </Variables>  
   <Tasks>    
     <Dataflow Name="DFT_Insert<#=DstTableName#>">  
       <Transformations>  
         <RowCount Name="CNT_Changed_Rows" VariableName="User.RowCountChanged">  
           <InputPath OutputPathName="CSPL Check For Changes.ChangedRows" />  
     </RowCount>  
         <ConditionalSplit Name="CSPL Check For Changes">  
           <InputPath OutputPathName="LKP Historical Key.Match" />  
           <OutputPaths>  
             <OutputPath Name="ChangedRows">  
               <Expression>(ChangeHashKeyASCII != lkp_ChangeHashKeyASCII) || (UpdateHashKeyASCII != lkp_UpdateHashKeyASCII)</Expression>  
     </OutputPath>  
   </OutputPaths>  
     </ConditionalSplit>  
         <RowCount Name="CNT_New_Rows" VariableName="User.RowCountNew">  
           <InputPath OutputPathName="LKP Historical Key.NoMatch" />  
     </RowCount>  
       <DerivedColumns Name="Identify Change Types">  
           <InputPath SsisName="Derived Column Input" OutputPathName="CNT_Changed_Rows.Output" />  
           <Columns>  
             <Column Name="UpdateInPlace" DataType="Boolean">UpdateHashKeyASCII != lkp_UpdateHashKeyASCII ? (DT_BOOL)1 : (DT_BOOL)0</Column>  
             <Column Name="UpdateNewRow" DataType="Boolean">ChangeHashKeyASCII != lkp_ChangeHashKeyASCII ? (DT_BOOL)1 : (DT_BOOL)0</Column>  
   </Columns>  
     </DerivedColumns>  
         <Lookup Name="LKP Historical Key" NoMatchBehavior="RedirectRowsToNoMatchOutput" OleDbConnectionName="<#=DstConnection#>">  
             <DirectInput>SELECT  
   CONVERT(VARCHAR(34), ChangeHashKey, 1) AS ChangeHashKeyASCII  
 , CONVERT(VARCHAR(34), HistoricalHashKey, 1) AS HistoricalHashKeyASCII  
  ,CONVERT(VARCHAR(34), UpdateHashKey, 1) as UpdateHashkeyAscii  
 FROM  
  <#=DstSchemaName#>.<#=DstTableName#>  
     where rowiscurrent = 1  
  </DirectInput> 
           <Parameters>  
             <Parameter SourceColumn="HistoricalHashKeyASCII" />  
   </Parameters>  
            <ParameterizedQuery>select * from (SELECT  
  CONVERT(VARCHAR(34), ChangeHashKey, 1) AS ChangeHashKeyASCII  
 , CONVERT(VARCHAR(34), HistoricalHashKey, 1) AS HistoricalHashKeyASCII  
  ,CONVERT(VARCHAR(34), UpdateHashKey, 1) as UpdateHashkeyAscii  
 FROM  
  <#=DstSchemaName#>.<#=DstTableName#>  
     where rowiscurrent = 1) [refTable]  
 where [refTable].[HistoricalHashKeyASCII] = ?</ParameterizedQuery>  
            <InputPath OutputPathName="CNT_Source_Rows.Output" />  
           <Inputs>  
             <Column SourceColumn="HistoricalHashKeyASCII" TargetColumn="HistoricalHashKeyASCII" />  
   </Inputs>  
           <Outputs>  
             <Column SourceColumn="ChangeHashKeyASCII" TargetColumn="lkp_ChangeHashKeyASCII" />  
                   <Column SourceColumn="UpdateHashkeyAscii" TargetColumn="lkp_UpdateHashKeyASCII" />  
   </Outputs>  
     </Lookup>  
         <DerivedColumns Name="NewItemRowStartDate">  
        <InputPath OutputPathName="CNT_New_Rows.Output" />  
          <Columns>  
         <Column ReplaceExisting="true" Name="RowStartDate" DataType="DateTime">  
          (DT_DBTIMESTAMP)"01/01/1900 00:00:00"  
         </Column>  
        </Columns>  
       </DerivedColumns>  
            <OleDbDestination Name="OLE_DST_New_Rows" ConnectionName="<#=DstConnection#>">  
           <InputPath OutputPathName="NewItemRowStartDate.Output" />  
            <ExternalTableOutput Table="<#=DstSchemaName#>.<#=DstTableName#>" />   
                <Columns>  
                   <Column SourceColumn="RowEndDate" IsUsed="false"/>  
                </Columns>  
     </OleDbDestination>  
         <RowCount Name="CNT_Source_Rows" VariableName="User.RowCountSource">  
           <InputPath OutputPathName="<#=DataFlowSourceName#>.Output" />  
     </RowCount>  
         <OleDbSource Name="<#=DataFlowSourceName#>" ConnectionName="<#=SrcConnection#>"> 
                <DirectInput><#=SourceQuery#></DirectInput>  
     </OleDbSource>  
         <RowCount Name="CNT_Unchanged_Rows" VariableName="User.RowCountUnchanged">  
           <InputPath OutputPathName="CSPL Check For Changes.Default" />  
     </RowCount>  
         <OleDbDestination Name="OLE_DST Update Table" ConnectionName="<#=DstConnection#>">   
           <InputPath OutputPathName="Identify Change Types.Output" />  
            <ExternalTableOutput Table="[<#=UpdateSchemaName#>].[<#=UpdateTableName#>]" />  
     </OleDbDestination>  
   </Transformations>  
        <PrecedenceConstraints>  
           <Inputs>  
              <Input OutputPathName="SQL Truncate <#=UpdateTableName#>.Output" />   
           </Inputs>  
        </PrecedenceConstraints>  
     </Dataflow>  
     <ExecuteSQL Name="SQL Begin Audit" ConnectionName="Audit">  
       <VariableInput VariableName="User.QueryAuditStart" />  
       <Parameters>  
         <Parameter Name="0" VariableName="System.PackageName" DataType="String" Length="-1" />  
         <Parameter Name="1" VariableName="System.PackageID" DataType="Guid" Length="-1" />  
         <Parameter Name="2" VariableName="ParentPackageID" DataType="Guid" Length="-1" />  
         <Parameter Name="3" VariableName="System.ServerExecutionID" DataType="Int64" Length="-1" />  
         <Parameter Name="4" VariableName="System.StartTime" DataType="Date" Length="-1" />  
   </Parameters>  
     </ExecuteSQL>  
     <ExecuteSQL Name="SQL Close Audit" ConnectionName="Audit">  
       <VariableInput VariableName="User.QueryAuditUpdate" />  
       <Parameters>  
         <Parameter Name="0" VariableName="System.PackageID" DataType="Guid" Length="-1" />  
         <Parameter Name="1" VariableName="System.ServerExecutionID" DataType="Int64" Length="-1" />  
         <Parameter Name="2" VariableName="User.RowCountNew" DataType="Int32" Length="-1" />  
         <Parameter Name="3" VariableName="User.RowCountChanged" DataType="Int32" Length="-1" />  
         <Parameter Name="4" VariableName="User.RowCountUnchanged" DataType="Int32" Length="-1" />  
   </Parameters>  
        <PrecedenceConstraints>  
           <Inputs>  
              <Input OutputPathName="SQL Update <#=DstTableName#>.Output" />  
           </Inputs>  
        </PrecedenceConstraints>  
     </ExecuteSQL>    
     <ExecuteSQL Name="SQL Truncate <#=UpdateTableName#>" ConnectionName="<#=UpdateConnection#>">  
       <DirectInput>Truncate Table [<#=UpdateSchemaName#>].[<#=UpdateTableName#>]</DirectInput>  
        <PrecedenceConstraints>  
           <Inputs>  
              <Input OutputPathName="SQL Begin Audit.Output" />  
           </Inputs>  
        </PrecedenceConstraints>  
     </ExecuteSQL>  
     <ExecuteSQL Name="SQL Update <#=DstTableName#>" ConnectionName="<#=DstConnection#>">  
          <DirectInput><#=UpdateSQLStatement#></DirectInput>  
        <PrecedenceConstraints>  
           <Inputs>  
              <Input OutputPathName="DFT_Insert<#=DstTableName#>.Output" /> 
           </Inputs>  
        </PrecedenceConstraints>  
     </ExecuteSQL>  
   </Tasks>  
 </Package>  

Create Dim2Hybrid.Biml

 

 <#@ template language="C#" hostspecific="true" #>  
 <Biml xmlns="http://schemas.varigence.com/biml.xsd">  
 <#@ include file="ProjectConnection.biml" #>  
 <!--  
    <#  
    string PackageName = "LoadDimRoute";  
    string DstSchemaName = "dbo";  
   string DstTableName = "DimRoute";  
    string DstConnection = "AWBIML";     
    string DataFlowSourceName = "OLE_SRC_StgDimRoute";  
    string SrcConnection = "AWBIML";  
    string SourceQuery = @"Select RouteID, RouteDescription, Supervisor, RouteType, Warehouse,   
   RouteArea, RowIsCurrent   
  , HistoricalHashKey, ChangeHashKey, UpdateHashKey   
  , HistoricalHashKeyASCII   
  , ChangeHashKeyASCII   
  , UpdateHashKeyASCII   
  , RowEndDate  
  , RowStartDate    
  FROM [dbo].[StgDimRoute];" ;  
    string UpdateSchemaName = "Updt";     
    string UpdateTableName = "UpdtRoute";  
    string UpdateConnection = "AWBIML";     
 string UpdateSQLStatement = @" UPDATE A    
  SET A.RouteArea = UA.RouteArea,    
    A.UpdateHashKey = UA.UpdateHashKey,    
    A.UpdtDtTm = CURRENT_TIMESTAMP    
  FROM Updt.UpdtRoute AS UA    
    JOIN Dbo.DimRoute AS A    
    ON UA.HistoricalHashKey = A.HistoricalHashKey    
  WHERE UpdateInPlace = 1;   
  UPDATE A    
  SET RowIsCurrent = 0,    
    A.RowEndDate = UA.RowEndDate,    
    A.UpdtDtTm = CURRENT_TIMESTAMP    
  FROM Updt.UpdtRoute UA    
    JOIN Dbo.DimRoute AS A    
    ON UA.HistoricalHashKey = A.HistoricalHashKey    
  WHERE A.RowIsCurrent = 1    
    AND UA.UpdateNewRow = 1;    
  INSERT INTO Dbo.DimRoute    
     (RouteID,    
     RouteDescription,    
     Supervisor,    
     RouteType,    
     Warehouse,    
     RouteArea,    
     RowIsCurrent,    
     RowStartDate,    
     RowEndDate,    
     HistoricalHashKey,    
     ChangeHashKey,    
     UpdateHashKey)    
  SELECT RouteID,    
     RouteDescription,    
     Supervisor,    
     RouteType,    
     Warehouse,    
     RouteArea,    
     1,    
     RowStartDate,    
     null,    
     HistoricalHashKey,    
     ChangeHashKey,    
     UpdateHashKey  
  FROM Updt.UpdtRoute WHERE UpdateNewRow = 1;" ;     
 #>  
 -->  
   <Packages>  
     <#=CallBimlScript("Dim2Hybrid.biml", PackageName, DstSchemaName, DstTableName, DstConnection, DataFlowSourceName, SrcConnection, SourceQuery, UpdateSchemaName, UpdateTableName, UpdateConnection, UpdateSQLStatement)#>  
   </Packages>  
 </Biml>  

Once I have my source view, dimension table, and update table in the database, the 3 Biml files added to my project, and BIDSHelper installed, all I have to do is right click on the CreateDim2Hybrid.Biml file and choose Generate SSIS packages to create my package.

BIDS Helper, Biml, Microsoft Technologies, SSIS

Demystifying the Type 2 Slowly Changing Dimension with Biml

Most data warehouses have at least a couple of Type 2 Slowly Changing Dimensions. We use them to keep history so we can see what an entity looked like at the time an event occurred. From an ETL standpoint, I think Type 2 SCDs are the most commonly over-complicated and under-optimized design pattern I encounter.

There is a Slowly Changing Dimension Transformation built into SSIS, but most people recommend against using it as it isn’t very efficient.  I think many of the people that do use it do so simply because they feel it’s easier than digging in and understanding the operations that need to be done in order to roll your own Type 2 SCD processing.The most common mistake I see in SCD 2 packages, whether using the built-in transformation or creating your own data flow, is that people use OLEDB commands to perform updates one row at a time rather than writing updates to a staging table and performing a set-based update on all rows.  If your dimension is small, the performance from row by row updates may be acceptable, but the overhead associated with using a staging table and performing set-based update will probably be negligible. So why not keep a consistent pattern for all type 2 dimensions and require no changes if the dimension grows?

I am here to encourage you: don’t be scared of slowly changing dimensions. Once you learn the design pattern, you will find they aren’t that difficult to understand. And once you have the Biml, creating them is fairly painless.

My Design Pattern

BimlScript allows me to create a reusable design pattern for SSIS that I can employ for each Type 2 dimension that I create. Despite the need to keep history, my Type 2 SCD doesn’t look that much different from my Type 1 SCD. In fact the control flow is exactly the same.

I start with an audit step to log the beginning of package execution. Then I truncate the staging table used to contain my updates. Next I have the Data Flow Task to insert new rows into the dimension table and updates into my update table. Then I run an Execute SQL Task to perform the updates. Finally I log the end of package execution.

The Data Flow Task looks a like a Type 1 SCD until the end, where I have a Derived Column Transformation before I insert new rows into my dimension table.

As with my Type 1 pattern, the combination and transformation of the data (the business logic) is performed in a view in SQL Server, and the mechanics of capturing the history is performed in SSIS.

The steps in this data flow task are:

  1. Retrieve data from my source view.
  2. Count the rows for package logging purposes.
  3. Perform a lookup to see if the entity already exists in the dimension table (more info on that below).
  4. If the entity doesn’t exist at all in the dimension table, it goes into the left path where I count the number of rows, add a derived column that sets the row start date to “01/01/1900 00:00:00”, and then insert the row into the dimension table.
  5. If the entity does exist in the table, I check it for changes.
  6. If there are changes to the entity, I count the number of rows and then insert the row into an update table.
  7. Entities with no changes are simply counted for audit purposes.

 The Details

Source View with Hashkeys

I’m using product data from the AdventureWorks database for an example.  I create a view based upon my staging table that holds the data from the source system. In addition to providing the data values from the source data, my view does the following:

  • Adds in the unknown row
  • Adds the hashkeys for change detection
  • Sets the row start date
 CREATE VIEW [dbo].[StgProduct2]  
      AS   
with Productbase as (     
 SELECT [productid],   
     [name],   
     [productnumber],   
     [makeflag],   
     [finishedgoodsflag],   
     [color],   
     [safetystocklevel],   
     [reorderpoint],   
     [standardcost],   
     [listprice],   
     [size],   
     [sizeunitmeasurecode],   
     [weightunitmeasurecode],   
     [weight],   
     [daystomanufacture],   
     [productline],   
     [class],   
     [style],   
     [sellstartdate],   
     [sellenddate],   
     [discontinueddate],   
           1 as RowIsCurrent,   
           0 as isDeleted  
  FROM  [Staging].[Product]  
 UNION   
  Select -1 as productid, 'unknown' as name, null as productnumber, null as makeflag, 
   null as finishedgoodsflag, null as color, null as safetystocklevel,   
   null as reorderpoint, null as standardcost, null as listprice, null as size, 
   null as sizeunitmeasurecode, null as weightunitmeasurecode, null as weight,   
   null as daystomanufacture, null as productline, null as class, null as style, 
   null as sellstartdate, null as sellenddate, null as discontinueddate,   
   1 as RowIsCurrent, 0 as isDeleted          
        ) ,  
productdata as   
 (  
 Select [productid], [name], [productnumber], [makeflag], [finishedgoodsflag], [color], 
 [safetystocklevel], [reorderpoint], [standardcost], [listprice], [size], 
 [sizeunitmeasurecode], [weightunitmeasurecode], [weight], [daystomanufacture], 
 [productline], [class], [style], [sellstartdate], [sellenddate], [discontinueddate], 
 rowiscurrent, isdeleted  
  , CONVERT(VARBINARY(20), HASHBYTES('MD5', CONCAT([ProductID], ' '))) AS HistoricalHashKey  
  , CONVERT(VARBINARY(20), HASHBYTES('MD5', CONCAT([name], [productnumber], [makeflag], 
 [finishedgoodsflag], [color], [safetystocklevel], [reorderpoint], [standardcost], [listprice], 
 [size], [sizeunitmeasurecode], [weightunitmeasurecode], [weight], [daystomanufacture], 
 [productline], [class], [style], [sellstartdate], [sellenddate], [discontinueddate],
 rowiscurrent, isDeleted ))) AS ChangeHashKey  
  from Productbase sb  
  )   
Select [productid], [name], [productnumber], [makeflag], [finishedgoodsflag], [color], 
 [safetystocklevel], [reorderpoint], [standardcost], [listprice], [size], [sizeunitmeasurecode],  
 [weightunitmeasurecode], [weight], [daystomanufacture], [productline], [class], [style],  
 [sellstartdate], [sellenddate], [discontinueddate], rowiscurrent, isdeleted  
  ,HistoricalHashKey, ChangeHashKey  
  , CONVERT(VARCHAR(34), HistoricalHashKey, 1) AS HistoricalHashKeyASCII  
  , CONVERT(VARCHAR(34), ChangeHashKey, 1) AS ChangeHashKeyASCII  
  , dateadd(MINUTE, -1, current_timestamp) as rowenddate, CURRENT_TIMESTAMP as rowstartdate   
  from productdata ds  

The Change Detection Lookup

The lookup in my DFT compares the HistoricalHashKeyASCII column from the source view with the varchar version of the HistoricalHashKey from the dimension table and adds the lookup column lkp_ChangeHashKeyASCII to the data set.

 

lookup hashkey

Rows that do not match are new rows, as in that product has never been in the dimension table before.  Rows that do match have a row in the dimension table and will then be evaluated to see if there are any changes in the values for that product.

Derived Column for New Rows

The no match output of the lookup are new rows for products that are not in the dimension table. Since this is the first row in the table for that product, we want this row to be effective from the beginning of time until the end of time. The beginning of time in the data warehouse is 01/01/1900. Since I often load the data multiple times per day, I use a date/time field rather than a date. If you only need the precision of a day, you can cut your row end date/time back to just a date. In my pattern, the current row has a null row end date, but you could easily add a derived column to set the end date to 12/31/9999 if you prefer.

Derived Column for Start Date

Conditional Split to Check Existing Products For Changes

For existing products, we only want to update the products for which some attribute has changed. We check for changes using a conditional split. When we performed the lookup against the HistoricalHashKeyASCII value, we added the lookup column lkp_ChangeHashKeyASCII from the dimension table.

The match output from the lookup flows into the conditional split. Then we check to see if there is a difference between the change hash from the source view and the change hash in the dimension table that we have added to our data set.

Conditional Split To Detect Changes

If the change hash values are equal the row hasn’t changed and we can pass it on to get the count of unchanged rows. If the change hash values are not equal, we know some column will have a value to update and we can pass it on to the update staging table.

Update Commands

After the data flow task completes, there is an Execute SQL Task to update the changed rows.  There are two commands in the task:

  1. Set the row end date and row is current flag on all existing rows in the dimension for which there is an update.
  2. Insert the new rows for the changed products.
 UPDATE A   
  SET  RowisCurrent = 0,   
    A.rowenddate = UA.rowenddate,   
    A.updatedttm = CURRENT_TIMESTAMP   
  FROM  [Updt].UpdtProduct2 AS UA   
    JOIN dbo.Product2 AS A   
    ON UA.historicalhashkey = A.historicalhashkey   
  WHERE A.rowiscurrent = 1;   
  
INSERT INTO dbo.Product2   
    ([productid]  
     ,[name]  
     ,[productnumber]  
     ,[makeflag]  
     ,[finishedgoodsflag]  
     ,[color]  
     ,[safetystocklevel]  
     ,[reorderpoint]  
     ,[standardcost]  
     ,[listprice]  
     ,[size]  
     ,[sizeunitmeasurecode]  
     ,[weightunitmeasurecode]  
     ,[weight]  
     ,[daystomanufacture]  
     ,[productline]  
     ,[class]  
     ,[style]  
     ,[sellstartdate]  
     ,[sellenddate]  
     ,[discontinueddate]  
     ,[rowiscurrent]  
     ,[isdeleted]  
     ,[HistoricalHashKey]  
     ,[ChangeHashKey]  
     ,[rowenddate]  
     ,[rowstartdate])   
  SELECT [productid]  
     ,[name]  
     ,[productnumber]  
     ,[makeflag]  
     ,[finishedgoodsflag]  
     ,[color]  
     ,[safetystocklevel]  
     ,[reorderpoint]  
     ,[standardcost]  
     ,[listprice]  
     ,[size]  
     ,[sizeunitmeasurecode]  
     ,[weightunitmeasurecode]  
     ,[weight]  
     ,[daystomanufacture]  
     ,[productline]  
     ,[class]  
     ,[style]  
     ,[sellstartdate]  
     ,[sellenddate]  
     ,[discontinueddate]  
     ,[rowiscurrent]  
     ,[isdeleted]  
     ,[HistoricalHashKey]  
     ,[ChangeHashKey]  
     ,null  
     ,[rowstartdate]   
  FROM  updt.updtProduct2;  

The BimlScript

I have a Dim2.Biml file that just contains the design pattern with code blocks that allow me to pass parameters to it for the particular dimension I’m making. I also have a Biml file that contains the project level connection managers. Then I have a Biml file that obtains the package specific parameters and passes them to the Dim2.Biml file.

ProjectConnections.biml

 <#@ template language="C#" tier="1" #>       
 <Connections>  
  <OleDbConnection Name="AWBIML" ConnectionString ="Data Source=localhost\SQL2014;  
   Initial Catalog=AWBIML;Integrated Security=SSPI;Provider=SQLNCLI11.1;" CreateInProject="true"/>  
  <OleDbConnection Name="Audit" ConnectionString ="Data Source=localhost\SQL2014;  
   Initial Catalog=AWBIML;Integrated Security=SSPI;Provider=SQLNCLI11.1;" CreateInProject="true"/>       
 </Connections>  

The ProjectConnections file doesn’t begin with Biml tags because it gets included in the middle of the CreateDim1.biml file. AWBIML is a local database I created to contain my data mart.

Dim2.biml

This is the file that contains my design pattern.

 <#@ template language="C#" tier="2" #>  
   
 <#@ property name="PackageName" type="String" #>  
 <#@ property name="SourceConnection" type="String" #>  
 <#@ property name="DestinationConnection" type="String" #>  
 <#@ property name="DestinationSchemaName" type="String" #>  
 <#@ property name="DestinationTableName" type="String" #>  
 <#@ property name="DestinationTable" type="String" #>  
 <#@ property name="DataFlowName" type="String" #>  
 <#@ property name="DataFlowSourceName" type="String" #>  
 <#@ property name="DataFlowQuery" type="String" #>       
 <#@ property name="QueryOutputPathName" type="String" #>  
 <#@ property name="DestinationName" type="String" #>  
 <#@ property name="UpdateSchemaName" type="String" #>       
 <#@ property name="UpdateTableName" type="String" #>       
 <#@ property name="UpdateConnection" type="String" #>            
 <#@ property name="UpdateSQLStatement" type="String" #>       
        
        
   <Package Name="<#=PackageName#>" Language="None">  
   <Parameters>  
     <Parameter DataType="String" Name="ParentPackageID">00000000-0000-0000-0000-000000000000</Parameter>  
   </Parameters>  
   <Variables>  
     <Variable EvaluateAsExpression="true" DataType="String" IncludeInDebugDump="Exclude" Name="QualifiedTableSchema">"[" + @[User::SchemaName] + "].[" + @[User::TableName] + "]"</Variable>  
     <Variable DataType="String" IncludeInDebugDump="Exclude" Name="QueryAuditStart">EXECUTE [Audit].[PackageControlStart] @PackageName = ?, @PackageId = ?, @ParentPackageId = ?, @ExecutionId = ?, @StartTime = ?;</Variable>  
     <Variable DataType="String" IncludeInDebugDump="Exclude" Name="QueryAuditUpdate">EXECUTE [Audit].[PackageControlStop] @PackageId = ?, @ExecutionId = ?, @InsertRowQuantity = ?, @UpdateRowQuantity = ?, @UnchangedRowQuantity=?;</Variable>  
     <Variable DataType="Int32" IncludeInDebugDump="Include" Name="RowCountChanged">0</Variable>  
     <Variable DataType="Int32" IncludeInDebugDump="Include" Name="RowCountNew">0</Variable>  
     <Variable DataType="Int32" IncludeInDebugDump="Include" Name="RowCountSource">0</Variable>  
     <Variable DataType="Int32" IncludeInDebugDump="Include" Name="RowCountUnchanged">0</Variable>  
             
     <Variable DataType="String" IncludeInDebugDump="Exclude" Name="SchemaName"><#=DestinationSchemaName#></Variable>  
             
     <Variable DataType="String" IncludeInDebugDump="Exclude" Name="TableName"><#=DestinationTableName#></Variable>  
   </Variables>  
   <Tasks>  
             
     <Dataflow Name="DFT_Insert<#=DestinationTableName#>">  
       <Transformations>  
         <RowCount Name="CNT_Changed_Rows" VariableName="User.RowCountChanged">  
           <InputPath OutputPathName="CSPL Check For Changes.ChangedRows" />  
     </RowCount>  
         <ConditionalSplit Name="CSPL Check For Changes">  
           <InputPath OutputPathName="LKP Historical Key.Match" />  
           <OutputPaths>  
             <OutputPath Name="ChangedRows">  
               <Expression>ChangeHashKeyASCII != lkp_ChangeHashKeyASCII</Expression>  
     </OutputPath>  
   </OutputPaths>  
     </ConditionalSplit>  
         <RowCount Name="CNT_New_Rows" VariableName="User.RowCountNew">  
           <InputPath OutputPathName="LKP Historical Key.NoMatch" />  
     </RowCount>  
         <Lookup Name="LKP Historical Key" NoMatchBehavior="RedirectRowsToNoMatchOutput" OleDbConnectionName="<#=DestinationConnection#>">  
               <DirectInput>SELECT  
  CONVERT(VARCHAR(34), ChangeHashKey, 1) AS ChangeHashKeyASCII  
 , CONVERT(VARCHAR(34), HistoricalHashKey, 1) AS HistoricalHashKeyASCII  
 FROM  
  <#=DestinationSchemaName#>.<#=DestinationTableName#>  
  where RowIsCurrent = 1  
  </DirectInput>  
           <Parameters>  
             <Parameter SourceColumn="HistoricalHashKeyASCII" />  
   </Parameters>  
              <ParameterizedQuery>select * from (SELECT  
  CONVERT(VARCHAR(34), ChangeHashKey, 1) AS ChangeHashKeyASCII  
 , CONVERT(VARCHAR(34), HistoricalHashKey, 1) AS HistoricalHashKeyASCII  
 FROM  
  <#=DestinationSchemaName#>.<#=DestinationTableName#>) [refTable]  
 where [refTable].[HistoricalHashKeyASCII] = ?</ParameterizedQuery>  
              <InputPath OutputPathName="CNT_Source_Rows.Output" />  
           <Inputs>  
             <Column SourceColumn="HistoricalHashKeyASCII" TargetColumn="HistoricalHashKeyASCII" />  
   </Inputs>  
           <Outputs>  
             <Column SourceColumn="ChangeHashKeyASCII" TargetColumn="lkp_ChangeHashKeyASCII" />  
   </Outputs>  
     </Lookup>  
                      <DerivedColumns Name="NewItemRowStartDate">  
        <InputPath OutputPathName="CNT_New_Rows.Output" />  
                <Columns>  
         <Column ReplaceExisting="true" Name="RowStartDate" DataType="DateTime">  
          (DT_DBTIMESTAMP)"01/01/1900 00:00:00"  
         </Column>  
        </Columns>  
       </DerivedColumns>  
         <OleDbDestination Name="OLE_DST_New_Rows" ConnectionName="<#=DestinationConnection#>">  
           <InputPath OutputPathName="NewItemRowStartDate.Output" />  
              <ExternalTableOutput Table="<#=DestinationSchemaName#>.<#=DestinationTableName#>" />   
                               <Columns>  
                               <Column SourceColumn="RowEndDate" IsUsed="false"/>  
                          </Columns>  
     </OleDbDestination>  
         <RowCount Name="CNT_Source_Rows" VariableName="User.RowCountSource">  
           <InputPath OutputPathName="<#=DataFlowSourceName#>.Output" />  
     </RowCount>  
         <OleDbSource Name="<#=DataFlowSourceName#>" ConnectionName="<#=SourceConnection#>">        
                          <DirectInput><#=DataFlowQuery#></DirectInput>  
     </OleDbSource>  
         <RowCount Name="CNT_Unchanged_Rows" VariableName="User.RowCountUnchanged">  
           <InputPath OutputPathName="CSPL Check For Changes.Default" />  
     </RowCount>  
         <OleDbDestination Name="OLE_DST Update Table" ConnectionName="<#=DestinationConnection#>">  
           <InputPath OutputPathName="CNT_Changed_Rows.Output" />  
              <ExternalTableOutput Table="[<#=UpdateSchemaName#>].[<#=UpdateTableName#>]" />  
     </OleDbDestination>  
   </Transformations>  
          <PrecedenceConstraints>  
               <Inputs>  
                    <Input OutputPathName="SQL Truncate <#=UpdateTableName#>.Output" />   
               </Inputs>  
          </PrecedenceConstraints>  
     </Dataflow>  
     <ExecuteSQL Name="SQL Begin Audit" ConnectionName="Audit">  
       <VariableInput VariableName="User.QueryAuditStart" />  
       <Parameters>  
         <Parameter Name="0" VariableName="System.PackageName" DataType="String" Length="-1" />  
         <Parameter Name="1" VariableName="System.PackageID" DataType="Guid" Length="-1" />  
         <Parameter Name="2" VariableName="ParentPackageID" DataType="Guid" Length="-1" />  
         <Parameter Name="3" VariableName="System.ServerExecutionID" DataType="Int64" Length="-1" />  
         <Parameter Name="4" VariableName="System.StartTime" DataType="Date" Length="-1" />  
   </Parameters>  
     </ExecuteSQL>  
     <ExecuteSQL Name="SQL Close Audit" ConnectionName="Audit">  
       <VariableInput VariableName="User.QueryAuditUpdate" />  
       <Parameters>  
         <Parameter Name="0" VariableName="System.PackageID" DataType="Guid" Length="-1" />  
         <Parameter Name="1" VariableName="System.ServerExecutionID" DataType="Int64" Length="-1" />  
         <Parameter Name="2" VariableName="User.RowCountNew" DataType="Int32" Length="-1" />  
         <Parameter Name="3" VariableName="User.RowCountChanged" DataType="Int32" Length="-1" />  
         <Parameter Name="4" VariableName="User.RowCountUnchanged" DataType="Int32" Length="-1" />  
   </Parameters>  
          <PrecedenceConstraints>  
               <Inputs>  
                    <Input OutputPathName="SQL Update <#=DestinationTableName#>.Output" />   
               </Inputs>  
          </PrecedenceConstraints>  
     </ExecuteSQL>  
     <ExecuteSQL Name="SQL Truncate <#=UpdateTableName#>" ConnectionName="<#=UpdateConnection#>">   
       <DirectInput>Truncate Table [<#=UpdateSchemaName#>].[<#=UpdateTableName#>]</DirectInput>  
          <PrecedenceConstraints>  
               <Inputs>  
                    <Input OutputPathName="SQL Begin Audit.Output" />  
               </Inputs>  
          </PrecedenceConstraints>  
     </ExecuteSQL>  
   
     <ExecuteSQL Name="SQL Update <#=DestinationTableName#>" ConnectionName="<#=DestinationConnection#>">  
                <DirectInput><#=UpdateSQLStatement#></DirectInput>  
          <PrecedenceConstraints>  
               <Inputs>  
                    <Input OutputPathName="DFT_Insert<#=DestinationTableName#>.Output" />   
               </Inputs>  
          </PrecedenceConstraints>  
     </ExecuteSQL>  
   </Tasks>  
 </Package>  
   

CreateDim2Packages.Biml

This file simply feeds values into the code nuggets in the Dim2.biml file. In practice, I store these values in a table and change this file to pull the values from a table so when I generate SSIS packages from this file, all my type 2 dimension packages are generated rather than just one.

 <#@ template language="C#" hostspecific="true" #>  
 <#@ import namespace="System.Data" #>  
 <#@ import namespace="System.Data.SqlClient" #>  
 <#@ import namespace="System.IO" #>  
   
 <Biml xmlns="http://schemas.varigence.com/biml.xsd">  
             
 <!--  
      <#  
      string PackageName = "LoadDimProduct2";  
      string SourceConnection = "AWBIML";  
      string DestinationConnection = "AWBIML";       
      string DestinationSchemaName = "dbo";  
      string DestinationTableName = "Product2";  
      string DestinationTable = "dbo.Product2";  
      string DataFlowName = "DFT_Product2";  
      string DataFlowSourceName = "OLE_SRC_StgProduct";  
      string QueryOutputPathName = "OLE_SRC_StgProduct.Output";  
      string DestinationName = "OLE_DST_Product2";  
      string UpdateSchemaName = "Updt";  
      string UpdateTableName = "UpdtProduct2";  
      string UpdateConnection = "AWBIML";  
      string DataFlowQuery = @"SELECT [productid]  
                                     ,[name]  
                                     ,[productnumber]  
                                     ,[makeflag]  
                                     ,[finishedgoodsflag]  
                                     ,[color]  
                                     ,[safetystocklevel]  
                                     ,[reorderpoint]  
                                     ,[standardcost]  
                                     ,[listprice]  
                                     ,[size]  
                                     ,[sizeunitmeasurecode]  
                                     ,[weightunitmeasurecode]  
                                     ,[weight]  
                                     ,[daystomanufacture]  
                                     ,[productline]  
                                     ,[class]  
                                     ,[style]  
                                     ,[sellstartdate]  
                                     ,[sellenddate]  
                                     ,[discontinueddate]  
                                     ,[rowiscurrent]  
                                     ,[isdeleted]  
                                     ,[HistoricalHashKey]  
                                     ,[ChangeHashKey]  
                                     ,[HistoricalHashKeyASCII]  
                                     ,[ChangeHashKeyASCII]  
                                     ,[rowenddate]  
                                     ,[rowstartdate]  
                                FROM [dbo].[StgProduct2];" ;  
           string UpdateSQLStatement = @"UPDATE A   
           SET  RowisCurrent = 0,   
                  A.rowenddate = UA.rowenddate,   
                  A.updatedttm = CURRENT_TIMESTAMP   
           FROM  [Updt].UpdtProduct2 AS UA   
                  JOIN dbo.Product2 AS A   
                      ON UA.historicalhashkey = A.historicalhashkey   
           WHERE A.rowiscurrent = 1;   
   
           INSERT INTO dbo.Product2   
                          ([productid]  
                       ,[name]  
                       ,[productnumber]  
                       ,[makeflag]  
                       ,[finishedgoodsflag]  
                       ,[color]  
                       ,[safetystocklevel]  
                       ,[reorderpoint]  
                       ,[standardcost]  
                       ,[listprice]  
                       ,[size]  
                       ,[sizeunitmeasurecode]  
                       ,[weightunitmeasurecode]  
                       ,[weight]  
                       ,[daystomanufacture]  
                       ,[productline]  
                       ,[class]  
                       ,[style]  
                       ,[sellstartdate]  
                       ,[sellenddate]  
                       ,[discontinueddate]  
                       ,[rowiscurrent]  
                       ,[isdeleted]  
                       ,[HistoricalHashKey]  
                       ,[ChangeHashKey]  
                       ,[rowenddate]  
                       ,[rowstartdate])   
           SELECT          [productid]  
                       ,[name]  
                       ,[productnumber]  
                       ,[makeflag]  
                       ,[finishedgoodsflag]  
                       ,[color]  
                       ,[safetystocklevel]  
                       ,[reorderpoint]  
                       ,[standardcost]  
                       ,[listprice]  
                       ,[size]  
                       ,[sizeunitmeasurecode]  
                       ,[weightunitmeasurecode]  
                       ,[weight]  
                       ,[daystomanufacture]  
                       ,[productline]  
                       ,[class]  
                       ,[style]  
                       ,[sellstartdate]  
                       ,[sellenddate]  
                       ,[discontinueddate]  
                       ,[rowiscurrent]  
                       ,[isdeleted]  
                       ,[HistoricalHashKey]  
                       ,[ChangeHashKey]  
                       ,null  
                       ,[rowstartdate]   
           FROM  updt.updtProduct2;" ;  
        
      #>  
      -->  
                <#@ include file="ProjectConnection.biml" #>  
                            
                     <Packages>  
                          <#=CallBimlScript("Dim2.biml", PackageName, SourceConnection, DestinationConnection, DestinationSchemaName, DestinationTableName, DestinationTable, DataFlowName, DataFlowSourceName, DataFlowQuery, QueryOutputPathName, DestinationName, UpdateSchemaName, UpdateTableName, UpdateConnection, UpdateSQLStatement)#>  
                     </Packages>  
   
 </Biml>  

Get the Biml

You can download the Biml and SQL files necessary to use this pattern here.  This zip file contains:

  • 3 Biml files (ProjectConnections, Dim2, CreateDim2Packages)
  • 1 Word doc with a brief explanation of the files and design pattern
  • 8 SQL files to make the schemas, tables, views, and stored procedures for the product dimension.

This pattern is for a pure Type 2 SCD. I’ll post again soon with a pattern that handles hybrid Type 2 and Type 6 SCDs.

For more info on Biml, see my other Biml posts or check out BimlScript.com.

Data Visualization, Excel

KCStat Chart Makeover

I live in Kansas City, and I like to be aware of local events and government affairs. KC has a program called KCStat, which monitors the city’s progress toward its 5-year city-wide business plan. As part of this program, the Mayor and City Manager moderate a KCStat meeting each month, and the conversation and data are live tweeted.  I love the program as well as all of the open data available on Data.KCMO.org. But as a data viz enthusiast, I cringe at some of the charts that are tweeted out from KCStat.  I wish that Kansas City would let me fix their charts – or better yet, let me come and teach a data viz class and facilitate a chart makeover session (if anyone from the city of KCMO is reading this, I would gladly donate my time to do this).

This week, a particular chart from KCStat caught my eye.

Story

There is an interesting back story to this chart. KCMO invited citizens to attend Citizen Work Sessions, in which they organized into small groups and played a game to learn about the city budget and provide feedback. In the first scenario of the game, each group was asked to allocate a $100 budget across the city’s five major goals (as laid out in KC’s Citywide Business Plan) without reference to specific programs within the goals. In the second scenario, they were given a total budget of $1000 allocated to match the current city budget and were asked to identify the top 3 – 5 priority programs under each goal and then cut $50 from individual programs (as opposed to just the broad goals).  The chart from the tweet represents the results of the game.

Chart Design

The table shows the starting allocation and the average final allocation after the $50 cut (averaged because multiple groups performed this exercise). The line in the chart is tied to the starting allocation, and there are several markers that show the values assigned by the individual groups  after the budget cut (which would equal the ending final average amount when averaged together).

For me, this chart is ineffective and cluttered.

  • It’s using a line chart for categorical comparisons. With rare exceptions for multivariate analysis, slope charts, and maybe a few other situations, line charts are best used to show trends over time.
  • The markers are all bunched together to the point that they really aren’t useful, especially given the scale of the y-axis and the size of the markers. It would be nice to see the individual markers if there were any dramatic outliers, but that is not the case with this data.
  • The placement of the table in the middle of the chart between the axis labels and the data points is distracting. If you include axis labels rather than directly labeling, it is because you want to allow users to see the scale and follow a label across to a data point. But the line your eyes follow with this chart is interrupted by the big blue table in the middle of the graph.
  • I believe the table column header for “Ending Average Final (after cutting $50m)” is mislabeled. It should be after cutting $50, since we are dealing with only $1000 to begin with. I thought maybe it was accidentally scaled to the city’s real budget amount, but it looks like the Governmental Activities budget is about $953 million so that didn’t make sense.  Either way, it confused me and then I decided it was mistyped. Especially with this kind of subject matter (budgets and government communication to the public), you want to be as clear as possible and review your charts to remove any possible confusion.
  • The data in the chart is the least emphasized thing in the image. The thing that is most eye-catching is the header row of the table or perhaps the graph title due to the use of bold colors.
  • Goal areas are abbreviated in the table and wrapped on multiple lines in the graph x-axis.
  • This chart could really use some context. I had to dig through the presentation and the city’s open data to understand the situation that explains this chart. A few sentences to go with it would have been nice so readers know what they are looking at.

So I decided to take this image and redo it with better design decisions. I ended up with two different versions. I used Excel to create these makeovers. I’m sure I’ll do future makeovers with Power BI or Tableau, but I think Excel worked just fine for this, and is probably what was used to make the original data viz.

Version 1

KCStat Chart Makeover Version 1 (click to enlarge)

First, I added some explanatory text for context to help communicate my message. If you don’t know about the Citizen Work Sessions, this chart might be confusing. The note about the shift in spending to public safety came from the FY 2016 – 2021 Citywide Business Plan. I thought it was important context to note as city council members and government officials have hard decisions to make when it comes to the budget.

So my data viz communicates two things:

  • explains the allocation of budget across goal areas
  • notes that there was a budget cut and shows the impact on the budget and allocation

You’ll notice in this version I used data bars embedded in my table rather than embedding a table inside of my chart.  Microcharts can be a good way to visually represent data while still providing detailed numbers. While tables primarily interact with our verbal system, graphs interact with our visual system to give us a quick picture of how different data points are related. The data bars form a horizontal bar chart, which is a great way to show a categorical comparison. Here I’m comparing the initial budget amount allocated for each goal area. My rows are ordered by descending value of the budget amounts.

Also, because I now have my labels on rows, I have plenty of room to type the full name of the goal area without abbreviating or wrapping.

I added the percent cut data because after seeing the numbers, I immediately start doing the math of (499 – 515)/515 to understand the difference. So I just saved everyone some time and showed it in my table. The absolute numbers are definitely useful, but it’s also important to understand the impact of the cut to each goal. A 16% cut could be devastating to an already small budget. I used conditional highlighting to help emphasize the impact of the cut, so the largest cut has the most intense color.

KCMO didn’t provide the individual data points for the markers, so I couldn’t use the individual data points in my chart makeover. But I think I might have chosen not to show the individual data points anyway because the value that they add is not worth the space they take up in the visualization. And they’re not core to my message. What I’m trying to show is where people generally made cuts when faced with that decision. I don’t have to show all the data points just because I have them.

The strategic use of color draws your eyes to the important parts of the data (the allocation to goals and the effect of the budget cut) rather than the box around the chart tile or the table column header like it did in the original visualization.

Version 2

KCStat Chart Makeover Version 2 (click to enlarge)

I used the same explanatory text in this version. This time, I created a clustered horizontal bar chart and directly labeled the bars, removing the need to show a table. This chart shows me two things: the allocation of the budgets across goals and the and the difference between the original budget and the average budget after cuts. This time I’m not focusing on the impact of the cut to the total budget for the goal, just visually showing where cuts happened and allowing the reader to make the comparison with the bars.

Again,  I switched to a horizontal bar char so I have a good categorical comparison and my labels have plenty of room and don’t have to be wrapped or abbreviated. I put the legend at the top left so you have that information before going to read the chart (our eyes tend to move across the chart in a “z”).

And again your eyes are drawn toward the color first, which is where the important data is located.

Data Visualization Critiques & Makeovers

Feel free to leave comments on how you would improve my chart makeovers. I would appreciate the feedback.

If you like chart makeovers, there are several sites that critique and redesign data visualizations. Here are some of my favorites: