Category: Uncategorized

Copying data from On Prem SQL to ADLS with ADF and Biml – Part 1

Apologies for the overly acronym-laden title as I was trying to keep it concise but descriptive. And we all know that adding technologies to your repertoire means adding more acronyms.

My coworker Levi and I are working on a project where we copy data from an on-premises SQL Server 2014 database and land it in Azure Data Lake Store. Then we use Polybase to get the data into Azure SQL Data Warehouse and build a dimensional model. I’ve done a couple of small projects before with Azure Data Factory, but nothing as large as this one. We had 173 tables that we needed to copy to ADLS. Then we needed to set up incremental loads for 95 of those tables going forward.

My Azure Data Factory is made up of the following components:

  • Gateway – Allows ADF to retrieve data from an on premises data source
  • Linked Services – define the connection string and other connection properties for each source and destination
  • Datasets – Define a pointer to the data you want to process, sometimes defining the schema of the input and output data
  • Pipelines – combine the data sets and activities and define an execution schedule

Each of these objects is defined in a JSON file. Defining data sets and copy activities in JSON gets very tedious, especially when you need to do this for 100+ tables. Tedium usually indicates a repeatable pattern. If there is a repeatable pattern you can probably automate it. The gateway and linked services are one-time setup activities that weren’t worth automating for this project, but the datasets and pipelines definitely were.

In order to automate the generation of datasets and pipelines, we need a little help with some metadata. We had the client help us fill out an Excel spreadsheet that listed each table in our source database and the following characteristics relevant to the load to Azure:

  • Frequency (daily or hourly)
  • Changes Only (incremental or full load)
  • Changed Time Column (datetime column used for incremental loads)

That list plus the metadata we retrieved from SQL server for each table (column names and data types) were all we needed to automate the creation of the ADF datasets and pipelines with BimlScript.

This post will show how we built the data sets. The following post will show the pipelines with the copy activities.

First we need to generate the input datasets coming from SQL Server. We added some properties at the top and embedded some code nuggets to handle the values that are specific to each table.

Next we need the output datasets for Azure Data Lake Store. We use the same three properties in generating each dataset- schema, table, frequency- and we add one more for scope.

Now we just need another BimlScript file that calls these two files. We broke our pipelines up into daily versus hourly and incremental versus full loads.

We used a helper code file and a separate environments file, which I’m glossing over so we can focus on the Biml for the ADF assets.  You can see that we read in the inputs from Excel and write some counts to a log file, just to make sure everything is working as intended. Starting on line 41 is where we generate the datasets. On lines 54 and 55, we use the CallBimlScript function to call the two files above. We end up generating datasets for the tables that are a full load each day and their counterpart datasets for the files we create in ADLS. The datasets for daily incremental loads are generated on lines 69 and 70. Then we do the hourly full loads and hourly incremental loads.  I’ll discuss lines 100 – 119 in my next post.

The Results

We were able to write the BimlScript and generate the datasets and pipelines in about 35 hours. A previous ADF project without automation took about 3 hours per source table. If we had gone that route, we could have been looking at 350 – 500 hours to complete this part of the project. Visual Studio with Biml Express took about 5 minutes to generate everything. Deploying to Azure took about an hour. We are now looking into ARM templates for future deployments.

Stay tuned for part 2 where I show how we generated the ADF pipelines.

Trekking through the DAX Jungle In Search of Lost Customers

I like to think I’m proficient at writing DAX and building SSAS tabular models. I enjoy a good challenge and appreciate requirements that cause me to stretch and learn. But sometimes I hit a point where I realize I must go for help because I’m not going to complete this challenge in a timely manner on my own.  I think this is part of how most IT professionals operate.

This, but replace adultier adult with DAX-ier (more knowledgeable?) BI Consultant

Luckily, I am part of a great SQL Server community, and I also work with some really smart people.

I was testing some calculations in my tabular model when I noticed that my Lost Customers calculation wasn’t working as expected. This was rather interesting to me since the calculation I was using was from DAXPatterns.com.  After some experimentation, I determined that the difference between my data situation, and the situation that fit the DAX Pattern was that my customer dimension was a Type 2 Slowly Changing Dimension. That meant I couldn’t use the customer key to identify a unique customer, since each customer could have multiple customer keys  (surrogate keys) with the same customer number (business key). Specifically, if a customer made a purchase in December 2015, then changed their name, then made a purchase in January and February, my calculation was counting them as lost because it was not recognizing that the customer with the new name and same customer number was actually the same customer.

My original calculation that didn't properly handle Type 2 SCDs:
 Lost Customers :=
 IF (
     NOT (
         MIN ( 'Date'[Full Date] )
             CALCULATE ( MAX ( Sales[Invoice Date] )ALL ( Sales ) )
     ),
     COUNTROWS (
         FILTER (
             ADDCOLUMNS (
                 FILTER (
                     CALCULATETABLE (
                         ADDCOLUMNS (
                             VALUES ( Sales[CustomerNo] ),
                             "CustomerLostDate"                             CALCULATE ( MAX ( Sales[Invoice Date] ) )
                             + [Lost Days Limit]
                         ),
                         FILTER (
                             ALL ( 'Date' ),
                             AND (
                                 'Date'[Full Date] < MIN ( 'Date'[Full Date] ),
                                 'Date'[Full Date]
                                     >= MIN ( 'Date'[Full Date] ) 
                                         - [Lost Days Limit]
                             )
                         )
                     ),
                    AND (
                         AND (
                             [CustomerLostDate] >= MIN ( 'Date'[Full Date] ),
                             [CustomerLostDate] <= MAX ( 'Date'[Full Date] )
                         ),
                      [CustomerLostDate] <= 
                       CALCULATE (MAX ( Sales[Invoice Date] )ALL ( Sales ) )
                     )
                 ),
                 "FirstBuyInPeriod"CALCULATE ( MIN ( Sales[Invoice Date] ) )
             ),
             OR ( 
                 ISBLANK ( [FirstBuyInPeriod] ), 
                 [FirstBuyInPeriod] > [CustomerLostDate] 
              )
         )
     )
 )

Having put forth a good effort on my own and not wanting to go down the rabbit hole for several days, I decided to post my issue to the Power BI forums where Marco Russo (one of the authors of DAXPatterns.com) was able to help me.
Based upon his suggestions, my final calculation ended up as shown below. Only the bolded part (the innermost ADDCOLUMNS function) was changed

Lost Customers :=
 IF (
     NOT (
         MIN ( 'Date'[Full Date] )
             CALCULATE ( MAX ( Sales[Invoice Date] )ALL ( Sales ) )
     ),
     COUNTROWS (
         FILTER (
             ADDCOLUMNS (
                 FILTER (
                     CALCULATETABLE (
                         ADDCOLUMNS (
                              CALCULATETABLE ( 
                                 VALUES ( Customer[Customer No] ), Sales ),
                                 "CustomerLostDate"CALCULATE (
                                     MAX ( Sales[Invoice Date] ),
                                     ALLEXCEPT ( Customer, Customer[Customer No] )
                              )
                                  + [Lost Days Limit]
                          ),
                         FILTER (
                             ALL ( 'Date' ),
                             AND (
                                 'Date'[Full Date] < MIN ( 'Date'[Full Date] ),
                                 'Date'[Full Date] >= 
                                     MIN ( 'Date'[Full Date] ) - [Lost Days Limit]
                             )
                         )
                     ),
                     AND (
                         AND (
                             [CustomerLostDate] >= MIN ( 'Date'[Full Date] ),
                             [CustomerLostDate] <= MAX ( 'Date'[Full Date] )
                         ),
                         [CustomerLostDate] <=  
                            CALCULATE ( MAX ( Sales[Invoice Date] )ALL ( Sales )                            )
                     )
                 ),
                 "FirstBuyInPeriod"CALCULATE ( MIN ( Sales[Invoice Date] ) )
             ),
             OR ( 
                ISBLANK ( [FirstBuyInPeriod] ), 
                [FirstBuyInPeriod] > [CustomerLostDate] 
             )
         )
     )
 )

In more human-friendly terms, I think this calculation does the following:

  • Create a table of customer numbers that have sales, along with their customer lost date, which is defined as the max invoice date for that customer number after removing all context filters on the customer table except for customer number + the number returned by Lost Days Limit (in my case, 60)
  • Filter the dates on the calculated table such that the [min invoice date minus 60] is less than the selected date which is less than and the minimum invoice date
  • Filter the dates on the calculated table such that the customer lost date is between the minimum invoice date and the maximum invoice date selected and is less than the max invoice date for all sales.
  • Add a column called FirstBuyInPeriod that contains the min invoice date for the selected dates
  • Filter the calculated table such that the minimum invoice date in the selected dates is blank or is greater than the customer lost date

This worked fine and gave me the answer I was looking for, but was (as noted on the DAX Patterns page) rather computing-intensive.

My coworker, Javier Guillen offered a great alternative to Marco’s pattern.

First Date Selected :=
 FIRSTDATE ( DATEADD ( 'Date'[Full Date], 0MONTH ) )
Last Date Selected :=
 LASTDATE ( DATEADD ( 'Date'[Full Date], 0MONTH ) )
Customer Lost Date :=
 CALCULATE (
     MAX ( Sales[Invoice Date] ),
     CALCULATETABLE (
         Sales,
         FILTER (
             ALL ( Customer ),
             Customer[Customer No] = MAX ( Customer[Customer No] )
         ),
         FILTER ( ALL ( 'Date' ), 'Date'[Full Date] <= MAX ( 'Date'[Full Date] ) )
     )
 )
     + [Lost Days Limit]
Lost Customer :=
 SUMX (
     VALUES ( Customer[Customer No] ),
     IF (
         [Customer Lost Date] >= [First Date Selected]
             && [Customer Lost Date] <= [Last Date Selected],
         1
     )
 )

These calculations do the following:

  • First Date Selected: Calculate the minimum date of the dates selected
  • Last Date Selected: Calculate the maximum date of the dates selected
  • Customer Lost Date: Create a table based upon the Sales table that returns the max invoice date + [Lost Days Limit]  with a filter on customer to get the max customer number for each customer record (this is how we alleviate the type 2 SCD issue) and a filter on dates to get dates less than the max selected date
  • Lost Customer: For each customer number, count 1 if the customer lost date is between First Date Selected and Last Date Selected, otherwise count 0. Return the sum.

Javier’s suggestion is less resource intensive and returns results more quickly. But it requires that the date range you choose be contiguous. This means that I can’t create a pivot table and choose Jan and February of 2015 and January and February of 2016 and expect to get a correct answer. That is kind of an edge case situation that might not even be an issue, but I already had a situation for my project where YOY analysis of customer counts was needed, so I opted to go with Marco’s suggestion for now. If you don’t have this use case, I think Javier’s calculations are much easier to understand and implement. You could probably put some conditions around it to blank out if the dates are not contiguous and it would be ready to go.

It was cool to find something that wasn’t already covered by a common DAX pattern and wasn’t frequently blogged about, and I’m grateful for the insight from the solutions offered by both Marco and Javier.

Creating a Matrix in Power BI With Multiple Values on Rows

This week I was asked to create a matrix in a Power BI report that looks like this:

Matrix with Values on Rows
Matrix with Values on Rows (numbers faked to protect the innocent)

To my surprise, Power BI only lets you put multiple values on columns in a matrix. You can’t stack metrics vertically. Note: this is true as of 8 Jan 2016 but may change in the future. If you agree that this should be a feature in Power BI, please make your voice heard and vote for this idea on the Power BI forum and encourage others to vote for it as well.

Since I needed to finish this report, I got a little creative with a disconnected table and some DAX to implement a workaround in my tabular model.

The Disconnected Table

I created a table called Row Order in my tabular model using the following SQL Query as a source.

select 1 as [Row Order], 'Lost Customers' as [Customer Metric]
union all
select 2 as [Row Order], 'Inactive Customers' as [Customer Metric]
union all
select 3 as [Row Order], 'New Customers' as [Customer Metric]

I set the Sort By Column property on the Customer Metric column to Row Order and hid the Row Order column.

Calculated Measures

I already had the base metrics for my matrix:

  • Lost Customers – Week (customers considered lost in the current week, based upon the selected date)
  • Lost Customers – Month (customers considered lost in the current month, based upon the selected date)
  • Lost Customers – Week (customers considered lost in the current month, based upon the selected date)
  • Inactive Customers – Week (customers considered to have moved to inactive status in the current week, based upon the selected date)
  • Inactive Customers – Month (customers considered to have moved to inactive status in the current month, based upon the selected date)
  • Inactive Customers – Quarter (customers considered to have moved to inactive status in the current quarter, based upon the selected date)
  • New Customers – Week (customers who have made their first purchase in the current week, based upon the selected date)
  • New Customers – Month (customers who have made their first purchase in the current month, based upon the selected date)
  • New Customers – Quarter (customers who have made their first purchase in the current quarter, based upon the selected date)

As a side note, when creating time-based measures I prefer to create the measure such that it accepts a date and then filters off of that rather than assuming everyone wants to see the current week/month/quarter. I then add an Is Today field on my date table so if someone does want to see the current week they easily select that in a filter on this field. This means that I can see lost customers for any week, not just the current. Maybe I want to filter my dashboard to the end of December. If I had just created a measure that assumed I wanted last month, I wouldn’t be able to just change my date filter and get the data I want.  As another example, lots of people create a metric to calculate sales in the last 6 weeks. But many people assume they want the last 6 weeks from today rather than allowing the user to choose a date and then calculating sales in the 6 weeks leading up to that day. I much prefer the more flexible calculation.

But having this flexibility in my calculations meant that I couldn’t just put some calculated column for time period in my date dimension. And even if I had made the assumption that I wanted to see the data as of today, it would probably be a separate table or a many-to-many relationship (e.g., 5 Jan 2016 would be in the last week, month, and year).

With my new disconnected table added, I just needed to add a few more measures. First, I needed something to check which customer metric I should be showing:
RowOrder:=Max('Row Order'[Row Order])

I hid the RowOrder measure. Then I created one measure for each time frame (week/month/quarter) that would select the correct measure based upon the Customer Metric on that row.
Weekly Customer:=IF(HASONEVALUE(Row Order[Row Order]), SWITCH([RowOrder],1,[Lost Customers - Week], 2,[Inactive Customers - Week], 3,[New Customers - Week],BLANK()), BLANK())
Monthly Customers:=IF(HASONEVALUE(Row Order[Row Order]), SWITCH([RowOrder],1,[Lost Customers - Month], 2,[Inactive Customers - Month], 3,[New Customers - Month],BLANK()), BLANK())
Quarterly Customers:=IF(HASONEVALUE(Row Order[Row Order]), SWITCH([RowOrder],1,[Lost Customers - Quarter], 2,[Inactive Customers - Quarter], 3,[New Customers - Quarter],BLANK()), BLANK())

Power BI Matrix

I opened my report on Power BI.com and created my matrix.

I put Customer Metric in Rows and Weekly Customers, Monthly Customers, and Quarter Customers in Values. Then I filtered my table to As of Today = “As Of Today”.

PBI Matrix
Fields List for Power BI Matrix

That gives me the matrix I wanted to create. I don’t love that I had to add the extra table, but it works for now until I can achieve this natively with Power BI.

Power BI Matrix
Power BI Matrix using disconnected table and calculated measures (numbers changed to protect the innocent)

 

I’m Speaking At IT/Dev Connections

I am honored to be part of the fantastic group of speakers in the Data Platform & Business Intelligence track at IT/Dev Connections this year. IT/Dev Connections is happening September 15 though 17th (with Pre-Cons on the 14th) in Las Vegas, Nevada at the ARIA Resort & Casino. The abstracts for my two presentations are below.

Overcoming Chartaphobia: Moving Your Organization Toward Interesting and Enlightening Data Visualization

Do your organization’s reports consist mostly of giant tables of data? Perhaps you’ve gotten as far as adding KPI indicators to the tables. Although tables of data presented as reports are commonly found in many organizations, they might be doing you and your users a disservice. You should optimize your data visualization just as you optimize your databases. Reports are a user interface that must be purposefully designed to help users achieve their goal of making informed decisions. In this session we’ll review some data visualization best practices and apply those concepts to the use of Microsoft reporting tools such as Excel, Power View, and SQL Server Reporting Services. You’ll leave with guidance on how to take boring, unreadable tables of data and turn them into useful and visually appealing reports.

Who Needs SSAS When You’ve Got SQL?

For SQL and .NET developers, SQL Server Analysis Services might seem foreign or unnecessary; however, it can offer many advantages for reporting and data exploration. In this session we’ll cover useful features of SSAS and discuss conditions under which it’s beneficial. Next, we’ll compare the two types of SSAS databases (multidimensional and tabular) and cover how to decide which type is right for your solution. Then we’ll explore common ways to retrieve data and browse your SSAS database, such as Excel, reporting tools like SSRS, and custom .NET applications.

Join Me at IT/Dev Connections

If you haven’t registered to attend, it’s not too late! And I have a discount code for you: Use discount code SPKRSOC15 to get $400 off.

In addition to my sessions, I’m excited about several others that are great topics and great speakers.

I hope to see you there!

 

New Speaking Opportunities This Summer

I’ve enjoyed speaking at several SQL Saturdays and some local user group meetings over the past couple of years. This summer I’ve been invited to speak at some new venues.

First up is a webinar through the PASS Business Analytics Virtual Chapter.  The PASS BA VC is a great place to find free online training each month. I’ll be talking about The Accidental Report Designer: Data Visualization Best Practices in SSRS on Thursday, June 19th at 11AM CDT. If you are interested, please register here.  This is my first webinar, so I am interested and excited to try out this format. Here is the abstract for my webinar:

Whether you are a DBA, a developer, or an analyst, there is a good chance that you will have to create reports for coworkers, company executives, or clients. As with any UI design, careful consideration should be given to your data visualization design to ensure you are effectively communicating the intended message and providing a good user experience. While the principles are applicable across reporting platforms and tools, this session will contain demos and examples implemented in Reporting Services using SQL Server Data Tools. Learn how to make information (not just data) the focus of your report and provide your audience with something better than just shiny!
Session Level: Intermediate

I will be speaking at Mile High Tech Con in Denver, Colorado on July 25th and 26th . It features 3 days of content about data warehousing and business intelligence.  This is my first non-PASS event, and I’m looking forward to visiting Denver once again. The cost of the conference is fairly reasonable, so you might consider it if you are looking for a good BI conference to attend.  The pre-cons cost extra, but the main conference on Friday and Saturday is $475.  You can use my discount code for $40 off of that price: 14SK_MeaganLongoria

I hope you will join me at both of these great events.

The One Book

I tend to get some variation of the following question as I present at SQL Saturdays and work with clients:

What is the one book I should read to gain a good understanding of this topic?

There are many great books out there on business intelligence, data warehousing, and the Microsoft BI stack.  Here is my list of those that had a significant impact on my learning and career or that I consider the current defining book on the subject.

Data Warehousing/Dimensional Modeling –  The Data Warehouse Toolkit by Ralph Kimball and Margy Ross (link is to the newest edition of the book) defines dimensional modeling, walks through many common data scenarios and explains recommended data modeling and ETL practices. If you are designing, building, or enhancing a dimensional data warehouse, please read this book.

Master Data ManagementMaster Data Management by David Loshin thoroughly explains the resources required, helps clarify goals, and discusses the challenges of implementing a master data management program in your organization. The book also discusses architecture options and provides a roadmap that you can adapt to your project. I actually did an independent study class based upon this book as part of my MBA. I would recommend it to anyone just getting into MDM as a great platform agnostic view of MDM.

Agile Data WarehousingAgile Data Warehousing Project Management: Business Intelligence Systems Using Scrum by Ralph Hughes is an enlightening read about applying the agile process to data warehousing by someone who understands both.  Use this book to help set expectations at the beginning of projects, estimate effort, decide what should be included in an iteration, and provide value early in your project.  It addresses DW enhancements as well as new DW projects.

SSISProfessional Microsoft SQL Server 2012 Integration Services contains great information whether you are just learning SSIS or you just need to understand the new features in 2012. It covers each task and data flow component, offers tips for performance tuning, and provides great examples.  If you haven’t moved up to SQL Server 2012 yet, you will want to learn about the project deployment model, environment variables, and project-level connection managers, which are all covered in this book.

SSAS TabularMicrosoft SQL Server 2012 Analysis Services: The BISM Tabular Model by Marco Russo, Alberto Ferrari, and Chris Webb
There aren’t that many books available on SSAS tabular models since it is fairly new. This one got me through a couple of projects as I learned tabular cubes and DAX.  You will read almost every page while building your first tabular cube.  Then you will return to find helpful tips on writing DAX calculations.

Data Visualization –  Information Dashboard Design by Stephen Few (link is to the newest edition of the book)
This book is what got me interested in data visualization.  The way I design reports and dashboards changed after I read it. I was torn whether to go with this book or Show Me the Numbers, but I think Information Dashboard Design covers the main points of Show Me the Numbers that I most want people to understand (it was also the first of Few’s books that I read). Today’s technology enables us to make reports that are high on shiny and low on actionable information. I see two troubling trends: reports that are just wide tables with several hundred lines that no one can read at one time, and really shiny dashboards with lots of bells and whistles that make them look really cool. This book will tell you there is an appropriate time and place for tables and dashboards.  And while visual appeal is good, it should not be at the expense of effectively communicating the intended message of the information.

Up Next

I just picked up some books that look very promising and may cause me to update my list above:

 

Happy reading!

I’m speaking at SQLSaturday #236 in St. Louis

I’m excited to be speaking at SQLSaturday #236 in St. Louis, MO on August 3, 2013.  I will be presenting a session on Excel Cube Functions.  Excel Cube functions have been around for a while and in my opinion do not get the acclaim they deserve.  I learned to use them in my first job out of college on the financial reporting team and became a reporting rock star (back before shadow IT and self-service BI were cool).  They were also an intro to MDX for me.  This was the beginning of a path that led me to my current job as a Business Intelligence consultant. Excel cube functions are still very relevant today, so I’m spreading the good word.  Here’s my session abstract:

Don’t Miss Out on Excel Cube Functions

Cube functions have existed since Excel 2007, but they’re not as well known nor as well publicized as other methods to obtain data from Analysis Services/PowerPivot sources. Business Analysts may feel confined to tabular (symmetrical) reports with pivot tables. BI Developers may avoid Excel because they want the ability to write calculations in MDX and the extensive formatting capabilities of Reporting Services. And Power View limits you to the sets, calculations, and captions available in the data source and doesn’t provide drillthrough detail. We’ll start with the basics of how to write and use cube functions, then employ them to build a nicely formatted, asymmetrical, parameter-driven report that overcomes the previously noted limitations. In addition to facilitating new methods for data retrieval and report design, learning cube functions can help business analysts understand the MDX language and BI developers appreciate the power of Excel’s formulas and formatting options.

I’m also participating in a joint effort with the amazing Bill Fellows (@billinkc on twitter, blog at http://billfellows.blogspot.com/) about BIDS Helper. We both think BIDS Helper is a must-have for BI developers. We’ll walk you through our favorite features and talk about real development situations where BIDS Helper was useful in each part of the process of developing a BI solution (ETL, OLAP cube, and reports). As with most Bill Fellows sessions, there will probably be candy involved.

Somebody got BIDS Helper in my Data Tools

Do you develop SSIS packages, SSRS reports or SSAS cubes (multidimensional or tabular)? Then this class is for you! We’ll cover the capabilities this free tool empowers you to perform as well as address some of the grating little quirks of BIDS/SSDT.

Analysis Services Multidimensional:

  • Roles Report
  • Printer Friendly Dimension Usage

Analysis Services Tabular:

  • Tabular Actions Editor
  • Tabular Display Folders
  • Tabular HideMemberIf

SSRS:

  • Delete Dataset Cache Files

SSIS:

  • Deploy packages from BIDS/SSDT
  • Identify items with configuration/expressions
  • Create packages based on templates
  • Find differences between packages

Learn about this perfect combination from Meagan Longoria and Bill Fellows and you’ll never want to have your BIDS without the helper!

You can also catch Bill at PASS Summit 2013.  He will be presenting Type More, Click Less: Programmatically Build SSIS Packages. After SQLSaturday #236, it will be time to get ready for SQLSaturday #191 here in Kansas City on September 14.  I enjoyed organizing the Kansas City SQLSaturday last year so I’m helping once again this year.  If you are in the Kansas City area (or can travel here), I hope you will join in on the learning, networking, and fun. The call for speakers is open until July 16th if you would like to present.