Data Visualization, DAX, Excel, Microsoft Technologies, Power Pivot

Power Pivot: Dynamically Identifying Outliers with DAX

I have been working on a project in which we were looking at durations as an indicator of service levels and customer satisfaction, specifically the maximum duration and average duration. We built a small dimensional data mart in SQL Server to house our data, then pulled it in to Excel using Power Pivot so we could make a few Excel and Power View reports and provide the client with the model for further ad hoc reporting and analysis. After the first iteration of development, we got feedback from the client that there were several outliers that were skewing the metrics. They wanted to see average and max duration for whatever group of objects they had chosen and then see the calculations after removing the top and bottom 5% of values.

Here is a small sample of my fact table, called Object Processing.  Field names and some small details have been changed to protect the innocent.

ObjectKey StatusKey CreationDateKey GeographyKey Duration
1 1 20140501 1 2193218
2 1 20140503 2 2114354
3 2 20140429 3

The first four columns are foreign keys to dimensions. The object dimension contains an object ID, name, owner, and content group. The status dimension contains an ID and description of the current processing status. The date dimension is a standard date dimension representing the date the object was created. The geography dimension contains an ID, country, and region. The duration column represents the number of seconds the object has spent processing. A new row is inserted into the fact table for each object created.

First, I have a few simple DAX calculations. These calculations work great on their own until you need to remove the outliers:

Total duration:= Sum('Object Processing'[Duration])
Avg duration:= Average('Object Processing'[Duration])
Max duration:= Max('Object Processing'[Duration])

To remove the outliers we need to rank the objects by duration. I first read Alberto Ferrari’s blog post on dynamic ranking in Power Pivot. It demonstrates a method to rank items within the table as a calculated column, which didn’t quite work for me because I wanted to dynamically rank items after filters were applied (ex: rank only the objects with a create date of yesterday and a status of completed). Then I found a series of posts on PowerPivotPro.com about Dynamic TopN Reports Via Slicers (parts 1, 2, 3, 4). That was what I needed to work through my similar problem.

First, I created a calculated measure for ranking the objects. If an object had no duration (because it had been created but had not yet reported a duration), I wanted it to be blank rather than rank as the lowest duration. In this formula, I’m ranking Object IDs in descending order based upon my total duration calculation.

Object Duration Rank:=
           if (ISBLANK([Total Duration]), BLANK(),
           RANKX(ALLSELECTED(Object[Object ID]), [Total Duration], ,0))

Next, I needed to identify the top and bottom 5%. First, I counted the total number of objects excluding null durations:

Object Count For Outliers:=
               CALCULATE(DISTINCTCOUNT('Object Processing'[ObjectKey]),
               ALLSELECTED('Object Processing'),
               NOT(ISBLANK('Object Processing'[Duration])))

Then I calculated the number of objects in the top 5%, giving me the rank of the last member of my top outliers group:

Outlier Count:=round([Object Count For Outliers]*.05,0)

To get the rank of the first member of the bottom outliers group I calculated:

Low Outlier Rank:=[Object Count For Outliers]-[Outlier Count]

Finally, I wrote a calculation that determines if the object is in range (not an outlier):

Is In Range:=
               if(ISBLANK([Total Duration]), BLANK(),
               if(Object Duration Rank]>[Outlier Count] && 
               [Object Duration Rank]<[Low Outlier Rank], 1, 0))

Now I could filter my pivot table containing the Is In Range calculation for only those values equal to 1 to see the average and max duration with outliers removed.

Ultimately, I produced a summary report similar to this:Object Duration SummaryWe were able to compare object durations for objects created in the last 1, 7, and  30 days, including and excluding outliers. The conditional highlighting is set so that cells are highlighted if the duration for that time period is greater than the duration for the prior time period listed below it. For example, average object duration for yesterday is 5:06:02, which is greater than the average duration for the last 7 days of 4:54:14.

I’m still working on a DAX calculated measure that will return the average object duration excluding outliers in one fell swoop, rather than needing to filter out the outliers in a pivot table.