DAX, Microsoft Technologies, Power BI, Power Query

Calculating Age in Power BI

In week 26 of Workout Wednesday for Power BI, I asked people to calculate the age of Nobel laureates at the time they received the award. I provided some logic, but I didn’t prescribe how to create the age calculation. This inspired a couple of questions and a round of data validation as calculating age may be trickier than you think. In this post, I’ll explore some of the ways people have calculated age in Power BI and the edge cases where those calculations may not work.

In my solution video for Workout Wednesday, I used Power Query to calculate age. This was inspired by several blog posts and videos I had seen previously. There is an Age menu option in the Power Query editor under Date.

Calculating Age with the Power Query Editor user interface

When you select a date column and use that Age option, it calculates the duration between the selected date and the current date in days. You must then replace the current date with the second date column. Next you can choose Total Years under Duration, which divides the days by 365. Finally, you must round that number down to the next integer to get years.

If you follow Ruth’s video, you can do all of that in one step that creates a custom column with the final age value.

 Number.RoundDown(Duration.TotalDays([Date2] - [Date1])/365) 

That is the most common option in Power Query as there is no DateDiff function.

There are a few options for calculating age in DAX. Some people use the DATEDIFF function.

Age DateDiff = DATEDIFF([Date1],[Date2],YEAR) 

Another way I have seen is to use YEARFRAC function.

Age YearFrac = INT ( YEARFRAC ( [Date1], [Date2], 1 ) )

The way Marco Russo suggests is to use QUOTIENT.

Age Quotient (DAX): 

Age Quotient = 

VAR Birthdate = [Date1]

VAR ThisDay = [Date2]

VAR IntBirthdate = YEAR ( Birthdate ) * 10000 + MONTH ( Birthdate ) * 100 + DAY ( Birthdate )

VAR IntThisDay = YEAR ( ThisDay ) * 10000 + MONTH ( ThisDay ) * 100 + DAY ( ThisDay )

VAR Age = QUOTIENT ( IntThisDay - IntBirthdate, 10000 )

VAR CheckedAge = DIVIDE ( Age, NOT ISBLANK ( Birthdate ) )

RETURN

    CheckedAge

As Marco points out, many people were using YEARFRAC, but there is a bug in the DAX implementation that causes it to occasionally return an incorrect answer for this purpose.

Checking the Numbers

I created a Power BI file to demonstrate the differences in these four calculations. You can download the file here. The image below displays the results in several tests. For each row, I’m using Date1 as the birthdate and Date2 as the “as of” date. You’ll notice that I focused on leap years for a few cases.

Table in Power BI with 10 date ranges showing the results from the four calculations. 6 of the 10 rows have different results across the calculations.
Example date ranges and result of the four age calculations

There are six of ten date ranges that have different results across the different calculation methods.

In the second row, the Power Query age calculation says that Feb 29 to Feb 28 in the following year is a full year. This may or may not be what you want depending on your requirements. I’m noting the difference so you can be aware. A similar thing occurs in the fifth row going from Feb 29, 2016 to Feb 28, 2020, and again on the 9th row going from March 1, 2019 to Feb 29, 2020.

On the third row, notice that the DAX DATEDIFF function calculates Feb 29 to Feb 27 of the following year to be a full year, despite it being a day or two short. Depending on what you do with leap years, you might consider Feb 29 to Feb 28 in the following year to be a full year, but that third row result means DATEDIFF is probably not the calculation I want. We see a similar result going from March 1 to Feb 28 of the following year.

YEARFRAC calculates that Feb 29 to Feb 28 in the following year is not a full year, which may be desirable. But it counts Feb 29, 2016 to Feb 29, 2020 as only three years. And we see that March 1, 2000 to March 1, 2021 is only counted as 20 years. So even without starting on a leap year, we get some incorrect results. Small numbers seem to be correct until it gets to about 13 years.

Using the QUOTIENT function provides what I consider to be the most correct results. It calculates Feb 29 to Feb 28 of the following year to be less than a year. It calculates Feb 29, 2016 to Feb 28, 2016 to be three years and not four. And it calculates March 1 to Feb 29 of the following year to be less than a year.

Which to use?

The QUOTIENT formula produces the most accurate results if you don’t want Feb 29 to Feb 28 the next year to be counted as a year. DATEDIFF and YEARFRAC produce too many incorrect results for me to ever suggest using them. Since there is a DAX option that produces more correct answers, I would just go for QUOTIENT instead of either of these two.

UPDATE: There is a better alternative! Imke Feldmann reminded me that there is an Number.IntegerDivide function in Power Query. So let’s take the logic from Marco’s DAX calculation and move it to Power Query:

(BirthDate as date, EndDate as date) =>
let
BirthDateInt = Date.Year(BirthDate)10000 + Date.Month(BirthDate)100 + Date.Day(BirthDate),
EndDateInt = Date.Year(EndDate)10000 + Date.Month(EndDate)100 + Date.Day(EndDate),
Age = Number.IntegerDivide((EndDateInt - BirthDateInt),10000)
in Age

The Power Query custom column created by invoking this function should produce better compression than a DAX calculated column. This might not be significant for a small dataset, but we should be efficient when we can.

Microsoft Technologies, Power BI, Power Pivot, Power Query, SSAS

Documenting your Tabular or Power BI Model

If you were used to documenting your SSAS model using the MDSchema rowsets, you might have noticed that some of them do not work with the new tabular models. For example, the MDSCHEMA_MEASUREGROUP_DIMENSIONS DMV seems to just return one row per table rather than a list of the relationships between tables.

Not to worry, though. With the new compatibility level 1200 SSAS Tabular models came some new DMVs. They are, unfortunately, not documented at this time (as of 3 Oct 2016). I expect them to show up here at some point in the future. Until then I published a Gist that provides a list of those views as well as some notes on what kind of useful info is in each one. While there are 36 TMSCHEMA DMVs, the main DMVs you will want to check out for model documentation purposes are:

  • TMSCHEMA_COLUMNS
  • TMSCHEMA_MEASURES
  • TMSCHEMA_MODEL
  • TMSCHEMA_RELATIONSHIPS
  • TMSCHEMA_TABLES

There are a few fields that show IDs rather than descriptions (e.g., ExplicitDataType in TMSCHEMA_COLUMNS, Type in TMSCHEMA_DATA_SOURCES). I’ve been able to figure out the descriptions that correspond to some of the values, but I’m still missing a few.

As with the MDSCHEMA DMVs, the TMSCHEMA DMVs can be used to document your model. This also works with Power BI models. You’ll just need to change the connection information to your Power BI instnce while the .pbix file is open in Power BI Desktop. For more information analyzing your Power BI model, see Chris Webb’s post. Since DMVs require DMX queries, you are somewhat limited in your ability to join the data from the DMVs together. This makes it necessary to store the data somewhere before transforming and merging them. Since I needed to document a tabular model for a client, I created a Power Pivot model in Excel 2016. You can download my documentation model here.  My example model was connected to a quick demo SSAS model I made based upon the Wide World Importers DW database.

My Power Pivot Model

My Power Pivot model uses Power Query to import the data from the DMVs. This enables me to merge data into a model that I feel makes the most sense and doesn’t suffer from too many relationships and hidden tables (which is what would happen if you just imported the results of each DMV query into the model). In order to make this work for any model, I employed a trick I learned from Chris Webb to avoid duplicating connection information across queries. I built my model with the normal embedded connection strings and then converted them to use the connection Connection Info tables. As I did that, I ran into a Power Query limitation as evidenced by the error message below.

“Formula.Firewall: Query ‘QueryName’ (step ‘StepName’) references other queries or steps and so may not directly access a data source. Please rebuild this data combination.”

Ken Puls has a great explanation of how to get around this issue. Basically, you have to redesign your queries so that you have “staging tables”, which you can then use to build finalized tables. This redesign gets you around the error since the finalized tables don’t have the embedded source connection, just a reference to the results from another query.

This Excel file documents:

  • Database and model info
  • Tables and source queries
  • Columns
  • Hierarchies
  • Measures
  • KPIs
  • Security roles, membership, and permissions
  • Relationships
  • Perspectives and the columns, hierarchies, and measures they contain

The Database Info tab uses cube functions to display the values shown. The rest of the tabs use pivot tables and slicers.

You can download and use this model if you have Excel 2016 or 2013 with Power Query. Just download the file, change the values in the TabularInstanceName and TabularDBName queries, and refresh. I was able to open the file and update the queries with no issues and no changes in Excel 2013.

This is great if you just need documentation or a quick way to look up specific information. But we could really analyze this using visualizations other than tables in Power BI Desktop.

My Power BI Model

I imported the Excel model into Power BI, made a few adjustments, and then added some visualizations. This report is a bit paired down from a visualization standpoint compared to the Excel file because I eliminated some of the data that wasn’t analytically relevant for me.

Click here to view this report in Power BI.

On the first page, I have several summary numbers that tell me about the contents of the model. It gives me a decent idea of the size and complexity of the model.

tabular-doc-summary

The second page uses a bar chart as an interactive filter against a list of table source queries and columns.

The third page shows relationships between the tables. Slicers on the left filter both the force directed graph and the table. Smart filters at the top of the page let you filter the Table and To Table in the relationships.

tabular-doc-relationshipsThe fourth page shows the details behind each measure. A column chart at the bottom provides a count of measures by table and whether the measure is hidden, which acts as another interactive filter.

The last page shows security by role and table, again using column charts as interactive filters.

Check out the Demo Day video I made on the BlueGranite blog for further explanation and a demo of the Tabular Model documentation.

Update 19 Sep 2017: I have turned the Power BI report into a Power BI Template for easier use. You can download it here

Microsoft Technologies, Power BI, Power Query

Using Power Query to Transform Website Data with Multiple Rows Per Entity

My colleague Hope Foley and I both enjoy a good craft beer. She has a great presentation on spatial data in SQL Server, which contains data about breweries. She mentioned she was gathering new data to add to her brewery database and showed me the brewery listing on the Brewery Collectibles Club of America website. This web page presents a challenge because of the way the data is laid out. Rather than having a table with one row per brewery, the web page uses one table per brewery with two rows in each table.
pqbeerwebsite
I typically use Microsoft Power Query to scrape data from websites for further analysis. Although this requires a bit of creativity (and some M), I knew Power Query was up to the task. And it gave me a chance to use my favorite Power Query function: Unpivot.

I opened up Excel and established my data source in Power Query. I chose From Web and entered the url: http://www.bcca.com/services/brewery_listing.asp. Once the data source loaded, I could see that there were over 3,000 tables within the page. pqbeertablemenu

Choosing table 0 showed me the columns headings for the overall table on the page.

PQbeertableheader

I could see the headings for the 6 fields, arranged in 2 rows as we see on the webpage.  I decided to figure out how to transform this table into the one row I needed and then figure out how to automate my process so I could reproduce the results for each subsequent table. I used the Power Query GUI to transform my table into 1 row with the following steps.

  1. On the Add Column tab, choose Add Index Column -> From 0.
  2. Select the Index column. On the Transform Tab, choose Unpivot Columns -> Unpivot Other Columns. This creates a table with 3 columns: Index, Attribute, and Value.
  3. Select the Index and Attribute columns. On the Home tab, choose Remove Columns -> Remove Columns.
  4. On the Transform tab, choose Transpose.

 

This creates the following M code, which you can see by clicking on Advanced Editor on the View tab.

let
 Source = Web.Page(Web.Contents("http://www.bcca.com/services/brewery_listing.asp")),
 Data0 = Source{0}[Data],
 #"Changed Type" = Table.TransformColumnTypes(Data0,{{"Column1", type text}, {"Column2", type text}, {"Column3", type text}}),
 #"Added Index" = Table.AddIndexColumn(#"Changed Type", "Index", 0, 1),
 #"Unpivoted Other Columns" = Table.UnpivotOtherColumns(#"Added Index", {"Index"}, "Attribute", "Value"),
 #"Removed Columns" = Table.RemoveColumns(#"Unpivoted Other Columns",{"Index", "Attribute"}),
 #"Transposed Table" = Table.Transpose(#"Removed Columns")
in
 #"Transposed Table"

Next I used that query to create a function by changing just two lines.

(gettable) =>
let
 Source = Web.Page(Web.Contents("http://www.bcca.com/services/brewery_listing.asp")),
 Data0 = Source{(gettable)}[Data],
 #"Changed Type" = Table.TransformColumnTypes(Data0,{{"Column1", type text}, {"Column2", type text}, {"Column3", type text}}),
 #"Added Index" = Table.AddIndexColumn(#"Changed Type", "Index", 0, 1),
 #"Unpivoted Other Columns" = Table.UnpivotOtherColumns(#"Added Index", {"Index"}, "Attribute", "Value"),
 #"Removed Columns" = Table.RemoveColumns(#"Unpivoted Other Columns",{"Index", "Attribute"}),
 #"Transposed Table" = Table.Transpose(#"Removed Columns")
in
 #"Transposed Table"

I can see the number of tables available in my data source, so I can create a query that invokes the function for the required number of tables, in my case 3744. As a final touch, I used the first row as headers.

let
Source = Table.Combine(List.Transform({0..3744}, Query1)),
#"First Row as Header" = Table.PromoteHeaders(Source)
in
#"First Row as Header"

Et voilà!

PQBeerFinalList

I now have a single table of data with one row per brewery.

Excel, Microsoft Technologies, Power BI, Power Pivot, Power Query

Moving Calculated Measures in Power Pivot for Excel 2013

I learned a lesson the hard way: I shouldn’t change field names and data types in Power Pivot on tables that were imported using Power Query. My changes broke the connection between the two tools, so when I refreshed a query in Power Query that was set to load the results to my data model it caused a new/duplicate table to be created in my data model instead of updating the original table.

PPBlogTables

I already had Power View reports built off of my model that I didn’t want to rebuild, and I didn’t want to leave a duplicate table in my model, but I needed to make the data model refreshable. The issue can be fixed, but it’s a bit tedious:

  1. Create relationships for the new table to match those of the old table
  2. Rename the old table to something else
  3. Rename the new table to the original table name (usually it has a 1 after the table name)
  4. Rebuild or move any calculated measures to the new table
  5. Delete the old table

I was fine until I got to #4. You can’t just copy a calculated measure to a new table by cutting and pasting. You can copy the formula from the formula bar and paste it into a cell on another table, but you will get a warning that you can’t have two measures with the same name.

PPBLogDupName

Going this route means pasting your measure, renaming the old measure, and then renaming the new copy of the measure to the original name.  Due to the lack of the ability to filter data in Power Map, I had to create several calculated measures in my data model to limit the data shown in my map.  The idea of copying, pasting, and renaming several dozen measures was less than appealing.

It should be noted that in the current version of Power Pivot, you can put calculated measures on any table in the model without affecting the result of a query using the measure. The table in which it is located is more of an organization issue rather than a technical calculation issue.

I thought there must be a better way to move a measure. I looked through all of the buttons on the ribbon in the Power Pivot window and saw nothing helpful so I turned to Google (yes, I still choose Google over Bing). A TechNet article provided just the guidance I needed:

“To move a measure, you can cut and paste the formula in the Calculation Area. You can only move it to a different location in the Calculation Area of the table in which it was created. You cannot move a measure to different table in this view; instead use the Measure Settings dialog in the PowerPivot ribbon in Excel to change the association.”

The Measure Settings dialog box held the answer, although I found these settings under the Calculated Fields -> Managed Calculated Fields.  PPBlogManageFields

It contains a drop-down box that allows me to choose the table where the definition of the measure should be stored. This was a much quicker way to quickly move several calculated measures to another table in Power Pivot.

PPBlogcalcFieldSet

I have built several Power Pivot models and never run into this particular situation. I can’t say I’m glad it happened, but I can appreciate that it reminded me that there are useful things on the Power Pivot ribbon in the Excel window that aren’t accessible through the Power Pivot window.

Data Visualization, Excel, Microsoft Technologies, Power Query, Power View

Most Popular Names Visualized in Excel

A couple of months ago, I came across an article in the Atlantic that showed an animated gif of the most popular baby names by state by year. I decided to visualize this data using various add-ins and chart types in Excel.

Retrieving the Data

I found the data on the Social Security Administration website. This data showed the top 5 girl names in a table for each state for a given year. I needed to pull in multiple years, so I used Power Query and wrote a bit of M. I followed the guidance from a great blog post by Devin Knight:

  1. Choose From Web and enter the url for a page containing any year of data.
  2. Modify the M to create a function that accepts a year parameter and uses that to formulate the correct url for each year.
  3. Create a second query from scratch and provide an array of years.
  4. Convert this data to a table and insert a calculated column that pulls in the data from the first query.
  5. Expand the columns, rename columns, and change to appropriate data types where necessary.

For more detail, see Devin’s post. He does a great job of explaining and providing screenshots.

Once I had my data it was time to show it off.

Power Map

My first instinct was to use Power Map since its specialty is showing changes in geospatial data over time. After trying each type of chart in Power Map, I found that the region shading was the closest to what I wanted.

Power Map Baby NameAlthough I wasn’t just trying to duplicate the animated gif, I found it difficult to display this data in an equally effective way in Power Map due to the lack of ability to add data labels or customize annotations. There was no good way to show the most popular name on the state. I chose to use annotations to display the name. You can use data fields to populate the title and description of an annotation. The biggest drawback of using annotations is that there is no way to resize or relocate them. If I were to add annotations for each state, they would overlap and you wouldn’t be able to read many of them nor would you be able to see most of the map. I was able to add enough annotations that there was usually one annotation for each of the names shown on the map at any time. One other issue I noticed is that occasionally an annotation will fail to update on time, so the data in the annotation will be old and correspond to a previous year.  If you watch the video for my Power Map, you will see that the annotation for Virginia gets stuck on 1983 for a few years, but it eventually catches up. Update: Microsoft responded to my question on this.  We figured out that the annotations will update as expected if you set the Time to Date(Year) and then change the time settings to “Data Shows for an Instant”.  I have updated my workbook with these settings. You can watch the video of my Power Map tour here. Outside of the inability to add good labels, watching the states change color over time works well. You can have a legend that shows the names with the colors, but it will show every name from the data set and will not be an effective use of space.

Power View

Power View provides the ability to create bubble maps. I made a bubble map tiled by year.

Power View Baby NameIf you click on a year tile, you can then use the arrow keys to flip through the years; you can also use the scroll bar below the tiles. I think this provides a better visual than the Power Map for this situation since it is easier to see which names go with which states. Because of the bubble colors, I can see how the trend went from Jennifer to Jessica to Ashley in the 1980s. We also notice that there are 5+ top names from the 1990s forward, whereas the 1980s had 2 – 4.  The only real problem I encountered with the map in Power View was the limitation of the colors in the themes. When there are several names on the map, some of the colors can be very similar.  I tried several themes and couldn’t find one where this wasn’t the case.

Pivot Table

Next I decided to stray from maps and see if there were other chart choices that suited this data.  I created a pivot table with data bars and a timeline slicer.

Excel pivot baby nameI can still tell which baby names were most popular across the US and see how the trends change over time. The piece of information that I lose in moving away from maps is seeing how trends affect states that are geographically near each other. For instance, I can tell that Jessica was popular in 1985 in states such as California, Florida, New York, and Michigan. But I don’t immediately recognize that Jessica was popular only in New England, Florida, and the Mountain West and West Coast because Ashley had taken over the    Mid-Atlantic and Midwest. One advantage of the pivot table is that I can select multiple years. So I can learn things like Ashley was the most popular name overall in the 1990s.

Adventures with Apps for Office

Microsoft Research published several free apps for data visualization in Excel.  I decided to use a couple of them to visualize the baby name data. You can access these apps on the Insert tab.  Click Apps for Office -> See All.  Then click Find More Apps at the Office Store. One thing to note about these apps is that they require the data to be formatted in a certain way.  See the instructions in the links below for details on this.

apps for office

Streamgraph

A streamgraph shows how a set of numbers has changed over time using their relative area. It’s kind of like an area chart that doesn’t use an absolute Y axis.

steam graph baby names

I do not think this is an effective way to display the data.  Area graphs and streamgraphs both fail when trying to display a large number of values.  In this specific case, you can only pick a color and the shades are determined for you. It’s difficult to distinguish some of the shades from others. The fact that the values for the names can overlap and aren’t a specific shape (like a rectangle where you only have to judge length) make it difficult to interpret quickly.  You lose most of the advantages of pre-attentive processing because you really have to study the graph to get information from it.

Treemap

A treemap is a hierarchical view of data that breaks it into rectangles and uses size and color to communicate information.

tree map

For some reason there is a size limitation on this treemap so I couldn’t make it big enough to where all the names would fit in the boxes. Since the treemap plots the boxes so they fit into the overall rectangle, years are not placed in sequential order.  In my treemap, both the color and size represent the number of count of names for the year. Using a treemap removes the ability for the viewer to see trends over time.  I have to look in the bottom row for 1980 – 1982 and then jump to the top left to find 1985. I also had to summarize to top names by year, losing any geographical/state information.  One thing this does do well is compare 2 – 3 values for a name across years.  For instance, I can compare the shade of green for Jennifer in 1981, 1983, and 1985 to see the decline in popularity of Jennifer over time.

And the winner is…

I think the Power View does the best job of communicating all of the available information from the data set while being visually appealing. We can see geographic relationships and temporal relationships.  It’s easy to read and quickly see which names were top in which states at any year.

You can download my Excel file here if you would like to check it out or try to improve upon it.