My colleague Hope Foley and I both enjoy a good craft beer. She has a great presentation on spatial data in SQL Server, which contains data about breweries. She mentioned she was gathering new data to add to her brewery database and showed me the brewery listing on the Brewery Collectibles Club of America website. This web page presents a challenge because of the way the data is laid out. Rather than having a table with one row per brewery, the web page uses one table per brewery with two rows in each table.
I typically use Microsoft Power Query to scrape data from websites for further analysis. Although this requires a bit of creativity (and some M), I knew Power Query was up to the task. And it gave me a chance to use my favorite Power Query function: Unpivot.
I opened up Excel and established my data source in Power Query. I chose From Web and entered the url: http://www.bcca.com/services/brewery_listing.asp. Once the data source loaded, I could see that there were over 3,000 tables within the page.
Choosing table 0 showed me the columns headings for the overall table on the page.
I could see the headings for the 6 fields, arranged in 2 rows as we see on the webpage. I decided to figure out how to transform this table into the one row I needed and then figure out how to automate my process so I could reproduce the results for each subsequent table. I used the Power Query GUI to transform my table into 1 row with the following steps.
- On the Add Column tab, choose Add Index Column -> From 0.
- Select the Index column. On the Transform Tab, choose Unpivot Columns -> Unpivot Other Columns. This creates a table with 3 columns: Index, Attribute, and Value.
- Select the Index and Attribute columns. On the Home tab, choose Remove Columns -> Remove Columns.
- On the Transform tab, choose Transpose.
This creates the following M code, which you can see by clicking on Advanced Editor on the View tab.
let Source = Web.Page(Web.Contents("http://www.bcca.com/services/brewery_listing.asp")), Data0 = Source{0}[Data], #"Changed Type" = Table.TransformColumnTypes(Data0,{{"Column1", type text}, {"Column2", type text}, {"Column3", type text}}), #"Added Index" = Table.AddIndexColumn(#"Changed Type", "Index", 0, 1), #"Unpivoted Other Columns" = Table.UnpivotOtherColumns(#"Added Index", {"Index"}, "Attribute", "Value"), #"Removed Columns" = Table.RemoveColumns(#"Unpivoted Other Columns",{"Index", "Attribute"}), #"Transposed Table" = Table.Transpose(#"Removed Columns") in #"Transposed Table"
Next I used that query to create a function by changing just two lines.
(gettable) => let Source = Web.Page(Web.Contents("http://www.bcca.com/services/brewery_listing.asp")), Data0 = Source{(gettable)}[Data], #"Changed Type" = Table.TransformColumnTypes(Data0,{{"Column1", type text}, {"Column2", type text}, {"Column3", type text}}), #"Added Index" = Table.AddIndexColumn(#"Changed Type", "Index", 0, 1), #"Unpivoted Other Columns" = Table.UnpivotOtherColumns(#"Added Index", {"Index"}, "Attribute", "Value"), #"Removed Columns" = Table.RemoveColumns(#"Unpivoted Other Columns",{"Index", "Attribute"}), #"Transposed Table" = Table.Transpose(#"Removed Columns") in #"Transposed Table"
I can see the number of tables available in my data source, so I can create a query that invokes the function for the required number of tables, in my case 3744. As a final touch, I used the first row as headers.
let Source = Table.Combine(List.Transform({0..3744}, Query1)), #"First Row as Header" = Table.PromoteHeaders(Source) in #"First Row as Header"
Et voilà!
I now have a single table of data with one row per brewery.
Nice example. I did something similar using an Index column in order to parse some poorly designed XML where each record was on more than one row.