Azure, Azure Data Factory, Azure Data Lake, Biml, Microsoft Technologies

Copying data from On Prem SQL to ADLS with ADF and Biml – Part 2

I showed in my previous post how we generated the datasets for our Azure Data Factory pipelines. In this post, I’ll show the BimlScript for our pipelines. Pipelines define the activities, identify the input and output datasets for those activities, and set an execution schedule. We created several pipelines, each containing copy activities that copy data to Azure Data Lake Store.

We generated one pipeline per schedule and load type:

  • Hourly – Full
  • Hourly – Incremental
  • Daily – Full
  • Daily – Incremental

We also generated some one-time load pipelines for DR/new environment setup.

The first code file below is the template for the pipeline. You can see code nuggets for the data we receive from the generator file and for conditional logic we implemented. The result is one copy activity per source table within the appropriate pipeline.

In the second code file below, lines 104 to 119 generate the pipelines. We read in the necessary data from the Excel file:

  • Schema name
  • Table name
  • Columns list
  • Incremental predicate
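
To make that last item concrete, an incremental predicate is essentially a filter on the table’s change-tracking datetime column that gets appended to the source query. Here’s a made-up example of what one might look like (the column name and window markers are purely illustrative, not values from our actual spreadsheet):

-- Hypothetical incremental predicate stored in the metadata spreadsheet;
-- the load window values get filled in when the pipeline runs
WHERE [LastModifiedDateTime] > '{WindowStart}'
  AND [LastModifiedDateTime] <= '{WindowEnd}'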

Sidenote: We wrote a quick T-SQL statement (not shown) to generate the columns list. This could have been done in our BimlScript, but it was something we changed after the fact to accommodate the limitations of Polybase (Dear Microsoft: Please fix). SQL was quicker and easier for us, but if I were to do this again I would add it to our BimlScript. We needed to replace newlines and double quotes in our data before we could read it in from the data lake. You can get around this issue by using ORC files rather than text-delimited files, but ORC files aren’t human readable, and we felt that readability was important for adoption of the data lake by the client on this project. They were already jumping in with several new technologies, and we didn’t want to add anything else to the stack. So our select statements list out the fields and replace the unwanted characters in the string fields.
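
If you’re curious what that kind of statement looks like, here’s a rough sketch of one way to build the column list from the catalog views. This is my illustrative version rather than the exact query from the project; it wraps string columns in REPLACE() calls that strip carriage returns, line feeds, and double quotes:

-- Illustrative sketch only: build a comma-separated select list per table,
-- wrapping string columns in REPLACE() so the delimited files are Polybase-friendly
SELECT
    s.name AS SchemaName,
    t.name AS TableName,
    STUFF((
        SELECT ', ' +
            CASE
                WHEN ty.name IN ('char', 'nchar', 'varchar', 'nvarchar')
                    THEN 'REPLACE(REPLACE(REPLACE([' + c.name
                         + '], CHAR(13), '' ''), CHAR(10), '' ''), ''"'', '''') AS ['
                         + c.name + ']'
                ELSE '[' + c.name + ']'
            END
        FROM sys.columns c
        JOIN sys.types ty ON ty.user_type_id = c.user_type_id
        WHERE c.object_id = t.object_id
        ORDER BY c.column_id
        FOR XML PATH(''), TYPE).value('.', 'nvarchar(max)'), 1, 2, '') AS ColumnList
FROM sys.tables t
JOIN sys.schemas s ON s.schema_id = t.schema_id
ORDER BY s.name, t.name;

The ColumnList output is the sort of thing that ends up in the columns list column of the spreadsheet.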

Our Excel file looks like this.

[Screenshot: ADF Biml metadata spreadsheet]

Columns B, C, L, and M are populated by Excel formulas. This is the file that is read in by the BimlScript in the code below.

In our generator file (which is the same file that was used to generate the datasets), we use the CallBimlScript function to call the pipeline template file and pass along the required properties (table, schema, frequency, scope, columns list, predicate).

The great thing about Biml is that I can use it as much or as little as I feel is helpful. That T-SQL statement to get column lists could have been Biml, but it didn’t have to be. The client can maintain and enhance these pipelines with or without Biml as they see fit. There is no vendor lock-in here. Just as with Biml-generated SSIS projects, there is no difference between a hand-written ADF solution and a Biml-generated ADF solution, other than the Biml-generated solution is probably more consistent.

And have I mentioned the time savings? There is a reason why Varigence gives out shirts that say “It’s Monday and I’m done for the week.”

We made changes and regenerated our pipelines a few times, which would have taken hours without Biml. With Biml, it was no big deal.

Thanks to Levi for letting me share some of his code, and for working with me on this project!

 

Azure, Azure Data Factory, Biml, Data Warehousing, Microsoft Technologies, Uncategorized

Copying data from On Prem SQL to ADLS with ADF and Biml – Part 1

Apologies for the overly acronym-laden title as I was trying to keep it concise but descriptive. And we all know that adding technologies to your repertoire means adding more acronyms.

My coworker Levi and I are working on a project where we copy data from an on-premises SQL Server 2014 database and land it in Azure Data Lake Store. Then we use Polybase to get the data into Azure SQL Data Warehouse and build a dimensional model. I’ve done a couple of small projects before with Azure Data Factory, but nothing as large as this one. We had 173 tables that we needed to copy to ADLS. Then we needed to set up incremental loads for 95 of those tables going forward.

My Azure Data Factory is made up of the following components:

  • Gateway – allows ADF to retrieve data from an on-premises data source
  • Linked Services – define the connection string and other connection properties for each source and destination
  • Datasets – define a pointer to the data you want to process, sometimes also defining the schema of the input and output data
  • Pipelines – combine the datasets and activities and define an execution schedule

Each of these objects is defined in a JSON file. Defining datasets and copy activities in JSON gets very tedious, especially when you need to do it for 100+ tables. Tedium usually indicates a repeatable pattern, and if there is a repeatable pattern, you can probably automate it. The gateway and linked services are one-time setup activities that weren’t worth automating for this project, but the datasets and pipelines definitely were.

In order to automate the generation of datasets and pipelines, we need a little help with some metadata. We had the client help us fill out an Excel spreadsheet that listed each table in our source database and the following characteristics relevant to the load to Azure:

  • Frequency (daily or hourly)
  • Changes Only (incremental or full load)
  • Changed Time Column (datetime column used for incremental loads)

That list, plus the metadata we retrieved from SQL Server for each table (column names and data types), was all we needed to automate the creation of the ADF datasets and pipelines with BimlScript.
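
For reference, the column metadata piece can be as simple as a query against the catalog views. Here is a minimal sketch (not our exact query) that returns column names and data types for every user table:

-- Minimal sketch: column names and data types for every user table
SELECT
    s.name  AS SchemaName,
    t.name  AS TableName,
    c.name  AS ColumnName,
    ty.name AS DataType,
    c.max_length AS MaxLength,
    c.is_nullable AS IsNullable
FROM sys.tables t
JOIN sys.schemas s ON s.schema_id = t.schema_id
JOIN sys.columns c ON c.object_id = t.object_id
JOIN sys.types ty ON ty.user_type_id = c.user_type_id
ORDER BY s.name, t.name, c.column_id;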

This post will show how we built the data sets. The following post will show the pipelines with the copy activities.

First we need to generate the input datasets coming from SQL Server. We added some properties at the top and embedded some code nuggets to handle the values that are specific to each table.

Next we need the output datasets for Azure Data Lake Store. We use the same three properties in generating each dataset (schema, table, frequency) and add one more for scope.

Now we just need another BimlScript file that calls these two files. We broke our pipelines up into daily versus hourly and incremental versus full loads.

We used a helper code file and a separate environments file, which I’m glossing over so we can focus on the Biml for the ADF assets.  You can see that we read in the inputs from Excel and write some counts to a log file, just to make sure everything is working as intended. Starting on line 41 is where we generate the datasets. On lines 54 and 55, we use the CallBimlScript function to call the two files above. We end up generating datasets for the tables that are a full load each day and their counterpart datasets for the files we create in ADLS. The datasets for daily incremental loads are generated on lines 69 and 70. Then we do the hourly full loads and hourly incremental loads.  I’ll discuss lines 100 – 119 in my next post.

The Results

We were able to write the BimlScript and generate the datasets and pipelines in about 35 hours. A previous ADF project without automation took about 3 hours per source table. If we had gone that route, we could have been looking at 350 – 500 hours to complete this part of the project. Visual Studio with Biml Express took about 5 minutes to generate everything. Deploying to Azure took about an hour. We are now looking into ARM templates for future deployments.

Stay tuned for part 2 where I show how we generated the ADF pipelines.

Conferences, SQL Saturday

My Thoughts on SQL Saturday #596 – Denver BI

I had the pleasure of attending SQL Saturday Denver – BI this past weekend. They even let me help out a bit with registration and other volunteer tasks. This SQL Saturday was an experiment of sorts to prove out Steve Jones’s idea of slimmer SQL Saturdays. We had two tracks and 80 – 100 attendees. Steve would like to see each city be able to do 4 SQL Saturdays a year (which is currently against the rules), but keep them slim.

I think it’s great that Steve and Carlos put together the event for about $650 (and I heard it would have been close to $300, but they decided to do a speaker dinner to use up some sponsor money). This should show other organizers that their event doesn’t have to be big and expensive to be considered successful. Everyone had a good time and learned new things, and the venue was nice. They worked with a local university to get the space for free, which is much easier to do when you only need three rooms and a hallway. The quality of speakers was still quite high (Peter Meyers, Melissa Coates, Steve Wake, and others).

Part of the slimmer SQL Saturday is that they didn’t provide lunch. But our venue was within walking distance of several places, and it was nice to take a walk and get whatever food I wanted.

My Concerns and Things I’m Still Pondering

Here’s what I didn’t love or what I need more time to consider compared to other SQL Saturdays:

  • A lot of SQL family didn’t attend because they weren’t speaking and didn’t want to take up a spot for someone else who might be attending for the first time or needs the learning opportunities. For me, SQL Saturdays are about learning and community. I missed some of my SQL people. Having slimmer SQL Saturdays also means that the range of topics isn’t as broad, and there may be less incentive for more experienced people to attend (outside of the community aspect) if most/all of your topics are beginner level.
  • The little things matter to me. I ended up printing session evaluations so that speakers could get feedback and making sure people knew they could submit feedback online. Would the event have been fine without evals? Yes. But do some speakers very much want feedback from the audience, especially when trying out new sessions? Yes. If they had warned the speakers ahead of time, the speakers could have grabbed a few trusted people and asked them to attend their talk and provide feedback, making this a non-issue. I think whatever you can do to make things run smoothly and give people a good experience is usually worth it. Evals fall into that for me, but I fully acknowledge that they do not make or break the event.
  • There is still some overhead associated with planning even a small event. You still have to secure a venue, choose speakers and set the schedule, market the event, and spend your Saturday running the event. This is fine, sometimes even fun. I have organized 5 SQL Saturdays, and enjoyed it. But it is still time-consuming, and doing 4 of them a year makes me feel tired just thinking about it. If you can assemble a team of volunteers where 2 – 3 people plan and execute each event and you rotate duties, that sounds reasonable to me. Not every city has such a good team, though. We are more than just SQL people and lives get busy with personal or even other professional stuff. This needs to be something that isn’t overly burdensome for any one person in order to make it work.
  • Someone else made the comment “If we do these quarterly, what’s the difference between this and user group meetings? You would spend about 8 hours a quarter during the week attending meetings or 8 hours in one day attending a slimmed down SQL Saturday.” I can understand that thought process. I think of SQL Saturdays as a special once/twice a year thing. I don’t know that smaller/more frequent SQL Saturdays are better or worse than the norm, just different. I imagine that each city would find their own way to differentiate the value of SQL Saturday vs the user group.

My Takeaways

I hope this helps prove that a small event can be a great event. Do not feel like a failure just because your event doesn’t have 350 attendees or because you couldn’t get shirts and expensive gifts for the speakers and volunteers. I will admit that there was a bit of pressure to be bigger and better each year that I organized SQL Saturday KC, but that was almost entirely self-imposed. This was a good experience to help me really understand what is essential versus what is nice to have. When it comes down to it, having a slim but well planned SQL Saturday is better than not having one at all. That being said, if your SQL Saturday is large and well-funded and making people happy, don’t change a thing. Slimmer SQL Saturdays provide alternatives for events with smaller markets and/or smaller sponsorship availability.

I am now a fan of letting people get their own lunches, if your venue is in a location that can support it. Lunch at SQL Saturday KC was always expensive and took several volunteers to set up (taking money, placing orders, having food delivered and set up, accommodating dietary restrictions). And there were always people who felt like it wasn’t worth the $10 or $15 and wanted to complain to us afterward. As an organizer, I like the idea of skipping the headache of lunch and giving people the freedom to go get what they want. Plus it’s nice to take a walk after sitting in sessions all morning.

I love the partnership with the University of Denver, not just the use of their space. Part of the agreement made in getting the space at no cost for SQL Saturday Denver – BI was that there would be sessions that were relevant and accessible for some of the students. Although there are more and more higher education programs focused on BI and data science, I still think the opportunity to get applied learning from the “real world” is valuable for them. I hope to see more SQL Saturdays partner with colleges and universities in the future.

I give this slimmer SQL Saturday two thumbs up and think others should consider it an option. Each event organizer should decide what’s important to them and make it happen. But know that you can have a good event for less than $1000 and minimal time spent planning if that is all you have.

Azure, Microsoft Technologies, Power BI, SSAS

Please Lend Me Your Vote for Documentation of TMSCHEMA DMVs

I spent a good bit of time looking for the definitions/descriptions of the TMSCHEMA DMVs that allow us to view metadata and monitor the health of SSAS 2016 tabular models. As far as I can tell, there are no details about them on any Microsoft site. Many of the columns are obvious, but there are a few fields that show IDs rather than descriptions (e.g., ExplicitDataType in TMSCHEMA_COLUMNS, Type in TMSCHEMA_DATA_SOURCES). It would be great to get the DMVs documented similarly to the MDSCHEMA DMVs, as they are quite useful for tasks like documenting your tabular model. Since the TMSCHEMA DMVs work in Azure Analysis Services as well, I have logged this request on the Azure AS User Voice. Please lend me a vote so we can make this information more easily available.
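
To see what I mean, you can connect to the tabular instance in SSMS, open an MDX query window, and run queries like these (one at a time):

-- ExplicitDataType comes back as a numeric code rather than a data type name
SELECT * FROM $SYSTEM.TMSCHEMA_COLUMNS;

-- Type is also returned as an ID value
SELECT * FROM $SYSTEM.TMSCHEMA_DATA_SOURCES;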

https://feedback.azure.com/forums/556165-azure-analysis-services/suggestions/16981015-document-tmschema-dmvs

Conferences, PASS Summit, Power BI

Update On My PASS Summit Feedback

Back in June, I posted the feedback I received on the abstracts I submitted to PASS Summit 2016. I wasn’t originally selected to speak, but I did have one talk that was selected as an alternate. It turns out that a couple of speakers had to cancel, and I am now speaking at PASS Summit. PASS contacted me two weeks ago to ask if I could fill in. Luckily, my session that was chosen as an alternate was one that I give frequently at SQL Saturdays, so it was feasible to get it ready to present at PASS Summit.

I am excited and honored to be speaking at PASS Summit, but I’m not stressed about it. Part of it is that I’ve only had two weeks to stress, and part of it is that I have quite a bit of experience speaking and I feel confident that I can deliver a decent (if not downright good) session. I’m still updating my presentation and practicing the delivery in preparation for this week, but I feel comfortable with the process of preparing. I’ve spoken at PASS Summit and many other conferences and user groups over the last five years. Over that time I have developed a process for preparing to speak, which includes building and editing the presentation as well as some disaster recovery measures. Once I have checked all the boxes I can feel that I have sufficiently prepared. It doesn’t mean everything will go perfectly, but it helps make sure I haven’t forgotten anything.

There are a couple of things I hope you take away from this.

  1. If you are selected as an alternate, it is a real possibility that you will be added to the schedule. This worked out nicely for me because I was already scheduled to be at PASS Summit and my presentation was already in a fairly finished state. You will definitely never be chosen if you never submit. So if you are interested, throw your hat in the ring next year.
  2. Giving a good presentation (and feeling confident about it) takes practice. If you are just beginning your speaking efforts in the SQL community and you are nervous about it, know that it gets easier (for most people) as you gain more experience. If you enjoy it, keep putting yourself out there. Speaking is something I thoroughly enjoy these days, and I have (mostly) conquered the nerves. Some of that is because I have experienced failures in presentations and the world did not end. Some of that is the result of my presentation prep checklist, which helps me feel prepared.

The downside of the last-minute addition is that I haven’t had as much time to market my session. So if you like Power BI, data visualization, or Kansas City BBQ, and you are attending PASS Summit, I invite you to join me on Friday morning at 8 am in room 3AB. I’m discussing how to apply good data visualization practices in Power BI using data on BBQ restaurants.

Here’s the description for the session I’m presenting:

Overcoming Chartaphobia with Power BI

Do reports in your organization consist mostly of giant tables of data? Perhaps you have gotten as far as adding KPIs or conditional highlighting to the tables. Maybe you have charts, but they are hideous and distracting. Although tables of data presented as reports are commonly found in many organizations, they may be doing you and your users a disservice. We’ll discuss why cognitive psychological studies tell us that graphs and pictures are more effective at communicating trends and comparisons and how to prepare to create good data visualizations. Then we’ll explore how to employ purposeful data viz designs to help users achieve their goal of making informed decisions, using a fun and useful Power BI dashboard. You’ll leave with guidance on how to take boring or unreadable tables of data and turn them into useful and visually appealing reports. 

I hope to see you in my session on Friday!

Microsoft Technologies, SSAS, SSIS

Process Compatibility Level 1200 SSAS Tabular Model from SSIS 2014

A client wanted to upgrade their SSAS model to SSAS 2016 to take advantage of some of the features of the new 1200 compatibility level. But they weren’t yet ready to upgrade their SSIS server from SQL 2014. This presented a problem because they had been using the Analysis Services Processing Task to process their tabular model nightly. This processing task in SSIS 2014 uses the old Analysis Management Objects, which aren’t compatible with the new SSAS tabular models.

Attempting to use the AS Processing Task results in the following error: “[Analysis Services Execute DDL Task] Error: This command cannot be executed on database ‘MySSASDB’ because it has been defined with StorageEngineUsed set to TabularMetadata. For databases in this mode, you must use Tabular APIs to administer the database”

The reason for keeping SSAS processing in an SSIS package was that it kept logging consistent throughout their data refresh process. So we set out to find another solution.

The new SSAS Tabular models use the Tabular Model Scripting Language (TMSL), which is JSON-based, rather than XMLA. A simple process full command in TMSL might look like this:

{
  "refresh": {
    "type": "full",
    "objects": [
      {
        "database": "MySSASDB"
      }
    ]
  }
}

Just pasting the TMSL command to process the model into an Analysis Services Execute DDL Task didn’t work. It returned the error “DDL is not valid”.

I asked around and received a couple of solutions that worked.

Option 1 (my less preferred option): Create an OLE DB connection manager (rather than MSOLAP) and use an Execute SQL Task.

You can set up an OLE DB connection that looks like the screenshot below. Create a new connection manager and choose the OLE DB connection manager type. Change the provider to Microsoft OLE DB Provider for Analysis Services 13.0 and fill in your connection information.

[Screenshot: Microsoft OLE DB Provider for Analysis Services connection manager configuration]

You’ll see the connection manager show up in the Connection Managers pane looking like this:

[Screenshot: the OLE DB connection manager in the Connection Managers pane]

Drag in an Execute SQL task, use the previously defined connection manager and paste in the JSON. This works just fine, but feels a bit too much like a workaround rather than a solid solution to me.

Option 2: Use an Analysis Services Execute DDL task and wrap the JSON in XMLA

Drag in an Analysis Services Execute DDL task. Create a new connection manager by choosing New Analysis Services Connection.

[Screenshot: New Analysis Services Connection dialog]

Edit your connection information, click OK, and you will end up with a connection manager that looks like this:

[Screenshot: the Analysis Services connection manager]

Use that connection manager in the AS Execute DDL task. We can use the same JSON from earlier and wrap it in XMLA as shown below.

<Statement xmlns="urn:schemas-microsoft-com:xml-analysis">
{
  "refresh": {
    "type": "full",
    "objects": [
      {
        "database": "MySSASDB"
      }
    ]
  }
}

</Statement>

The XMLA/JSON command can be a direct source statement or placed in a variable and referenced from the task.

To test that the model is successfully processed, you can execute the SSIS task or package and then run the following query against the DMV.

Select [catalog_name], [date_modified], [compatibility_level] 
from $SYSTEM.DBSCHEMA_CATALOGS where [Catalog_Name] = 'MySSASDB'

So if you are caught between versions in SSIS and SSAS, do not despair. You can still process your new SSAS Tabular model from an SSIS package.

Microsoft Technologies, Power BI, Power Pivot, Power Query, SSAS

Documenting your Tabular or Power BI Model

If you were used to documenting your SSAS model using the MDSchema rowsets, you might have noticed that some of them do not work with the new tabular models. For example, the MDSCHEMA_MEASUREGROUP_DIMENSIONS DMV seems to just return one row per table rather than a list of the relationships between tables.

Not to worry, though. With the new compatibility level 1200 SSAS Tabular models came some new DMVs. They are, unfortunately, not documented at this time (as of 3 Oct 2016). I expect them to show up here at some point in the future. Until then I published a Gist that provides a list of those views as well as some notes on what kind of useful info is in each one. While there are 36 TMSCHEMA DMVs, the main DMVs you will want to check out for model documentation purposes are:

  • TMSCHEMA_COLUMNS
  • TMSCHEMA_MEASURES
  • TMSCHEMA_MODEL
  • TMSCHEMA_RELATIONSHIPS
  • TMSCHEMA_TABLES
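
Each of these can be queried directly against your tabular instance, one DMV per query, using the same SQL-like syntax as the MDSCHEMA DMVs. For example (run each statement separately):

-- All tables in the model
SELECT * FROM $SYSTEM.TMSCHEMA_TABLES;

-- The relationship detail that MDSCHEMA_MEASUREGROUP_DIMENSIONS no longer surfaces
SELECT * FROM $SYSTEM.TMSCHEMA_RELATIONSHIPS;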

There are a few fields that show IDs rather than descriptions (e.g., ExplicitDataType in TMSCHEMA_COLUMNS, Type in TMSCHEMA_DATA_SOURCES). I’ve been able to figure out the descriptions that correspond to some of the values, but I’m still missing a few.

As with the MDSCHEMA DMVs, the TMSCHEMA DMVs can be used to document your model. This also works with Power BI models. You’ll just need to change the connection information to your Power BI instance while the .pbix file is open in Power BI Desktop. For more information on analyzing your Power BI model, see Chris Webb’s post. Since DMVs require DMX queries, you are somewhat limited in your ability to join the data from the DMVs together. This makes it necessary to store the data somewhere before transforming and merging them. Since I needed to document a tabular model for a client, I created a Power Pivot model in Excel 2016. You can download my documentation model here. My example model was connected to a quick demo SSAS model I made based upon the Wide World Importers DW database.

My Power Pivot Model

My Power Pivot model uses Power Query to import the data from the DMVs. This enables me to merge data into a model that I feel makes the most sense and doesn’t suffer from too many relationships and hidden tables (which is what would happen if you just imported the results of each DMV query into the model). In order to make this work for any model, I employed a trick I learned from Chris Webb to avoid duplicating connection information across queries. I built my model with the normal embedded connection strings and then converted them to use the connection info tables. As I did that, I ran into a Power Query limitation, as evidenced by the error message below.

“Formula.Firewall: Query ‘QueryName’ (step ‘StepName’) references other queries or steps and so may not directly access a data source. Please rebuild this data combination.”

Ken Puls has a great explanation of how to get around this issue. Basically, you have to redesign your queries so that you have “staging tables”, which you can then use to build finalized tables. This redesign gets you around the error since the finalized tables don’t have the embedded source connection, just a reference to the results from another query.

This Excel file documents:

  • Database and model info
  • Tables and source queries
  • Columns
  • Hierarchies
  • Measures
  • KPIs
  • Security roles, membership, and permissions
  • Relationships
  • Perspectives and the columns, hierarchies, and measures they contain

The Database Info tab uses cube functions to display the values shown. The rest of the tabs use pivot tables and slicers.

You can download and use this model if you have Excel 2016 or 2013 with Power Query. Just download the file, change the values in the TabularInstanceName and TabularDBName queries, and refresh. I was able to open the file and update the queries with no issues and no changes in Excel 2013.

This is great if you just need documentation or a quick way to look up specific information. But we could really analyze this using visualizations other than tables in Power BI Desktop.

My Power BI Model

I imported the Excel model into Power BI, made a few adjustments, and then added some visualizations. This report is a bit pared down from a visualization standpoint compared to the Excel file because I eliminated some of the data that wasn’t analytically relevant for me.

Click here to view this report in Power BI.

On the first page, I have several summary numbers that tell me about the contents of the model. It gives me a decent idea of the size and complexity of the model.

[Screenshot: summary page of the tabular model documentation report]

The second page uses a bar chart as an interactive filter against a list of table source queries and columns.

The third page shows relationships between the tables. Slicers on the left filter both the force directed graph and the table. Smart filters at the top of the page let you filter the Table and To Table in the relationships.

[Screenshot: relationships page of the tabular model documentation report]

The fourth page shows the details behind each measure. A column chart at the bottom provides a count of measures by table and whether the measure is hidden, which acts as another interactive filter.

The last page shows security by role and table, again using column charts as interactive filters.

You can download the Power BI model here.

Update: Check out the Demo Day video I made on the BlueGranite blog for further explanation and a demo of the Tabular Model documentation.