Azure, Azure SQL DB, Microsoft Technologies, T-SQL

Altering a Computed Column in a Temporal Table in Azure SQL

System-versioned temporal tables were introduced in SQL Server 2016. They provide information about data stored in the table at any point in time by storing an effective-dated version of each row rather than only the data that is correct at the current time.

You can alter a temporal table to add or change columns, but you must first turn off system versioning. Let’s look at an example.

CREATE TABLE [dbo].[DatabaseSize](
	 [DatabaseID] [varchar](200) NOT NULL 
	,[ServerName] [varchar](100) NOT NULL
	,[DatabaseName] [varchar](100) NOT NULL
	,[SizeBytes] [bigint] NULL
	,[SizeMB]  AS ([SizeBytes]/(1048576))
	,[ValidFrom] [datetime2](7) GENERATED ALWAYS AS ROW START NOT NULL
	,[ValidTo] [datetime2](7) GENERATED ALWAYS AS ROW END NOT NULL
	,PERIOD FOR SYSTEM_TIME (ValidFrom, ValidTo)
	,CONSTRAINT PK_DatabaseSize_DatabaseID PRIMARY KEY CLUSTERED (DatabaseID)
) WITH (SYSTEM_VERSIONING = ON (HISTORY_TABLE = [dbo].[DatabaseSizeHistory]));

Temporal tables must have a primary key defined. They also must contain two datetime2 columns, declared as GENERATED ALWAYS AS ROW START / END. The statement above creates both the current table and a history table.

The history table has the same schema as the current table, with one difference: the SizeMB column in the history table is not a computed column.

The dbo.DatabaseSize table is a system-versioned table. The DatabaseSizeHistory table is the related history table. DatabaseSizeHistory contains the same columns as DatabaseSize, except the SizeMB column is not a computed column in the history table.


When I initially created the table, I typoed the formula in the computed column. You can’t alter a computed column; you must drop and recreate it. This is no problem: turn off system versioning, alter your table, and then turn system versioning back on.
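
Here is a minimal sketch of that pattern, assuming the only change needed is the computed column’s formula (the formula below is illustrative rather than my exact correction):

-- Turn off system versioning so the table can be altered
ALTER TABLE [dbo].[DatabaseSize] SET (SYSTEM_VERSIONING = OFF);

-- A computed column can't be altered, so drop it and add it back with the corrected formula
ALTER TABLE [dbo].[DatabaseSize] DROP COLUMN [SizeMB];
ALTER TABLE [dbo].[DatabaseSize] ADD [SizeMB] AS ([SizeBytes]/(1048576));

-- Then turn system versioning back on (the options for that step are discussed below)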

But if you try this without specifying your history table, you will find that it stops using the history table created earlier and makes a new history table.

dbo.DatabaseSize is a system-versioned table. The history table now shows as dbo.MSSQL_TemporalHistoryFor_1909581841

If you specify your history table while turning system versioning back on, you will encounter an error:

Setting SYSTEM_VERSIONING to ON failed because column 'SizeMB' at ordinal 5 in history table 'Test.dbo.DatabaseSizeHistory' has a different name than the column 'ValidFrom' at the same ordinal in table 'Test.dbo.databasesize'.

Temporal tables match the columns between the current table and the history table not only by name and data type but also by ordinal position. Dropping and re-adding the computed column changed its ordinal position because the new column was added to the end of the table.
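
If you want to verify the ordinal positions yourself rather than relying on the error message, a quick query against sys.columns (shown here with the table names from this example) lists the column order in both tables:

SELECT OBJECT_NAME(object_id) AS TableName, column_id, name
FROM sys.columns
WHERE object_id IN (OBJECT_ID('dbo.DatabaseSize'), OBJECT_ID('dbo.DatabaseSizeHistory'))
ORDER BY TableName, column_id;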

You can change the column order of a table in the SQL Server Management Studio UI by right-clicking on the table, selecting Design, and then dragging the column to the correct position. Note that you cannot do this on the system-versioned table while system versioning is on. You can either change the column order on the history table, or turn system versioning off and then change the current table.

Dragging the SizeMB column to the bottom of the column list in the table designer.

Once the column orders match, you can turn system versioning back on and specify the history table.

ALTER TABLE [dbo].[DatabaseSize]
SET (SYSTEM_VERSIONING = ON (HISTORY_TABLE = [dbo].[DatabaseSizeHistory]));

This time, the command should complete successfully. You’ll want to drop any unused auto-created history tables before you’re finished.
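
Once the system-versioned table points back at dbo.DatabaseSizeHistory, the auto-created history table is just an ordinary table, so a simple DROP TABLE (using the generated name from my example) cleans it up:

DROP TABLE [dbo].[MSSQL_TemporalHistoryFor_1909581841];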

Azure, Azure Data Factory, Microsoft Technologies

Control Flow Limitations in Data Factory

Control Flow activities in Data Factory involve orchestration of pipeline activities including chaining activities in a sequence, branching, defining parameters at the pipeline level, and passing arguments while invoking the pipeline. They also include custom-state passing and looping containers.

The activities list in the ADF Author & Manage app, showing Lookup, Set variable, Filter, For Each, Switch, and more.
Control Flow activities in the Data Factory user interface

If you’ve been using Azure Data Factory for a while, you might have hit some limitations that don’t exist in tools like SSIS or Databricks. Knowing these limitations up front can help you design better pipelines, so I’m listing a few here of which you’ll want to be aware.

  1. You cannot nest For Each activities.
    Within a pipeline, you cannot place a For Each activity inside of another For Each activity. If you need to iterate through two datasets, you have two main options. You can combine the two datasets before you iterate over them. Or you can use a parent/child pipeline design where you move the inner For Each activity into the child pipeline (see the sketch after this list). Fun fact: currently the Data Factory UI won’t stop you from nesting For Each activities. You won’t find out it’s invalid until you try to execute the pipeline.
  2. You cannot put a For Each activity or Switch activity inside of an If activity.
    The Data Factory UI will prevent you from doing this by removing the For Each and Switch from the activity list. You can redesign the pipeline to put the inner activity inside a child pipeline. Also note that you can put an If activity inside of a For Each activity.
  3. You cannot use a Set Variable activity inside a For Each activity that runs in parallel.
    The Data Factory UI won’t stop you, but you’ll quickly learn that the scope of the variable is the pipeline and not the For Each or any other activity. So you’ll just overwrite the value in no particular order as the activities execute in parallel. The workaround for this is specific to your use case. You might try using an existing attribute of the item you are iterating on instead of setting a variable. Append Variable works fine, since each loop could add a value. But again, don’t count on the order being meaningful.
  4. You cannot nest If activities.
    The Data Factory UI will prevent you from nesting the If activities. If you need to have two sets of conditions, you can either combine conditions or move the inner condition to a child pipeline.
  5. You cannot nest Switch activities.
    Similar to the If activity, the Data Factory UI will prevent you from nesting Switch activities. And again, you can either combine conditions or move the inner condition to a child pipeline.
  6. You cannot put a For Each or If activity inside a Switch activity.
    The Data Factory UI will prevent you from doing this. You can move the inner activity to a child pipeline if needed.
  7. You cannot use an expression to populate the pipeline in an Execute Pipeline activity.
    It would be great to design a truly dynamic pipeline where you could have a dataset that defines which pipelines to execute, but you can’t do that natively in the Data Factory UI. The Invoked Pipeline property doesn’t allow dynamic expressions. If you need to dynamically execute pipelines, you can use Logic Apps or Azure Functions to execute the pipeline.
  8. You cannot dynamically populate the variable name in Set Variable and Append Variable activities.
    The Data Factory UI only allows you to choose from a list of existing variables. As a workaround, you could use an If activity to determine which variable you will populate.
  9. The Lookup activity has a maximum of 5,000 rows and a maximum size of 4 MB.
    If you need to iterate over more than 5,000 rows or retrieve more than 4 MB of data, you’ll need to split your list between a parent and child pipeline.
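
To illustrate the parent/child pattern mentioned in item 1, here is a rough sketch of pipeline JSON for a parent pipeline that loops over an array parameter and calls a child pipeline for each item. The pipeline, activity, and parameter names are hypothetical, and the JSON the Data Factory UI generates for you may differ slightly.

{
    "name": "ParentPipeline",
    "properties": {
        "parameters": {
            "OuterItems": { "type": "Array" }
        },
        "activities": [
            {
                "name": "ForEachOuterItem",
                "type": "ForEach",
                "typeProperties": {
                    "items": { "value": "@pipeline().parameters.OuterItems", "type": "Expression" },
                    "activities": [
                        {
                            "name": "RunChildPipeline",
                            "type": "ExecutePipeline",
                            "typeProperties": {
                                "pipeline": { "referenceName": "ChildPipeline", "type": "PipelineReference" },
                                "waitOnCompletion": true,
                                "parameters": { "OuterItem": "@item()" }
                            }
                        }
                    ]
                }
            }
        ]
    }
}

The child pipeline would define its own OuterItem parameter and perform the inner iteration or work against it, which keeps the nested loop out of the parent pipeline.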

In addition to the items mentioned above, also note these resource limits listed in Microsoft Docs. Limits like 40 activities per pipeline (including inner activities for containers) can bite you if you aren’t careful about implementing a modular design. And if you do have a modular design with lots of pipelines calling other pipelines, be aware that you are limited to 100 queued runs per pipeline and 1,000 concurrent pipeline activity runs per subscription per Azure Integration Runtime region. I don’t hit these limits too often, but I have hit them.

This is not to say you can’t create good solutions in Azure Data Factory—you absolutely can. But Data Factory has some limitations that you might not expect if you have experience working with other data integration/orchestration tools.

Have you hit any other limits that caused you to design your pipelines differently and that you would like to share with others? If so, leave me a comment.

Azure, Azure Data Lake, Microsoft Technologies

Initial Thoughts on Dremio

I’ve been working on a project for the last few months with a client who has chosen to implement Dremio in Azure. Dremio is a data lake engine that creates a semantic layer and supports interactive queries.

Dremio logo
The Dremio logo

It uses Apache Arrow, Gandiva, and Parquet files under the hood. It runs on either Linux VMs or Kubernetes containers. Like most big data systems, there is at least one coordinator node and one or more executor nodes. These nodes communicate and are managed using Apache Zookeeper. Client applications connect to Dremio via ODBC, JDBC, REST APIs, or Arrow Flight. Dremio can read from storage accounts, external databases, and a few other sources.

Dremio stores data in the following places:

  • Metadata is stored in a RocksDB database on the coordinator node.
  • Frequently read data is cached on the executor node.
  • Memory-intensive query operations may cause an executor node to spill Arrow buffers from RAM to disk.
  • Reflections, user uploads, and query results are stored in the data lake.

Dremio is organized into spaces, which can contain folders and datasets. The key objects in Dremio are:

  • Data source – connection strings to data that should be accessed via Dremio
  • Physical Dataset – an HDFS directory or a database table
  • Virtual Dataset – a view of sorts, created using the Dremio UI or by writing SQL, that references one or more physical or virtual datasets and also provides lineage to its sources
  • Reflection – a materialized view that is transparent to users and is used to improve query performance, which seems to be implemented as Dremio querying data from the source and storing it as a parquet file for quicker access.
  • Space – a shared location for virtual datasets, a way to group related datasets and provide user access

Once you have your spaces and virtual datasets set up, it feels kind of like a database. If you connect with Power BI, virtual datasets appear as views and physical datasets appear as tables. Dremio metadata (catalogs, schemas, physical datasets, virtual datasets and columns) can be accessed using INFORMATION_SCHEMA queries, which is conveniently familiar if you are used to working with SQL Server.
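
For example, a query along these lines lists the columns of the datasets in a space (this is just a sketch: MySpace is a hypothetical space name, and you should check Dremio's documentation for the exact INFORMATION_SCHEMA objects it exposes):

SELECT TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME, DATA_TYPE
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA LIKE 'MySpace%'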

Some nice features found in Dremio on Azure

  • Dremio allows Single Sign-On with AAD credentials. Permissions can be granted to individual users or AAD groups.
  • Dremio can be implemented in a virtual network in Azure. The executor nodes can use Private Link to access ADLS (Azure Data Lake Storage Gen 2) over a private endpoint.
  • Changes to virtual datasets are tracked in Dremio. It’s easy to revert to a previous version at any time.
  • Dremio gives you visibility to the jobs running queries, both for ad hoc queries from client tools and for refreshing reflections.
  • Administrators can create rules to assign queries to different queues in order to provide workload isolation and predictability for users.
  • When reviewing jobs, you can see a sort of query plan as well as which jobs were able to use a reflection to accelerate a query.
  • The lineage view for a virtual dataset is nice for understanding dependencies.
  • You can trigger refreshes of metadata or reflections via the REST API, which is handy if you have ETL processes adding new data to your data lake and you want the refreshes to occur at the end of the ETL process.

Some rough edges on Dremio in Azure

  • Dremio was initially built for AWS, not Azure. This is evident in the training materials, the product roadmap, and the knowledge of the Dremio implementation specialists. This is not to say it doesn’t work on Azure, just that the implementation is a bit rougher (e.g., no Azure templates made for you), and a couple of features are unavailable.
  • Dremio doesn’t integrate with Azure Key Vault. You store the service principal secret or storage account access key in a configuration file on the Linux VM. I’ve been told this is on the roadmap, but I didn’t hear a date when it would be available.
  • You can enable integration points on the Dremio website where you can click a button to open a connection to a virtual dataset in a BI tool such as Power BI or Tableau. For Power BI, this downloads a PBIDS file with a connection to that specific virtual dataset. This would be fine if everything you need is in this one dataset, but if you need to reference multiple virtual datasets, this is a bit annoying. Think of it like connecting to a specific database table instead of to the database in general. You might want to use that table, but you might also want to find other useful tables to combine in your Power BI model. You can open Power BI and connect to Dremio in general and navigate from there with no problems. I’m just pointing out that the buttons in the UI don’t seem that useful.
  • Dremio doesn’t support passthrough authentication on ADLS. All queries to the data lake are made in the context of the Dremio application, not the individual user. This means that you may need to set permissions twice for your data lake if you have other tools directly accessing the data lake instead of using Dremio. The idea is that most tools will connect through Dremio to take advantage of the semantic layer. But it would be nice to have, just to simplify security.

Advice we received in training

  • Unlike with nesting views in SQL Server, it’s ok to create multiple layers of virtual datasets. You want to design the semantic layer (the virtual datasets) to reuse common logic instead of repeating it across multiple views.
  • The standard design pattern for the semantic layer is to have a layer of “staging views” that have a 1-to-1 mapping to physical datasets and very little transformation outside of fixing data types and light cleansing. On top of the Staging layer is the Business layer, which includes virtual datasets containing business logic. The Business layer should handle most of the query workload. On top of the Business layer is the Application Layer. This includes virtual datasets that are purpose-built to support specific applications or reports.
  • Star schemas are not optimal in Dremio. You likely want to denormalize even more than that. This is because it is more expensive to perform a join than to search through a large number of values in a column.
  • When creating a reflection, setting the sort column is somewhat like creating an index in a SQL database. It helps prune data when applying a query filter or performing a join.
  • Reflections can be used to partition data. If you find you have a single large file, you can use a Reflection to split it by a low cardinality value to improve query performance. When you do this, it creates a parquet file per partition.
  • Reflections can be set to use an incremental refresh, but only if the data is additive and existing data is not updated.
  • You don’t need a reflection for everything. Make them as small and reusable as possible.
  • Try to avoid thousands of tiny files, and aim for a few medium to large files (MBs to GBs). This is common for most data lake engines as there is an overhead cost for file enumeration.

Some other thoughts

  • Dremio advertises that you don’t need data integration processes like you would for a data warehouse. I find this to be somewhat inaccurate for two main reasons. First, if you need to acquire data from APIs or other applications to which Dremio can’t connect, you will still need to copy data to your data lake. Second, when you use a Reflection to speed up a query, you are creating a copy of the data in your data lake stored as one or more Parquet files. Data virtualization technology hasn’t actually matured to the point of not needing ETL at all. I can see how Dremio would lessen the need for ETL, but let’s recognize that you’ll probably still need some and that Dremio is doing a bit of data loading of its own. So the question becomes where — and with what tools — you would like to do this. You can have Dremio do your transforming and loading in the form of reflections, or you can load your own data already transformed to the data lake. You will likely end up with a bit of both over time.
  • Consider the skillsets of the people who will manage the system, as well as those who will build and query the datasets. If you have a team of admins who only know Windows, they are going to need to skill up on Linux. If your BI team or analysts don’t know SQL, they will probably struggle to build the virtual datasets.
  • This system can get pretty expensive pretty fast (which is true of most big data systems). You’ll want to be sure to automate the shutdown of the nodes in dev and test environments when they are not in use, so you can save a bit of money. And remember that you can size up your nodes later if you find you don’t have adequate performance. Oversizing at the outset will waste money.
  • Dremio is a (well-funded) startup with a product that is built on several open source technologies, and they don’t seem to have a public roadmap. In my experience, they have been good about taking feedback to add to the roadmap and with sharing what is soon to be released. But if you are building your company’s BI strategy with Dremio as a key tool, you probably want more than that. It sounds like they share more with paying customers. I would want that information before making a purchasing decision.
  • Overall, I can see why Dremio has been adopted by several large companies. And I have enjoyed setting up the Azure architecture around it and building virtual datasets. I wish they would add some Azure-specific features to optimize things and make security easy, but it’s a promising platform.

More Information about Dremio

If Dremio sounds interesting to you, here are a few helpful links

This was my first project using Dremio. If you’ve used Dremio, please share your experience in the comments.

Azure, Azure Data Factory, Microsoft Technologies

Azure Data Factory Activity Failures and Pipeline Outcomes

Question: When an activity in a Data Factory pipeline fails, does the entire pipeline fail?
Answer: It depends

In Azure Data Factory, a pipeline is a logical grouping of activities that together perform a task. It is the unit of execution – you schedule and execute a pipeline. Activities in a pipeline define actions to perform on your data. Activities can be categorized as data movement, data transformation, or control activities.

In many instances, when an activity fails during a pipeline run, the pipeline run will report failure as well. But this is not always the case.

There are two main scenarios where an activity would report failure, but the pipeline would report success:

  • The maximum number of retry attempts is greater than 0, and the initial activity execution fails but the second attempt succeeds
  • The failed activity has a failure path or a completion path to a subsequent activity and no success path

Retry Attempts

In the General settings of any activity is a property called Retry. This is the number of times Data Factory can try to execute the activity again if the initial execution fails. The default number of retries is 0. If we execute a pipeline containing one activity with the default Retry setting, the failure of the activity would cause the pipeline to fail.

Data Factory Web UI  showing the General settings of an activity with the Retry property
Data Factory Activity General settings showing the Retry Property
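
In the underlying pipeline JSON, the Retry setting (along with the timeout and the secure input/output options mentioned later in this post) lives in the activity's policy block. A sketch with one retry and a 30-second retry interval (the values here are illustrative):

"policy": {
    "timeout": "0.12:00:00",
    "retry": 1,
    "retryIntervalInSeconds": 30,
    "secureOutput": false,
    "secureInput": false
}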

I often set retries to a non-zero number in copy activities, lookups, and data flows in case there are transient issues that would cause a failure that might not be present if we waited 30 seconds and tried the activity again.

Data Factory Monitoring activity runs within a pipeline. An activity failed the first time, was rerun, and succeeded the second time
Output of a Data Factory activity that was executed and initially failed. Since it was set to have 1 retry, it executed again and succeeded. If nothing else in the pipeline failed, the pipeline would report success.

Dependency with a Failure Condition

Activities are linked together via dependencies. A dependency has a condition of one of the following: Succeeded, Failed, Skipped, or Completed. If we have a pipeline containing Activity1 and Activity2, and Activity2 has a success dependency on Activity1, it will only execute if Activity1 is successful. In this scenario, if Activity1 fails, the pipeline will fail.

Activity1 has a success path to Activity2. Activity1 failed so Activity2 did not execute.
Because Activity1 failed, Activity2 is not executed and the pipeline fails.

But if we have a pipeline with two activities where Activity2 has a failure dependency on Activity1, the pipeline will not fail just because Activity1 failed. If Activity1 fails and Activity2 succeeds, the pipeline will succeed. This scenario is treated as a try-catch block by Data Factory.

Activity1 has a failure path to Activity2. Activity1 failed and Activity2 succeeded.
The failure dependency means this pipeline reports success.
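
In the pipeline JSON, that failure dependency appears in the dependsOn property of Activity2. A minimal sketch, using a Wait activity as a stand-in for whatever Activity2 actually does:

{
    "name": "Activity2",
    "type": "Wait",
    "dependsOn": [
        { "activity": "Activity1", "dependencyConditions": [ "Failed" ] }
    ],
    "typeProperties": { "waitTimeInSeconds": 1 }
}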

Now let’s say we have a pipeline with 3 activities, where Activity1 has a success path to Activity2 and a failure path to Activity3. If Activity1 fails and Activity3 succeeds, the pipeline will fail. The presence of the success path alongside the failure path changes the outcome reported by the pipeline, even though the activity executions from the pipeline are the same as the previous scenario.

Activity1 has a success path to Activity2 and a failure path to Activity3. Activity1 failed, Activity2 was skipped, and Activity3 succeeded.
Activity1 fails, Activity2 is skipped, and Activity3 succeeds. The pipeline reports failure.

What This Means for Monitoring

This difference between pipeline and activity status has a few implications of which we should be aware as we monitor our data factories.

If we are using Azure Monitor alerts, we need to understand that setting an alert for pipeline failures doesn’t catch all activity failures. If there is a retry of an activity and the second attempt is successful, there would be an activity failure but no pipeline failure.

Conversely, if we set an alert to notify us of activity failures, and we have a pipeline designed with the try-catch pattern, we might get an alert about an activity failure, but the pipeline would still show success. You would need to look at the status of the activities within the pipeline execution to see the failure of which you were alerted.

For many of my implementations, just setting an alert to notify me when any activity failure occurs is fine. For others, I really only care if the pipeline fails. Sometimes I need to set more specific alerts where I choose only certain activities to monitor for failure.

You could also use the Data Factory SDK to roll your own monitoring solution. If you write PowerShell, C#, or Python, you can retrieve the status of any pipeline or activity run and take subsequent actions based upon the results.

What This Means for Pipeline Design

You may need to add activities to your pipelines to support your monitoring scenarios if you need something more customized than what is offered from Azure Monitor and don’t want to use the SDK.

If you have notification needs that Azure Monitor can’t accommodate, you could add an activity in your pipelines to send an email based upon your desired activity outcomes. You can cause that activity to execute using an activity dependency alone, or by combining it with a variable and an If Condition activity.

There are times when we may need a pipeline to fail even though we are using the try-catch pattern that results in pipeline success. In that case, I add an additional web activity to the end of my pipeline failure path that hits an invalid URL like http://throwanerror. The failure of this activity will cause the pipeline to fail. Keep monitoring and notifications in mind as you design your pipelines so you are alerted as appropriate.
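
As a rough sketch of that web activity (the activity names here are hypothetical, and the JSON generated by the Data Factory UI may differ slightly):

{
    "name": "Force Pipeline Failure",
    "type": "WebActivity",
    "dependsOn": [
        { "activity": "Notify Of Failure", "dependencyConditions": [ "Succeeded" ] }
    ],
    "typeProperties": {
        "url": "http://throwanerror",
        "method": "GET"
    }
}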

Azure Data Factory Activity and Pipeline Outcomes

To help clarify these concepts I made the below guide to Data Factory activity and pipeline outcomes. Feel free to share it with others. You can download it directly from this link. A text version that should be friendlier for screen readers can be found on this page.

Accessibility, Data Visualization, Microsoft Technologies, Power BI

Zooming In on a Power BI Report

Have you ever tried to use your browser to zoom in on a visual in a Power BI report? If you simply published your report and then zoomed in, you might have experienced something like the video below.

Trying to zoom in on a report that is set to Fit to page can be confusing for users.

With the default settings of the report, when you zoom in, only the menus around the report change. This is because of report responsiveness and the View setting. By default, reports are set to Fit to page. Power BI is refitting the report to the page every time you zoom.

Why would we need to zoom in?

There might be accessibility or compliance reasons to allow people to zoom in. For instance, WCAG 2.1 Success Criterion 1.4.4 states “Except for captions and images of text, text can be resized without assistive technology up to 200 percent without loss of content or functionality.” People with low vision or other vision impairments might benefit from the ability to zoom within a report page.

Another reason might be that a user simply wants to focus on one chart at a time. Power BI does have a Focus mode. Unfortunately, it currently does a poor job of increasing the font sizes on the visual that is in focus, often rendering it unhelpful.

Column chart shown in Focus Mode in Power BI with large bars and tiny text
Power BI visual shown in Focus Mode

Edit: A helpful commenter pointed out that you can zoom in and out while in Focus mode. This works pretty well on many (but not all) visuals.

What Are Our Other Options?

There are a few workarounds for users who need to zoom in on visuals.

  1. We can set the report view or teach users to set the report view to Actual size. This then allows the browser zoom to work as anticipated. We probably don’t want to set all our reports to actual size because we would lose valuable screen real estate and diminish the experience for some users who don’t need to zoom. Having the report automatically fit to the user’s screen is usually helpful. But if users can change that setting as they need to, that might be ok. Here’s an example of how that works.
Setting the view on the Power BI report to Actual size allows users to zoom with the browser

  2. We can use assistive technology to zoom. Both Windows and MacOS have built-in magnifier functionality. The downside to this is that using it would not satisfy WCAG 2.1 Success Criterion 1.4.4. I think there is still some gray area/lack of expertise as far as how people are making data visualizations WCAG compliant because it’s part text and part image/shape (although it’s not rendered on the page as an image in Power BI). I’m usually more concerned that users get the information they need and have a good experience. But I want to note this in case you are trying to be WCAG compliant and might run into this issue. Here’s an example of using the magnifier in Windows. You can still use the interactivity in the report. And you can change the size of the magnification window and the level of magnification.

The Windows Magnifier allows users to zoom in to part of the report page while retaining interactivity

  3. Zooming in on the report page with a touch screen works fine. If users have tablets or laptops with a touch screen, they can use their fingers to zoom and it will behave as expected. Here’s a video that shows that experience.

Those are all the workarounds I’m aware of, but I’m interested to hear how you have worked around this issue. If you have other suggestions please leave them in the comments.

I found an existing idea about increasing the text size within visuals in focus mode on Ideas.PowerBI.com. I’ve added my vote to it, and I hope you’ll do the same.

Azure, Azure Data Lake, Microsoft Technologies, Power BI

Granting ADLS Gen2 Access for Power BI Users via ACLs

It’s common that users only have access to certain folders in an Azure Data Lake Storage container. These permissions are provided not through Azure RBAC (role-based access control) roles but through POSIX-like ACLs (access control lists).

The current Power BI documentation mentions only Azure RBAC roles, but it is possible to connect to a folder with permissions granted through ACLs.

You can manage ACLs through the Azure Storage Explorer application or in the Storage Explorer preview in the Azure Portal. As an example, I have a storage account with the hierarchical namespace enabled. In the container named filesystem1 is a folder called Test. Test contains 3 files, and I want a user to import Categories.csv into Power BI.

Azure Storage Explorer showing the mmldl storage account with filesystem1 selected. The Test folder in filesystem1 is selected and 3 files are shown.
Data lake storage account with files located in a folder called Test

If I select the Test folder and then select Manage Access, I can see that an AAD user named Data Lake User has been granted access and default ACLs. Note that the user needs at least Read and Execute. Write isn’t necessary if they don’t need to change the file.

The Manage Access window in Azure Storage Explorer. The user named Data Lake User is selected. Access and Default permissions are set to give the user Read, Write, and Execute.
Managing access on the Test folder for the Data Lake User

But with those permissions on the Test folder, I’m not able to connect to it from Power BI Desktop. If I try, I’ll get an error that says “Access to the resource is forbidden.”

Power BI error that says "Unable to connect. We encountered an error while trying to connect. Details: Access to the resource is forbidden."
Power BI error encountered when a user doesn’t have sufficient permissions to access a file in the data lake

This is because the user is missing some permissions. We need to grant Execute permissions on all parent folders up to the root (the container).

In this case, there is only one level above my Test folder. So I select the filesystem1 container, go to Manage Access, and grant the Data Lake User Execute permissions on it.

Manage Access window in Azure Storage Explorer showing permissions for Data Lake user on filesystem1. Execute is selected for both Access and Default permissions.
Adding Execute permissions to the parent container

Note that changing the Default ACL on a parent does not affect the access ACL or default ACL of child items that already exist. So if you have existing subfolders and files to which users need access, you will need to grant access at each parent level because the default ACLs won’t apply.

Thanks to Gerhard Brueckl for noting that I needed Execute permissions on parent folders when I got stuck in testing.

If you find yourself hitting that access forbidden message in Power BI when accessing a file in ADLS Gen2, double check the user’s Execute permissions on the parent folders.

Data Visualization, Microsoft Technologies

One Chart at a Time Video Series

Jon Schwabish over at PolicyViz has created a great initiative called the One Chart at a Time Video Series. It’s an effort to expand readers’ graphic literacy through short videos explaining how to read and use different charts. Each video is from a different person in the data visualization industry. Participants include people I admire such as Andy Kirk, Ben Jones, and Cole Nussbaumer Knaflic. Jon releases a new video each weekday. The initiative started January 11 and will continue through mid-March.

In each video, the presenter answers 3 questions:

  1. Can you please describe the graph?
  2. Can you please describe any considerations chart makers need to take into account when creating this type of chart?
  3. Can you please share an example of this chart you really like?
One Chart at a Time: Waterfall Charts with Meagan Longoria

I was honored to be asked to participate and talk about waterfall charts. My video (episode 13) has been published! You can check it out below, but I highly recommend you check out all the previous episodes and stay tuned for future episodes.

Links to examples

My video contains a few examples of waterfall charts, and I want to make sure to give credit to these designers for their work here, in addition to noting them in my video.

I hope you watch and enjoy the One Chart at a Time Video Series.

Azure, Azure Data Factory, KQL, Microsoft Technologies

Retrieving Log Analytics Data with Data Factory

I’ve been working on a project where I use Azure Data Factory to retrieve data from the Azure Log Analytics API. The query language used by Log Analytics is Kusto Query Language (KQL). If you know T-SQL, a lot of the concepts translate to KQL. Here’s an example T-SQL query and what it might look like in KQL.

--T-SQL: 
SELECT * FROM dbo.AzureDiagnostics 
WHERE TimeGenerated BETWEEN '2020-12-15' AND '2020-12-16'
AND database_name_s = 'mydatabasename'
//KQL: 
AzureDiagnostics 
| where TimeGenerated between(datetime('2020-12-15') .. datetime('2020-12-16')) 
| where database_name_s == 'mydatabasename'

For this project, we have several Azure SQL Databases configured to send logs and metrics to a Log Analytics workspace. You can execute KQL queries against the workspace in the Log Analytics user interface in the Azure Portal, a notebook in Azure Data Studio, or directly through the API. The resulting format of the data downloaded from the API leaves something to be desired (it’s like someone shoved a CSV inside a JSON document), but it’s usable after a bit of parsing based upon column position. Just be sure your KQL query actually states the columns and their order (this can be done using the Project operator).
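
For example, adding a project operator to the KQL query from the beginning of this post pins down the columns and their order (the extra columns here are just examples of AzureDiagnostics columns you might keep):

AzureDiagnostics 
| where TimeGenerated between(datetime('2020-12-15') .. datetime('2020-12-16')) 
| where database_name_s == 'mydatabasename'
| project TimeGenerated, Resource, Category, OperationName, database_name_s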

You can use an Azure Data Factory copy activity to retrieve the results of a KQL query and land them in an Azure Storage account. You must first execute a web activity to get a bearer token, which gives you the authorization to execute the query.

Data Factory pipeline containing a web activity to get a bearer token and a copy activity to copy data from the Log Analytics API.
Data Factory pipeline that retrieves data from the Log Analytics API.

I had to create an app registration in Azure Active Directory for the web activity to get the bearer token. The web activity should perform a POST to the following url (with your domain populated and without the quotes): "https://login.microsoftonline.com/[your domain]/oauth2/token"

Make sure you have added the appropriate header of Content-Type: application/x-www-form-urlencoded. The body should contain your service principal information and identify the resource as "resource=https://api.loganalytics.io". For more information about this step, see the API documentation.
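
The body of that POST is a standard client credentials request. Roughly, with placeholders for your service principal values:

grant_type=client_credentials&client_id=[application (client) ID]&client_secret=[client secret]&resource=https://api.loganalytics.io

The access_token property of the response is what gets referenced later in the copy activity’s Authorization header.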

Data Factory Copy Activity

The source of the copy activity uses the REST connector. The base url is set to "https://api.loganalytics.io/v1/workspaces/[workspace ID]/" (with your workspace ID populated and without the quotes). Authentication is set to Anonymous. Below is my source dataset for the copy activity. Notice that the relative url is set to “query”.

Connection properties of a dataset in Azure Data Factory. The base url points to https://api.loganalytics.io/v1/workspaces/[workspaceid] with the workspace ID not shown. The relative url contains the string "query".
ADF Dataset referencing a REST linked service pointing to the Log Analytics API

The Source properties of the copy activity should reference this REST dataset. The request method should be POST, and the KQL query should be placed in the request body (more on this below).

Two additional headers need to be added in the Source properties.

Additional Headers section of a Data Factory copy activity. Two headers are shown. 1) content-type: application/json; charset=utf-8 2) Authorization: @concat('Bearer ', activity('Get Bearer Token').output.access_token)
Additional headers in the Source properties of the ADF copy activity

The Authorization header should pass a string formatted as “Bearer [Auth Token]” (with a space between the string “Bearer” and the token). The example above retrieves the token from the web activity that executes before the copy activity in the pipeline. Make sure you are securing your inputs and outputs so your secrets and tokens are not being logged in Data Factory. This option is currently found on the General properties of each activity.

Embedding a KQL Query in the Copy Activity

You must pass the KQL query to the API as a JSON string. But this string is already inside the JSON created by Data Factory. Data Factory is a bit picky in how you enter the query. Here is an example of how to populate the request body in the copy activity.

{
"query": "AzureDiagnostics | where TimeGenerated between(datetime('2020-12-15') .. datetime('2020-12-16')) | where database_name_s == 'mydatabasename'" 
}

Note that the curly braces are on separate lines, but the query must be on one line. So where I had my query spread across 3 lines in the Log Analytics user interface as shown at the beginning of this post, I have to delete the line breaks for the query to work in Data Factory.

The other thing to note is that I am using single quotes to contain string literals. KQL supports either single or double quotes to encode string literals. But using double quotes in your KQL and then putting that inside the double quotes in the request body in ADF leads to errors and frustration (ask me how I know). So make it easy on yourself and use single quotes for any string literals in your KQL query.

In my project, we were looping through multiple databases for customized time frames, so my request body is dynamically populated. Below is a request body similar to what I use for my copy activity that retrieves Azure Metrics such as CPU percent and data storage percent. The values come from a lookup activity. In this case, the SQL stored procedure that is executed by the lookup puts the single quotes around the database name so it is returned as ‘mydatabasename’.

{
"query": "AzureMetrics | where TimeGenerated between (datetime(@{item().TimeStart}) .. datetime(@{item().TimeEnd})) | where Resource == @{item().DatabaseName} | project SourceSystem , TimeGenerated , Resource, ResourceGroup , ResourceProvider , SubscriptionId , MetricName , Total , Count , Maximum , Minimum , TimeGrain , UnitName , Type, ResourceId"
}

With dynamically populated queries like the above, string interpolation is your friend. Paul Andrew’s post on variable string interpolation in a REST API body helped me understand this and get my API request to produce the required results.

You can do similar things with Data Factory to query the Application Insights API. In fact, this blog post on the subject helped me figure out how to get the Log Analytics data I needed.

Be Aware of API Limits

There are limits to the frequency and amount of data you can pull from the Log Analytics API. As noted in the API documentation:

  • Queries cannot return more than 500,000 rows
  • Queries cannot return more than 64,000,000 bytes (~61 MiB total data)
  • Queries cannot run longer than 10 minutes (3 minutes by default)

If there is a risk that you may hit the limit on rows or bytes, you need to be aware that the Log Analytics API does not return an error in this case. It will return the results up to the limit and then note the “partial query failure” in the result set. As far as I can tell, there is no option for pagination, so you will need to adjust your query to keep it under the limits. My current process uses a Get Metadata activity after the copy activity to check file sizes for anything close to the limit and then breaks that query into smaller chunks and re-executes it.

It’s All in the Details

I had a lot of trial and error as I worked my way through populating the request body in the API call and dealing with API limits. I hope this helps you avoid some of the pitfalls.

Microsoft Technologies, Power BI, Workout Wednesday

Workout Wednesdays for Power BI in 2021

I’m excited to announce that something new is coming to the Power BI community in 2021: Workout Wednesday!

Workout Wednesday logo

Workout Wednesday started in the Tableau community and is expanding to Power BI in the coming year. Workout Wednesdays present challenges to recreate a data-driven visualization as closely as possible. They are designed to help you improve your skills in Power BI and Tableau.

How You Can Participate

Watch for the Power BI challenge to be published on Wednesdays in 2021. The challenge will contain requirements and a dataset. Use the dataset to create the desired end result.

Then share your workout! You can post your workout to the Data Stories Gallery or your blog, or just share a public link. If you aren’t able to share a public link – perhaps because that option is disabled in your Power BI tenant or you don’t have a Power BI tenant – a gif, a video, or even some screenshots are just fine.

To formally participate: Post to Twitter using both the #WOW2021 and #PowerBI hashtags along with a link/image/video of your workout. Include a link to the challenge on the Workout Wednesday site. And please note the week number in your description, if possible.

Community Growth

I’m looking forward to Workout Wednesdays for a couple of reasons. First, I think Power BI needs more love in the data visualization department. We need to be talking about effective visualization techniques and mature past ugly pie charts and tacky backgrounds. And I think Workout Wednesdays will help us individually grow those skills, but it will also foster more communication and sharing of ideas around data visualization in Power BI. That in turn will lead to more product enhancement ideas and conversations with the Power BI team, resulting in a better product and a stronger community.

Second, I’m also excited to see the cross-pollination and cross-platform learning we will achieve by coming together as a data visualization community that isn’t focused on one single tool. There is a lot Tableau practitioners and Power BI practitioners can learn from each other.

Join Me In January

Keep an eye out on Twitter and the Workout Wednesday website for the first challenge coming January 6. While it would be great if you did the workout for every single week, don’t be concerned if you can’t participate every week. A solution will be posted about a week later, but nothing says you can’t go back and do workouts from previous weeks as your schedule allows.

I look forward to seeing all of your lovely Workout Wednesday solutions next year!

Accessibility, Conferences, Microsoft Technologies

Captioning Options for Your Online Conference

Many conferences have moved online this year due to the pandemic, and many attendees are expecting captions on videos (both live and recorded) to help them understand the content. Captions can help people who are hard of hearing, but they also help people who are trying to watch presentations in noisy environments and those who lack good audio setups as they are watching sessions. Conferences arguably should have been providing live captions for the in-person events they previously held. But since captions are finally becoming a wider topic of concern, I want to discuss how captions work and what to look for when choosing how to caption content for an online conference.

There was a lot of information that I wanted to share about captions, and I wanted it to be available in one place. If you don’t have the time or desire to read this post, there is a summary at the bottom.

Note: I’m not a professional accessibility specialist. I am a former conference organizer and current speaker who has spent many hours learning about accessibility and looking into options for captioning. I’m writing about captions here to share what I’ve learned with other conference organizers and speakers.

Closed Captions, Open Captions, and Subtitles

Closed captions provide the option to turn captions on or off while watching a video. They are usually shown at the bottom of the video. Here’s an example of one of my videos on YouTube with closed captions turned on.

YouTube video with closed captions turned on and the caption text shown along the bottom. The CC button on the bottom has a red line under it indicating it is on.
YouTube video with closed captions turned on. The CC button at the bottom has a red line under it to indicate the captions are on.

The placement of the captions may vary based upon the service used and the dimensions of the screen. For instance, if I play this video full screen on my wide screen monitor, the captions cover some of the content instead of being shown below.

Open captions are always displayed with the video – there is no option to turn them off. The experience with open captions is somewhat like watching a subtitled foreign film.

But despite captions often being referred to colloquially as subtitles, there is a difference between the two. Captions are made for those who are hard of hearing or have auditory processing issues. Captions should include any essential non-speech sound in the video as well as speaker differentiation if there are multiple speakers. Subtitles are made for viewers who can hear and just need the dialogue provided in text form.

For online conferences, I would say that closed captions are preferred, so viewers can choose whether or not to show the captions.

How Closed Captions Get Created

Captions can either be created as a sort of timed transcript that gets added to a pre-recorded video, or they can be done in real time. Live captioning is sometimes called communication access real-time translation (CART).

If you are captioning a pre-recorded video, the captions get created as a companion file to your video. There are several formats for caption files, but the most common I have seen are .SRT (SubRip Subtitle) and .VTT (Web Video Text Tracks). These are known as simple closed caption formats because they are human readable – showing a timestamp or sequence number and the caption in plain text format with a blank line between each caption.

Who Does the Captions

There are multiple options for creating captions. The first thing to understand is that captioning is a valuable service and it costs money and/or time.

In general, there are 3 broad options for creating captions on pre-recorded video:

  • Authors or conference organizers manually create a caption file
  • Presentation software creates a caption file using AI
  • A third-party service creates a caption file with human transcription, AI, or a combination of both

Manually creating a caption file

Some video editing applications allow authors to create caption files. For example, Camtasia provides a way to manually add captions or to upload a transcript and sync it to your video.

Alternatively, there is a VTT Creator that lets you upload your video, write your captions with the video shown so you get the timing right, and then output your .VTT file.

Another approach is to use text-to-speech software to create a transcript of everything said during the presentation and then edit that transcript into a caption file.

Services like YouTube offer auto-captioning, so if it’s an option to upload as a private video to get the caption file from there, that is a good start. But you will need to go back through and edit the captions to ensure accuracy with either of these approaches. Vimeo also offers automatic captioning, but the results will also need to be reviewed and edited for accuracy.

These are valid approaches when you don’t have other options, but they can be very time consuming and the quality may vary. This might be ok for one short video, but is probably not ideal for a conference.

If you are going to make presenters responsible for their own captions, you need to provide them with plenty of time to create the captions and suggest low-cost ways to auto-generate captions. I’ve seen estimates that it can take up to 5 hours for an inexperienced person to create captions for one hour of content. Please be aware of the time commitment you are requesting of your presenters if you put this responsibility on them.

Captions in Your Presentation Software

Depending on the platform you use, your presentation software might provide AI-driven live captioning services. This is also known as Automatic Speech Recognition (ASR). For example, Teams offers a live caption service. As of today (November 2020), my understanding is that Zoom, GoToMeeting, and GoToWebinar do not offer built-in live caption services. Zoom allows you to let someone type captions or integrate with a 3rd party caption service. Zoom and GoToMeeting/GoToWebinar do offer transcriptions of meeting audio after the fact using an AI service.

PowerPoint also offers live captioning via its subtitles feature. My friend Echo made a video and blog post to show the effectiveness of PowerPoint subtitles, which you can view here. There are a couple of things to note before using this PowerPoint feature:

  1. It only works while PowerPoint is in presentation mode. If you have demos or need to refer to a document or website, you will lose captions when you open the document or web browser.
  2. If you are recording a session, your subtitles will be open subtitles embedded into your video. Viewers will not be able to turn them off.
  3. The captions will only capture the audio of the presenter who is running the PowerPoint. Other speakers will not have their voice recorded and will not be included in the captions.

Google Slides also offers live captions. The same limitations noted for PowerPoint apply to Google Slides as well.

Third-Party Caption Services

There are many companies that provide captioning services for both recorded and live sessions. This can be a good route to go to ensure consistency and quality. But all services are not created equal – quality will vary. For recorded sessions, you send them video files and they give you back caption files (.VTT, .SRT, or another caption file format). They generally charge you per minute of content. Some companies offer only AI-generated captions. Others offer AI- or human-generated captions, or AI-generated captions with human review. Humans transcribing your content tends to cost more than AI, but it also tends to have a higher accuracy. But I have seen some impressively accurate AI captions. Captions on recorded content are often less expensive than live captions (CART).

Below are a few companies I have come across that offer caption services. This is NOT an endorsement. I’m listing them so you can see examples of their offerings and pricing. Most of them offer volume discount or custom pricing.

  • Otter.ai – offers AI-generated captions for both recorded and live content, bulk import/export, team vocabulary
  • 3PlayMedia – offers AI-generated and human-reviewed captions for recorded content, AI-generated captions for live content. (Their standard pricing is hidden behind a form, but it’s currently $0.60 per minute of live auto-captioning and $2.50 per minute of closed captions for recorded video.)
  • Rev – offers captions for both recorded and live content, shared glossaries and speaker names to improve accuracy.

The Described and Captioned Media Program maintains a list of captioning service vendors for your reference. If you have used a caption service for a conference and want to share your opinion to help others, feel free to leave a comment on this post.

Questions for Conference Organizers to Ask When Choosing a Captioning Vendor

For recorded or live video:

  • What is your pricing model/cost? Do you offer bulk discounts or customized pricing?
  • Where/how will captions be shown in my conference platform? (If it will overlay video content, you need to notify speakers to adjust content to make room for it. But try to avoid this issue where possible.)
  • Is there an accuracy guarantee for the captions? How is accuracy measured?
  • Can I provide a list of names and a glossary of technical terms to help improve the caption accuracy?
  • Does the captioning service support multiple speakers? Does it label speakers’ dialogue to attribute it to the right person?
  • Does the captioning service conform to DCMP or WCAG captioning standards? (Helps ensure quality and usability)
  • How does the captioning service keep my files and information secure (platform security, NDAs, etc.)?
  • What languages does the captioning service support? (Important if your sessions are not all in English)

For recorded video:

  • Does my conference platform support closed captions? (If it doesn’t, then open captions encoded into the video will be required.)
  • What file type should captions be delivered in to be added to the conference platform?
  • What is the required lead time for the captioning service to deliver the caption files?
  • How do I get videos to the caption service?

For captions on live sessions:

  • Does the live caption service integrate with my conference/webinar platform?
  • How do I get support if something goes wrong? Is there an SLA?
  • What is the expected delay from the time a word is spoken to when it appears to viewers?

Further Captioning Advice for Conference Organizers

  • Budget constraints are real, especially if you are a small conference run by volunteers that doesn’t make a profit. Low quality captions can be distracting, but no captions means you have made a decision to exclude people who need captions. Do some research on pricing from various vendors, and ask what discounts are available. You can also consider offering a special sponsorship package where a sponsor can be noted as providing captions for the conference.
  • If you are running a large conference, this should be a line item in your budget. Good captions cost money, but that isn’t an excuse to go without them.
  • If your conference includes both live and recorded sessions, you can find a vendor that does both. You’ll just want to check prices to make sure they work for you.
  • If your budget means you have to go with ASR, make sure to allow time to review and edit closed captions on recorded video.
  • Try to get a sample of the captions from your selected vendor to ensure quality beforehand. If possible for recorded videos, allow speakers to preview the captions to ensure quality. Some of them won’t, but some will. And it’s likely a few errors will have slipped through that can be caught and corrected by the speakers or the organizer team. This is especially important for deeply technical or complex topics.
  • Make sure you have plenty of lead time for recorded videos. If a speaker is a few days late delivering a video, make sure their video can still be captioned and confirm if there is an extra fee.

Final Thoughts and Recap

If you’d like more information about captions, 3PlayMedia has an Ultimate Guide to Closed Captioning with tons of good info. Feel free to share any tips or tricks you have for captioning conference sessions in the comments.

I’ve summarized the info in this post below for quick reference.

Terms to Know

  • Closed captions: captions that can be turned on and off by the viewer
  • Open captions: captions that are embedded into the video and cannot be turned off
  • CART: communication access real-time translation, a technical term for live captioning
  • ASR: automatic speech recognition, use of artificial intelligence technology to generate captions
  • .SRT and .VTT: common closed caption file formats

Choosing a Captioning Solution for Your Conference


Diagram summarizing decision points when choosing a captioning solution. For high budget, choose human generated/reviewed captions from a service. For low budget and moderate time, choose ASR captions. For no budget, choose ASR built into presentation/conference software. Otherwise, someone will need to manually create captions. If you can't provide captions, let viewers know in advance.
This diagram represents general trends and common decision points when choosing a captioning solution. Your specific situation may vary from what is shown here

Summary of Caption Solutions

Manual creation of caption files for recorded sessions
Cost: None
Time/Effort: High
Pros:
• Doesn’t require a third-party integration
• Supports closed captions
• Works no matter what application is shown on the screen
• Works no matter what application is used to record and edit video
Cons:
• Accuracy will vary widely
• Manual syntax errors can cause the file to be unusable

Upload to YouTube, Vimeo or another service that offers free captions
Cost: None to Low
Time/Effort: Medium
Pros:
• Supports closed captions
• Works no matter what application is shown on the screen
• Works no matter what application is used to record and edit video
Cons:
• Not available for live sessions
• Requires editing of captions to achieve acceptable accuracy
• Requires an account with the service and (at least temporary) permission to upload the video
• Accuracy will vary widely

Auto-generated captions in presentation software (e.g., PowerPoint, Google Slides)
Cost: Low
Time/Effort: Low
Pros:
• Works for live and recorded sessions
• No third-party integrations required
Cons:
• Requires that all presenters use presentation software with this feature
• Must be enabled by the presenter
• Won’t work when speaker is showing another application
• Often offers only open captions
• Accuracy may vary
• Often only captures one speaker

ASR (AI-generated) captions from captioning service
Cost: Medium
Time/Effort: Low
Pros:
• Works for live and recorded sessions
• Supports closed captions
• Works no matter what application is shown on the screen
• Works no matter what application is used to record and edit video
Cons:
• Accuracy may vary
• Requires planning to meet lead times for recorded sessions
• Poor viewer experience if delay is too large during live sessions

Human-generated or human-reviewed captions from a captioning service
Cost: High
Time/Effort: Low
Pros:
• Ensures the highest quality with the lowest effort from conference organizers and speakers
• Works for live and recorded sessions
• Works no matter what application is shown on the screen
• Works no matter what application is used to record and edit video
Cons:
• Requires planning to meet lead times for recorded sessions
• Poor viewer experience if delay is too large during live sessions

I hope you find this exploration of options for captions in online conference content helpful. Let me know in the comments if you have anything to add to this post to help other conference organizers.