Azure Data Factory, Logic Apps, Microsoft Technologies

Copying large files from SharePoint Online

I recently worked on a project where we needed to copy some large files from a specified library in SharePoint Online. In that library, there were several layers of folders and many different types of files. My goal was to copy files that had a certain file extension and a file name that started with a specific string.

I began by comparing the capabilities between Azure Data Factory and Logic Apps. Logic Apps allows me to start the process via a webhook. It also has a SharePoint online connector that provides easy AAD authentication. And there is a special operation that allows me to retrieve the properties (path, filename) of files without retrieving the files. This makes it easy to loop through the list of files that is already filtered to only the files I want.

Logic Apps process: 1) When an HTTP request is received 1) Get files (properties only) 3) For each 4a) Success Response 4b) Failure response
Logic App that retrieves file metadata from SharePoint and then performs an action for each file in the list

Azure Data Factory does have a SharePoint Online connector, but you can’t use it to copy files. You must use the HTTP connector to copy files from SharePoint in Data Factory. And that requires you to make a web call first to get a bearer token to be used in the authentication for the source of the copy activity.

While the ADF options work, I thought I would go with the Logic App because it felt cleaner and easier for others to understand. That is until I tried to copy some large files and encountered the following error.

Http request failed as there is an error: 'Cannot write more bytes to the buffer than the configured maximum buffer size: 104857600.'.

It turned out there were some extremely large files to be copied from that SharePoint library. SharePoint has a default limit of 100 MB buffer size, and the Get File Content action doesn’t natively support chunking.

At that point I would have to figure out how to handle the chunking using an HTTP call. That made the Logic App no better than Data Factory for my purposes. And since I already had part of the larger process in Data Factory, I went back there to see how it handled chunking.

Copying a Subset of Files from SharePoint Online Using Data Factory

And it turns out you don’t have to do anything special in Data Factory to handle the chunking of the large files – it does that work for you.

So I set up a similar process to what I had in my Logic App in my ADF pipeline.

Lookup to get file properties with path to For Each loop
ADF pipeline that retrieves a list of files in a SharePoint library and performs an action on each file

First, I used a Lookup activity to get the list of files. When you do a lookup against a SharePoint Online dataset, you can use OData filters to limit the data returned by the lookup. In my case this was necessary because I only wanted 144 of the items out of the list of 4300. Lookups can only return 5,000 rows, so I definitely needed to add filters to my query. Also, folders are considered an item in the data returned, so if you are looking only for files, you need to filter out folders (either by content type or by the name of the file).

Settings for a Lookup activity in ADF. It uses a SharePoint online source dataset. The Use query setting is set to query. The Query setting contains the text $filer=endswith(Name, .bak') and startswith(Name, 'Dama'). First row only is not enabled.
Data Factory Lookup against a SharePoint Online list using an OData query with filters

The results of the lookup, returning one item per file, are fed into a For Each activity. Inside the For Each activity, I placed a Web activity to get the necessary bearer token and a Copy activity to copy the file to blob storage.

Data Factory web activity with a path to a copy activity
Inside the For Each activity is a Web activity to retrieve a bearer token, followed by a Copy activity that copies the files to blob storage.

To get a bearer token, you need a service principal with the correct permissions in SharePoint. Then you make a POST to https://accounts.accesscontrol.windows.net/{tenantID}/tokens/OAuth/2.

You need the following header: Content-Type: application/x-www-form-urlencoded.

In the body of the HTTP call, you must include: grant_type=client_credentials&client_id={clientid}@{tenantid}&client_secret={secretvalue}&resource=00000003-0000-0ff1-ce00-000000000000/{SharePoint tenant}.sharepoint.com@{tenantID}.

Make sure to set Secure Output so your credentials aren’t written to the ADF logs. Preferably, you would store the client ID and client secret in Key Vault and use a web activity to look them up before getting the token.

You can then use the output from the web activity in your copy activity.

ADF Copy Activity Source Settings: Binary dataset. Path and Name are passed as parameters. The request method is GET. The additional header contains the auth token from the web activity.
Source settings for a Copy activity in ADF that copies files from SharePoint

The Additional Headers should be populated with the auth token as shown below:

Authorization: Bearer @{activity('WEB_GetToken').output.access_token}

Since I don’t need to read the file contents, just copy it to blob storage, I used a binary dataset. I parameterized the dataset and linked service so my For Each activity could feed it the path and name of each file. This is because the lookup provides the path separately from the file name (which includes the file extension).

Binary dataset with linked service pointing to the HTTP linked service and parameters for Path and Name. The value for Path is @dataset().Path. The value for Name is @dataset().Name. The relative URL is not populated.
Binary dataset using an HTTP linked service used for retrieve files from SharePoint online

The linked service uses the HTTP connector. It has two parameters added: path and name.

HTTP linked service with parameters for path and name. Authentication is anonymous.
HTTP linked service used to retrieve files in a SharePoint Online library in Data Factory

The base url I used in the linked service is:

@{concat('https://{tenant name}.sharepoint.com/sites/{site name}/_api/web/GetFileByServerRelativeUrl(''',replace(linkedService().Path,' ','%20'),'/',replace(linkedService().Name,' ','%20'),''')/$value')}

If my file is located at https://mytenant.sharepoint.com/sites/site1/libraryname/folder1/folder2/folder3/myfile.CSV, the URL I need to retrieve the file is https://mytenant.sharepoint.com/sites/site1/libraryname/folder1/folder2/folder3/myfile.CSV')/$value.

So once I add my destination info to the Copy Activity, I can copy all the files retrieved from my lookup to blob storage.

A Few Tricky Things

Everyone’s SharePoint tenant is set up differently. Your url will vary based upon the configuration of your site collections/sites. When I tried this in my tenant, I didn’t have to include the “site/sitename” before “/_api/web…”. But I did in my client’s tenant when I was building this.

We granted permissions to the service principal at the site collection level to get this to work. If there was a way to grant lower permissions, we couldn’t figure it out. Here the XML we used when the SharePoint admin was granting permissions.

<AppPermissionRequests AllowAppOnlyPolicy="true">  
  <AppPermissionRequest Scope=http://sharepoint/content/sitecollection/web 
   Right="Read" />
  <AppPermissionRequest Scope=http://sharepoint/content/sitecollection/web/list 
   Right="Read" />
</AppPermissionRequests>

Your auth token is valid for 1 hour. If you copy a bunch of files sequentially, and it takes longer than that, you might get a timeout error. That is why I moved my web activity inside the For Each activity. If you know it’s only a few files and they are quick, you should just get one bearer token and use it for all the files.

Lastly, be sure to set retries and timeouts appropriately. Remember that the default timeout for an activity is 7 days, and that is probably not desirable.

It took a lot of trial and error for me to get this working, as I couldn’t find many blog posts or docs about this. Special thanks to Bill Fellows for helping me through it. I sincerely hope this blog post helps someone avoid my struggles.

Accessibility, Data Visualization, Microsoft Technologies, Power BI

What are those new buttons under tab order in Power BI?

If you’ve visited the Tab order area of the Selection Pane in Power BI in the last couple of months, you might have noticed some new buttons.

The selection pane in Power BI desktop with Tab order selected. There are three buttons underneath the Tab order heading.
Three new buttons for managing tab order in Power BI

The hover text on the first button says “Expand All”. This button is useful if you have grouped visuals. Groups are indicated by a carat to the left of the item in the tab order list.

The selection pane with tab order showing. The fourth item is a group titled Summary Cards. The items within the group are not shown.
Tab order for a report page containing one group

Selecting the Expand all button shows the individual objects within a group.

The selection pane with tab order showing. The fourth item is a group titled Summary Cards. The items within the group are shown in an indented list under the group name.
Tab order with the group expanded

The second button is the Collapse All button. It will collapse the groups so only the name of the group is shown and not the individual objects within the group.

The third button is a great new addition: Have tab order match visual order.

The Selection pane is shown with Tab order selected. The third button has hover text that reads "Have tab order match visual order".
The option to have tab order match visual order is the third button under Tab order.

This button will set the tab order for the visuals on the page to sort ascending by Y and then X coordinates. Let’s look at an example.

I have a report page containing 7 textboxes.

A Power BI report page with a box in the top left corner and a grid of 6 boxes underneath, spanning the entire width of the report. The order of the boxes appears random, but it matches the order in which they were added to the page.
The X,Y coordinates of each box are shown in the box. The original tab order is indicated by the numbers in the circle.

After clicking the button to have tab order match visual order, the tab order is changed shown below.

A Power BI report page with a box in the top left corner and a grid of 6 boxes underneath, spanning the entire width of the report. The order of the boxes matches the Y and X coordinates of each visual, starting at the top left and moving down to the bottom right.
Tab order set with the top left visual being first and the bottom right visual being last

This is often the correct tab order that matches how we read the report visually. This little button can increase keyboard/screen reader accessibility in one second instead of taking a couple of minutes per page.

There will be times that this tab order will not be what you want. Some exceptions might be when you use visuals that have a different amount of space inside the visual container, so the containers are intentionally misaligned (according to the X,Y coordinates) in order for the content to appear visually aligned. Then you might need to customize your tab order a bit. Another exception might be if you have some buttons or links at the top right of the report page that you want a user to visit last (after the content of the report). In that case, you would customize your tab order to make the button last.

But the majority of the time, this option to make tab order match visual order is exactly what you need. I applaud the Power BI team for taking this step to make creating accessible reports a little easier.

Microsoft Technologies, Power BI, Workout Wednesday

Power BI, Maps, and Publish to Web

October 2021 is mapping month over at Workout Wednesday for Power BI. As part of our challenges, we build a sample report and use the Publish to Web functionality to share it on the website. While this has worked well all year, there are some visuals, including maps, that do not support or require a different license for use with Publish to Web.

It’s frustrating to build a Power BI report that you plan to share, only to find that you can’t share it. So I thought it would be helpful to consolidate what I have found about the various map visuals and their support of Publish to Web.

Disclaimer: This information is correct as of October 14, 2021. This could change over time. This is not an exhaustive list of all the map visuals available for Power BI.

Map Visuals

6 map visuals on a power bi report: a bubble map, filled map, shape map, ArcGIS map, Azure Map, Mapbox map.
Examples of the 6 map visuals tested with Publish to Web

Note: I also tested several other AppSource visuals, but they failed to render in Power BI desktop. I may update this post if they start working again.

I hope this helps you plan your visuals when you need to publicly share a report that contains a map.

Excel, Microsoft Technologies, Power BI

Connect Excel to a Power BI Dataset in a Premium Workspace with a B2B User

Power BI offers the ability for users who have access to a dataset in the Power BI service (PowerBI.com) to connect to the dataset using Excel. Normally, this feature is referred to as Analyze in Excel. Once you connect Excel to your dataset, you can create Pivot Table reports or use Cube Functions.

There are currently limitations that mean this functionality isn’t supported for B2B (external) users. An external user is an Azure AD user that is based in another tenant and has been guested into the local AAD tenant. If you go to your dataset in PowerBI.com. choose Analyze in Excel, and then try to open the downloaded file and connect to the dataset, you will be met with connection errors.

But if you have your dataset in a workspace backed by Premium Per User or Premium capacity, you can use the XMLA endpoint to connect, even if you are using a B2B user!

Instead of using the Analyze in Excel functionality, you can connect to your dataset as if it were Analysis Services, using the XMLA endpoint. B2B users just need to make one adjustment to the server name they enter to make this work.

In Excel, locate the Get Data button. Select From Database and then From Analysis Services.

Get Data menu in Excel with the options From Database and From Analysis Services selected.
Connecting to a Power BI dataset using the XMLA endpoint in Excel is done in a similar manner as connecting to an Analysis Services database

Open a browser window and go to the settings of the Power BI workspace that contains the dataset to which you want to connect.

Settings for a Power BI workspace called Demo Reports.
Power BI Premium workspaces of any kind should have the workspace connection string listed in the settings pane

If your workspace is backed by Premium capacity, you will be able to see this in the settings and the workspace connection will be available for you to copy. If you are a member user (not external) you could copy this info into the Server Name box of the Data Connection Wizard and go on your way.

If you are a B2B user, you need to make an adjustment as noted in Microsoft Docs. You need to replace “myorg” in the workspace connection with your primary domain name. If you have access to the Azure portal, you can find the primary domain name on the overview page for the Azure Active Directory.

Overview page in the Azure Portal for Azure Active Directory with the Primary Domain circled under basic information.
The tenant UPN, also called primary domain can be found in the Azure Portal on the AAD overview page

So if the workspace connection from the Power BI service is:
powerbi://api.powerbi.com/v1.0/myorg/Demo%20Reports

And your primary domain is:
mysupercooldomain.com

Then you would change the workspace connection to:

powerbi://api.powerbi.com/v1.0/mysupercooldomain.com/Demo%20Reports

Once you have populated the server name with the workspace connection string, change the logon credentials to “Use the following user name and password” but leave the credentials blank. Once you select the Next button, you will be prompted for your Azure credentials.

Then you will be able to select the desired dataset from the workspace and be on your way to making connected Excel reports.

Azure, Azure Data Factory, DCAC, Microsoft Technologies

Slides and Video from Building a Regret-free Foundation for your Data Factory Now Available

Last week, Kerry and I delivered a webinar with tips on how to set up your Data Factory. We discussed version control, deployment, naming conventions, parameterization, documentation, and more.

Here’s our agenda from the presentation.

Slide showing top regrets of data factory users: Poor resource organization in Azure
Lack of naming conventions
Inappropriate use of version control
Tedious, manual deployments
No/inconsistent key vault usage
Misunderstanding integration runtimes
Underutilizing parameterization
Lack of comments and documentation
No established pipeline design patterns
List of top regrets from Data Factory users that they wish they had understood from the beginning

If you missed the webinar, you can watch it online now. Just go to the DCAC website, fill in the required fields with your info, and the video will be shown.

If you’d like a copy of the slides, you can download the PDF here. There is a list of helpful links at the end that you may want to check out.

I hope you enjoyed our webinar. Leave me a comment if you have other experiences with ADF where a design or configuration choice you made in the beginning was difficult or tedious to fix later. Help other ADF developers avoid those mistakes.

Data Visualization, Microsoft Technologies, Power BI

CDOT Bar Chart Makeover

As I was browsing Twitter today, I noticed a tweet from the Colorado Department of Transportation about their anti-DUI campaign. Shown below, it contains a bar chart that appears to have been presented in PowerPoint.

Tweet from CDOT containing a poorly designed bar chart

There are some easy opportunities to improve the readability of this chart, so I thought I would use it as an example of how small improvements can have a big impact on a fairly simple chart. I recreated the chart (as best I could) in Power BI and then made two revised versions.

Bar chart with 5 bars representing suspected impaired driving fatalities per year. The top left contains 3 logos. The title is on the top right. Each bar is a different bright color. The axis labels are very small.
My Power BI version of the original chart
Redesigned bar chart that removes the thick orange line, moves the title to the top left, and uses only one color across the bars.
My 10-minute makeover that increased readability
Bar chart with red bars and an annotation noting the upward trend from 2019 to 2020. A gray placeholder with a question mark has been added for 2021 with the note "Don't become part of this statistic."
My additional iteration that applied a clearer message and focus

Especially when making data visualizations for the general public —and especially when you want to get people’s engagement on social media— you need to reduce perceived cognitive load. Otherwise, people won’t even bother to read your chart. If your chart feels too busy or too complicated, many people in your intended audience will feel it is not worth the effort to even try to read it and will move on down their Twitter feed to the next Anakin and Padme meme.

Watch the video below for all of the details.

21-minute video showing how to improve upon a bar chart found on Twitter using Power BI
Azure, Azure Data Factory, Azure SQL DB, Microsoft Technologies, PowerShell

Thoughts on Unique Resource Names in Azure

Each resource type in Azure has a naming scope within which the resource name must be unique. For PaaS resources such as Azure SQL Server (server for Azure SQL DB) and Azure Data Factory, the name must be globally unique within the resource type. This means that you can’t have two data factories with the same name, but you can have a data factory and a SQL server with the same name. Virtual machine names must be unique within the resource group. Azure Storage accounts must be globally unique. Azure SQL Databases should be unique within the server.

Since Azure allows you to create a data factory and a SQL server with the same resource name, you may think this is fine. But you may want to avoid this, especially if you plan on using system-defined managed identities or using Azure PowerShell/CLI. And if you aren’t planning on using these things, you might want to reconsider.

I ran into this issue of resources with the same name in a client environment and then recreated it in my Azure subscription to better understand it.

I already had a data factory named adf-deploydemo-dev so I made an Azure SQL server named adf-deploydemo-dev and added a database with the same name.

A data factory named adf-deploymentdemo-dev, a SQL Server named adf-deploymentdemo-dev, and a database named adf-deploymentdemo-dev
A data factory, a SQL Database, and a SQL Server all with the same name in the same region and same resource group

Azure Data Factory should automatically create its system-assigned managed identity. It will use the resource name for the name of the service principal. When you go to create a linked service in Azure Data Factory Studio and choose to use Managed Identity as the authentication method, you will see the name and object ID of the managed identity.

Managed identity name: adf-deploymentdemo-dev. Managed identity object ID: 575e8c6e-dfe6-4b5f-91be-40b0f0b9643b
Information shown in my data factory when creating a linked service for a storage account.

For the Azure SQL Server, we can create a managed identity using PowerShell. The Set-AzSqlServer cmdlet has an -AssignIdentity parameter, which creates the system-assigned managed identity.

Executing PowerShell command: Set-AzSqlServer -AssignIdentity -ResourceGroupName 'ADFDemployDemoDev' -ServerName 'adf-deploydemo-dev'
Executing the PowerShell command to create a managed identity

If you use Get-AzSqlServer to retrieve the information and assign the Identity property to a variable, you can then see the system-assigned managed identity and its application ID.

Executing PowerShell command: $S = Get-AzSqlServer -ResourceGroupName 'ADFDemployDemoDev' -ServerName 'adf-deploydemo-dev'
$S.Identity
The results show principalID, Type, and TenantID
Verifying the managed identity is in place for an Azure SQL server.

Now when I look in Active Directory, I can see both managed identities have the same name but different application IDs and object IDs.

Two managed identities in AAD, both called adf-deploymentdeo-dev.
Two managed service principals used for managed identities that have the same name but different IDs

Everything is technically working right now, but I have introduced some needless ambiguity that can cause misunderstandings and issues.

Let’s say that I want to grant the Storage Blob Data Reader role to my data factory. I go to the storage account, choose to add a role assignment, select the role, and then go to add members. This is what I see:

The user interface to select members to add to a role assignment shows users and service principals by name, so ti contains two objects named adf-deploydemo-dev
Which managed identity belongs to the data factory?

Or let’s say that I use PowerShell to get lists of resources by name. I may be locating resources to add tags, add a resource lock, or move the resource to another region or resource group.

Executing PowerShell command Get-AzResource - Name 'adf-deploydemo-dev' | ft
Getting resources by name returns all three resources

If I don’t specify the resource type, I will get my data factory, my database, and my server in the results. You may be saying “Well, I would always specify the type.” Even if that is true, are you sure all coworkers and consultants touching your Azure resources would do the same?

Why introduce this ambiguity when there is no need to do so?

There are some good tips in the Cloud Adoption Framework in Microsoft Docs about naming conventions. Your organization probably wants to decide up front what names are acceptable and then use Azure Policy as well as good processes to ensure adherence to your defined conventions. And if I were the consultant advising you, I would suggest that resources within your tenant be unique across resource types. The suggestion in Docs is to use a resource type abbreviation at the beginning of your resource name. That would avoid the issue I have demonstrated above. Naming conventions should be adjusted to your organization’s needs, but the ones suggested in Docs are a good place to start if you need some help. It is beneficial to have some kind of resource naming convention beyond just whatever is allowed by Azure.

DAX, Microsoft Technologies, Power BI, Power Query

Calculating Age in Power BI

In week 26 of Workout Wednesday for Power BI, I asked people to calculate the age of Nobel laureates at the time they received the award. I provided some logic, but I didn’t prescribe how to create the age calculation. This inspired a couple of questions and a round of data validation as calculating age may be trickier than you think. In this post, I’ll explore some of the ways people have calculated age in Power BI and the edge cases where those calculations may not work.

In my solution video for Workout Wednesday, I used Power Query to calculate age. This was inspired by several blog posts and videos I had seen previously. There is an Age menu option in the Power Query editor under Date.

Calculating Age with the Power Query Editor user interface

When you select a date column and use that Age option, it calculates the duration between the selected date and the current date in days. You must then replace the current date with the second date column. Next you can choose Total Years under Duration, which divides the days by 365. Finally, you must round that number down to the next integer to get years.

If you follow Ruth’s video, you can do all of that in one step that creates a custom column with the final age value.

 Number.RoundDown(Duration.TotalDays([Date2] - [Date1])/365) 

That is the most common option in Power Query as there is no DateDiff function.

There are a few options for calculating age in DAX. Some people use the DATEDIFF function.

Age DateDiff = DATEDIFF([Date1],[Date2],YEAR) 

Another way I have seen is to use YEARFRAC function.

Age YearFrac = INT ( YEARFRAC ( [Date1], [Date2], 1 ) )

The way Marco Russo suggests is to use QUOTIENT.

Age Quotient (DAX): 

Age Quotient = 

VAR Birthdate = [Date1]

VAR ThisDay = [Date2]

VAR IntBirthdate = YEAR ( Birthdate ) * 10000 + MONTH ( Birthdate ) * 100 + DAY ( Birthdate )

VAR IntThisDay = YEAR ( ThisDay ) * 10000 + MONTH ( ThisDay ) * 100 + DAY ( ThisDay )

VAR Age = QUOTIENT ( IntThisDay - IntBirthdate, 10000 )

VAR CheckedAge = DIVIDE ( Age, NOT ISBLANK ( Birthdate ) )

RETURN

    CheckedAge

As Marco points out, many people were using YEARFRAC, but there is a bug in the DAX implementation that causes it to occasionally return an incorrect answer for this purpose.

Checking the Numbers

I created a Power BI file to demonstrate the differences in these four calculations. You can download the file here. The image below displays the results in several tests. For each row, I’m using Date1 as the birthdate and Date2 as the “as of” date. You’ll notice that I focused on leap years for a few cases.

Table in Power BI with 10 date ranges showing the results from the four calculations. 6 of the 10 rows have different results across the calculations.
Example date ranges and result of the four age calculations

There are six of ten date ranges that have different results across the different calculation methods.

In the second row, the Power Query age calculation says that Feb 29 to Feb 28 in the following year is a full year. This may or may not be what you want depending on your requirements. I’m noting the difference so you can be aware. A similar thing occurs in the fifth row going from Feb 29, 2016 to Feb 28, 2020, and again on the 9th row going from March 1, 2019 to Feb 29, 2020.

On the third row, notice that the DAX DATEDIFF function calculates Feb 29 to Feb 27 of the following year to be a full year, despite it being a day or two short. Depending on what you do with leap years, you might consider Feb 29 to Feb 28 in the following year to be a full year, but that third row result means DATEDIFF is probably not the calculation I want. We see a similar result going from March 1 to Feb 28 of the following year.

YEARFRAC calculates that Feb 29 to Feb 28 in the following year is not a full year, which may be desirable. But it counts Feb 29, 2016 to Feb 29, 2020 as only three years. And we see that March 1, 2000 to March 1, 2021 is only counted as 20 years. So even without starting on a leap year, we get some incorrect results. Small numbers seem to be correct until it gets to about 13 years.

Using the QUOTIENT function provides what I consider to be the most correct results. It calculates Feb 29 to Feb 28 of the following year to be less than a year. It calculates Feb 29, 2016 to Feb 28, 2016 to be three years and not four. And it calculates March 1 to Feb 29 of the following year to be less than a year.

Which to use?

The QUOTIENT formula produces the most accurate results if you don’t want Feb 29 to Feb 28 the next year to be counted as a year. DATEDIFF and YEARFRAC produce too many incorrect results for me to ever suggest using them. Since there is a DAX option that produces more correct answers, I would just go for QUOTIENT instead of either of these two.

UPDATE: There is a better alternative! Imke Feldmann reminded me that there is an Number.IntegerDivide function in Power Query. So let’s take the logic from Marco’s DAX calculation and move it to Power Query:

(BirthDate as date, EndDate as date) =>
let
BirthDateInt = Date.Year(BirthDate)10000 + Date.Month(BirthDate)100 + Date.Day(BirthDate),
EndDateInt = Date.Year(EndDate)10000 + Date.Month(EndDate)100 + Date.Day(EndDate),
Age = Number.IntegerDivide((EndDateInt - BirthDateInt),10000)
in Age

The Power Query custom column created by invoking this function should produce better compression than a DAX calculated column. This might not be significant for a small dataset, but we should be efficient when we can.

Azure, Azure SQL DB, Microsoft Technologies, T-SQL

Altering a Computed Column in a Temporal Table in Azure SQL

System-versioned temporal tables were introduced in SQL Server 2016. They provide information about data stored in the table at any point in time by storing an effective dated version of each row rather than only the data that is correct at the current time

You can alter a temporal table to add or change columns, but you must first turn off system versioning. Let’s look at an example.

CREATE TABLE [dbo].[DatabaseSize](
	 [DatabaseID] [varchar](200) NOT NULL 
	,[ServerName] [varchar](100) NOT NULL
	,[DatabaseName] [varchar](100) NOT NULL
	,[SizeBytes] [bigint] NULL
	,[SizeMB]  AS ([SizeBytes]/(1048576))
	,[ValidFrom] [datetime2](7) GENERATED ALWAYS AS ROW START NOT NULL
	,[ValidTo] [datetime2](7) GENERATED ALWAYS AS ROW END NOT NULL
	,PERIOD FOR SYSTEM_TIME (ValidFrom, ValidTo)
	,CONSTRAINT PK_DatabaseSize_DatabaseID PRIMARY KEY CLUSTERED (DatabaseID)
) WITH (SYSTEM_VERSIONING = ON (HISTORY_TABLE = [dbo].[DatabaseSizeHistory]));

Temporal tables must have a primary key defined. They also must contain two datetime2 columns, declared as GENERATED ALWAYS AS ROW START / END. The statement above creates both the current table and a history table.

The history table has the same schema as the current table, with one difference: the SizeMB column in the history table is not a computed column.

The dbo.DatabaseSize table is a system-versioned table. The DatabaseSizeHistory table is the related history table. DatabaseSizeHistory contains the same columns as DatabaseSize, except the SizeMB column is not a computed column in the history table.


When I initially created the table, I typoed the formula in the computed column. You can’t alter a computed column — you must drop and recreate the column. This is no problem, just turn off system versioning and alter your table, and turn system versioning back on.

But if you try this without specifying your history table, you will find that it stops using the history table created earlier and makes a new history table.

dbo.DatabaseSize is a system versioned table. The history table now shows as dbo.MSSL_TemporalHistoryFor_1909581841

If you specify your history table while turning system versioning back on, you will encounter an error:

Setting SYSTEM_VERSIONING to ON failed because column 'SizeMB' at ordinal 5 in history table 'Test.dbo.DatabaseSizeHistory' has a different name than the column 'ValidFrom' at the same ordinal in table 'Test.dbo.databasesize'.

Temporal tables match the columns between the current table and history table not only by name and data type but by the column’s ordinal position. Dropping and adding the computed column changed its order as it was added to the end of the table.

You can change the column order of a table in the SQL Server Management Studio UI by right-clicking on the table, selecting Design, and then dragging the column to the correct position. Note that you cannot do this on the system-versioned table while system versioning is on. You can either change the column order on the history table, or turn system versioning off and then change the current table.

dragging the SizeMB column to the bottom of the columns list in the table.

Once the column orders match, you can turn system versioning back on and specify the the history table.

ALTER TABLE [dbo].[DatabaseSize]
SET (SYSTEM_VERSIONING = ON (HISTORY_TABLE = [dbo].[DatabaseSizeHistory]));

This time, the command should complete successfully. You’ll want to drop any unused auto-created history tables before you’re finished.

Azure, Azure Data Factory, Microsoft Technologies

Control Flow Limitations in Data Factory

Control Flow activities in Data Factory involve orchestration of pipeline activities including chaining activities in a sequence, branching, defining parameters at the pipeline level, and passing arguments while invoking the pipeline. They also include custom-state passing and looping containers.

The activities list in the ADF Author & Manage app, showing Lookup, Set variable, Filter, For Each, Switch, and more.
Control Flow activities in the Data Factory user interface

If you’ve been using Azure Data Factory for a while, you might have hit some limitations that don’t exist in tools like SSIS or Databricks. Knowing these limitations up front can help you design better pipelines, so I’m listing a few here of which you’ll want to be aware.

  1. You cannot nest For Each activities.
    Within a pipeline, you cannot place a For Each activity inside of another For Each activity. If you need to iterate through two datasets you have two main options. You can combine the two datasets before you iterate over them. Or you can use a parent/child pipeline design where you move the inner For Each activity into the child pipeline. Fun fact: currently the Data Factory UI won’t stop you from nesting For Each activities. You won’t find out until you try to execute the pipeline.
  2. You cannot put a For Each activity or Switch activity inside of an If activity.
    The Data Factory UI will prevent you from doing this by removing the For Each and Switch from the activity list. You can redesign the pipeline to put the inner activity inside a child pipeline. Also note that you can put an If activity inside of a For Each activity.
  3. You cannot use a Set Variable activity inside a For Each activity that runs in parallel.
    The Data Factory UI won’t stop you, but you’ll quickly learn that the scope of the variable is the pipeline and not the For Each or any other activity. So you’ll just overwrite the value in no particular order as the activities execute in parallel. The workaround for this is specific to your use case. You might try using an existing attribute of the item you are iterating on instead of setting a variable. Append Variable works fine, since each loop could add a value. But again, don’t count on the order being meaningful.
  4. You cannot nest If activities.
    The Data Factory UI will prevent you from nesting the If activities. If you need to have two sets of conditions, you can either combine conditions or move the inner condition to a child pipeline.
  5. You cannot nest Switch activities.
    Similar to the If activity, the Data Factory UI will prevent you from nesting Switch activities. And again, you can either combine conditions or move the inner condition to a child pipeline.
  6. You cannot put a For Each or If activity inside a Switch activity.
    The Data Factory UI will prevent you from doing this. You can move the inner activity to a child pipeline if needed.
  7. You cannot use an expression to populate the pipeline in an Execute Pipeline activity.
    It would be great to design a truly dynamic pipeline where you could have a dataset that defines which pipelines to execute, but you can’t do that natively in the Data Factory UI. The Invoked Pipeline property doesn’t allow dynamic expressions. If you need to dynamically execute pipelines, you can use Logic Apps or Azure Functions to execute the pipeline.
  8. You cannot dynamically populate the variable name in Set Variable and Append Variable activities.
    The Data Factory UI only allows you to choose from a list of existing variables. As a workaround, you could use an If activity to determine which variable you will populate.
  9. The Lookup activity has a maximum of 5,000 rows and a maximum size of 4 MB.
    If you need to iterate over more than 5000 rows, you’ll need to split your list between a child and parent pipeline.

In addition to the items mentioned above, also note these resource limits listed in Microsoft Docs. Limits like 40 activities per pipeline (including inner activities for containers) can bite you if you aren’t careful about implementing a modular design. And if you do have a modular design with lots of pipelines calling other pipelines, be aware that you are limited to 100 queued runs per pipeline and 1,000 concurrent pipeline activity runs per subscription per Azure Integration Runtime region. I don’t hit these limits too often, but I have hit them.

This is not to say you can’t create good solutions in Azure Data Factory—you absolutely can. But Data Factory has some limitations that you might not expect if you have experience working with other data integration/orchestration tools.

Have you hit any other limits that caused you to design your pipelines differently that you would like to share with others? If so, leave me a comment.