Azure, Azure Data Factory, KQL, Microsoft Technologies

Looking at Activity Queue Times from Azure Data Factory with Log Analytics

I’ve been working on a project to populate an Operational Data Store using Azure Data Factory (ADF). We have been seeking to tune our pipelines so we can import data every 15 minutes. After tuning the queries and adding useful indexes to target databases, we turned our attention to the ADF activity durations and queue times.

Data Factory places the pipeline activities into a queue, where they wait until they can be executed. If your queue time is long, it can mean that the Integration Runtime on which the activity is executing is waiting on resources (CPU, memory, networking, or otherwise), or that you need to increase the concurrent job limit.

You can see queue time in the ADF Monitor by looking at the output of an activity.

Azure Data Factory pipeline execution details in the ADF monitor. The activity output is open, showing the resulting JSON. There is a property called durationInQueue, which contains the queue time in seconds. The result shown is 2 seconds.
Output from a stored procedure activity showing 2 seconds of queue time

But what if you want to see activity queue times across activities, across pipelines, or even across data factories? Then you need to output your logs to somewhere that makes this easier.

Send ADF Logs to Log Analytics

You can output your ADF logs to a storage account, to Log Analytics, to an event hub, or to a partner solution. I prefer Log Analytics because it’s easy to query and look for trends using KQL.

To configure the output to Log Analytics, you must create a Log Analytics workspace (if you don’t have an existing one) and add a diagnostic setting to the data factory resource. Once you have data feeding into Log Analytics, you can query it.

If you choose resource-specific destination tables in the diagnostic setting, you will find a table in Log Analytics called ADFActivityRun. This table contains a column called Output. The Output column contains the JSON we see in the ADF Studio Monitor app.

KQL has functions for parsing JSON and retrieving only the JSON objects I want to include. This means that I could write a query like the following.

ADFActivityRun
| extend queuetime = extractjson('$.durationInQueue.integrationRuntimeQueue',Output, typeof(int))
| where Status == 'Succeeded'
| where queuetime > 0
| project Start, End, PipelineName, ActivityType, ActivityName, dur = datetime_diff('second', End, Start), queuetime, PipelineRunId, ActivityRunId
| sort by queuetime desc

This query gives me a list of activities with successful executions that have queue times greater than zero. I can choose from any columns in the ADFActivityRun table, including the pipeline and activity names, start and end times, activity types, and run IDs. Duration is not an available column so I had to calculate it by calculating the difference between the start and end time. The queue time is buried in the JSON in the Output column, so I used the extractjson function to get the duration in queue value.

Now that I know how to get the queue duration, I can look for trends across various slices of data. A query to get average queue time by activity type might look like the below.

ADFActivityRun
| where Status == 'Succeeded'
| where startofday(Start) == datetime(2022-01-04)
| extend queuetime = extractjson('$.durationInQueue.integrationRuntimeQueue', Output, typeof(int))
| summarize avg_queuetime = avg(queuetime) by ActivityType
| sort by avg_queuetime desc

In this query, I am retrieving activities with successful executions that started on January 4, 2022. I added my calculation to retrieve queue time. Then I calculated average queue time by activity type using the summarize operator and sorted the result descending by queue time.

I could have filtered on any other available activity: pipeline name, activity name, integration runtime, resource ID, and more.

This information is available via the API as well, but a Log Analytics workspace can be spun up and running in minutes without having to write code to connect.

If you are used to writing SQL, the transition to KQL isn’t too bad. Check out the SQL to Kusto query translation page in Microsoft Docs.

Azure, Azure Data Factory, Microsoft Technologies

When You Can’t Change the Connected Git Repo on ADF

I was working on an Azure Data Factory project for a client who is new to ADF, and there was a miscommunication about the new Git Repo to be used for source control. Someone had created a new project and repo instead of using the existing one created for this purpose. This isn’t a big deal, as it’s easy enough to change in ADF Studio.

The Git Configuration page in Azure Data Factory Studio shows the connected repository and has a button to disconnect from the repository.
The Management Hub in Azure Data Factory Studio contains the Git configuration settings

In the Management Hub, you can change the Git configuration for the data factory, and there is a button near the top to disconnect the repo. You may need to do this if you run into conflicts in the publish branch or when you need to change repos.

It should be as easy as that, but I ran into a situation where it wasn’t.

Disconnect Button Unavailable

When I arrived at the Git configuration page, I found the Disconnect button to be disabled. This was confusing as I am an Owner and Data Factory Contributor on this resource.

I asked my client to disconnect the repo and moved on with the project, but I also logged some feedback for the Data Factory team. You can do this by selecting the Feedback button near the top right of the page in Azure Data Factory Studio. I have done this a few times and always received a prompt response.

The feedback button in ADF Studio is selected. It opens a new dialog that allows you to choose from two options: "I have feedback" and "I have a feature suggestion". From there, you can enter your email address and your feedback.
You can send feedback from within Azure Data Factory Studio

I sent my feedback and received an email response within a couple of days. It contained a few helpful bits of information.

To disconnect git from the ADF management tab, you must be in git mode and be able to access the repository. If you are not able to access the repository but have permissions to update the factory, you can remove the git configuration via the REST API. Here is documentation on this API call: Factories – Configure Factory Repo – REST API (Azure Data Factory) | Microsoft Docs. In this case, repoConfiuration should be set to null in the request body, and the rest of the PUT body should be the same as the existing settings.

Alternatively, the git connection can be removed from the ADF management tab by either gaining access to the repository or having another ADF user with access remove the configuration. I also understand that this is not an ideal experience, so I have filed a work item to remove the requirement to have access to the repository.

This identified my situation exactly. I did not have access to the new repo that had been connected to the data factory. That caused me to be unable to disconnect it.

Hopefully the work item that removes the requirement for repo access before you can disconnect the data factory will be completed soon, but if you run into this issue, you can resolve it on your own using the API or a colleague with repo access.

Azure Data Factory, Logic Apps, Microsoft Technologies

Copying large files from SharePoint Online

I recently worked on a project where we needed to copy some large files from a specified library in SharePoint Online. In that library, there were several layers of folders and many different types of files. My goal was to copy files that had a certain file extension and a file name that started with a specific string.

I began by comparing the capabilities between Azure Data Factory and Logic Apps. Logic Apps allows me to start the process via a webhook. It also has a SharePoint online connector that provides easy AAD authentication. And there is a special operation that allows me to retrieve the properties (path, filename) of files without retrieving the files. This makes it easy to loop through the list of files that is already filtered to only the files I want.

Logic Apps process: 1) When an HTTP request is received 1) Get files (properties only) 3) For each 4a) Success Response 4b) Failure response
Logic App that retrieves file metadata from SharePoint and then performs an action for each file in the list

Azure Data Factory does have a SharePoint Online connector, but you can’t use it to copy files. You must use the HTTP connector to copy files from SharePoint in Data Factory. And that requires you to make a web call first to get a bearer token to be used in the authentication for the source of the copy activity.

While the ADF options work, I thought I would go with the Logic App because it felt cleaner and easier for others to understand. That is until I tried to copy some large files and encountered the following error.

Http request failed as there is an error: 'Cannot write more bytes to the buffer than the configured maximum buffer size: 104857600.'.

It turned out there were some extremely large files to be copied from that SharePoint library. SharePoint has a default limit of 100 MB buffer size, and the Get File Content action doesn’t natively support chunking.

At that point I would have to figure out how to handle the chunking using an HTTP call. That made the Logic App no better than Data Factory for my purposes. And since I already had part of the larger process in Data Factory, I went back there to see how it handled chunking.

Copying a Subset of Files from SharePoint Online Using Data Factory

And it turns out you don’t have to do anything special in Data Factory to handle the chunking of the large files – it does that work for you.

So I set up a similar process to what I had in my Logic App in my ADF pipeline.

Lookup to get file properties with path to For Each loop
ADF pipeline that retrieves a list of files in a SharePoint library and performs an action on each file

First, I used a Lookup activity to get the list of files. When you do a lookup against a SharePoint Online dataset, you can use OData filters to limit the data returned by the lookup. In my case this was necessary because I only wanted 144 of the items out of the list of 4300. Lookups can only return 5,000 rows, so I definitely needed to add filters to my query. Also, folders are considered an item in the data returned, so if you are looking only for files, you need to filter out folders (either by content type or by the name of the file).

Settings for a Lookup activity in ADF. It uses a SharePoint online source dataset. The Use query setting is set to query. The Query setting contains the text $filer=endswith(Name, .bak') and startswith(Name, 'Dama'). First row only is not enabled.
Data Factory Lookup against a SharePoint Online list using an OData query with filters

The results of the lookup, returning one item per file, are fed into a For Each activity. Inside the For Each activity, I placed a Web activity to get the necessary bearer token and a Copy activity to copy the file to blob storage.

Data Factory web activity with a path to a copy activity
Inside the For Each activity is a Web activity to retrieve a bearer token, followed by a Copy activity that copies the files to blob storage.

To get a bearer token, you need a service principal with the correct permissions in SharePoint. Then you make a POST to https://accounts.accesscontrol.windows.net/{tenantID}/tokens/OAuth/2.

You need the following header: Content-Type: application/x-www-form-urlencoded.

In the body of the HTTP call, you must include: grant_type=client_credentials&client_id={clientid}@{tenantid}&client_secret={secretvalue}&resource=00000003-0000-0ff1-ce00-000000000000/{SharePoint tenant}.sharepoint.com@{tenantID}.

Make sure to set Secure Output so your credentials aren’t written to the ADF logs. Preferably, you would store the client ID and client secret in Key Vault and use a web activity to look them up before getting the token.

You can then use the output from the web activity in your copy activity.

ADF Copy Activity Source Settings: Binary dataset. Path and Name are passed as parameters. The request method is GET. The additional header contains the auth token from the web activity.
Source settings for a Copy activity in ADF that copies files from SharePoint

The Additional Headers should be populated with the auth token as shown below:

Authorization: Bearer @{activity('WEB_GetToken').output.access_token}

Since I don’t need to read the file contents, just copy it to blob storage, I used a binary dataset. I parameterized the dataset and linked service so my For Each activity could feed it the path and name of each file. This is because the lookup provides the path separately from the file name (which includes the file extension).

Binary dataset with linked service pointing to the HTTP linked service and parameters for Path and Name. The value for Path is @dataset().Path. The value for Name is @dataset().Name. The relative URL is not populated.
Binary dataset using an HTTP linked service used for retrieve files from SharePoint online

The linked service uses the HTTP connector. It has two parameters added: path and name.

HTTP linked service with parameters for path and name. Authentication is anonymous.
HTTP linked service used to retrieve files in a SharePoint Online library in Data Factory

The base url I used in the linked service is:

@{concat('https://{tenant name}.sharepoint.com/sites/{site name}/_api/web/GetFileByServerRelativeUrl(''',replace(linkedService().Path,' ','%20'),'/',replace(linkedService().Name,' ','%20'),''')/$value')}

If my file is located at https://mytenant.sharepoint.com/sites/site1/libraryname/folder1/folder2/folder3/myfile.CSV, the URL I need to retrieve the file is https://mytenant.sharepoint.com/sites/site1/libraryname/folder1/folder2/folder3/myfile.CSV')/$value.

So once I add my destination info to the Copy Activity, I can copy all the files retrieved from my lookup to blob storage.

A Few Tricky Things

Everyone’s SharePoint tenant is set up differently. Your url will vary based upon the configuration of your site collections/sites. When I tried this in my tenant, I didn’t have to include the “site/sitename” before “/_api/web…”. But I did in my client’s tenant when I was building this.

We granted permissions to the service principal at the site collection level to get this to work. If there was a way to grant lower permissions, we couldn’t figure it out. Here the XML we used when the SharePoint admin was granting permissions.

<AppPermissionRequests AllowAppOnlyPolicy="true">  
  <AppPermissionRequest Scope=http://sharepoint/content/sitecollection/web 
   Right="Read" />
  <AppPermissionRequest Scope=http://sharepoint/content/sitecollection/web/list 
   Right="Read" />
</AppPermissionRequests>

Your auth token is valid for 1 hour. If you copy a bunch of files sequentially, and it takes longer than that, you might get a timeout error. That is why I moved my web activity inside the For Each activity. If you know it’s only a few files and they are quick, you should just get one bearer token and use it for all the files.

Lastly, be sure to set retries and timeouts appropriately. Remember that the default timeout for an activity is 7 days, and that is probably not desirable.

It took a lot of trial and error for me to get this working, as I couldn’t find many blog posts or docs about this. Special thanks to Bill Fellows for helping me through it. I sincerely hope this blog post helps someone avoid my struggles.

Azure, Azure Data Factory, DCAC, Microsoft Technologies

Slides and Video from Building a Regret-free Foundation for your Data Factory Now Available

Last week, Kerry and I delivered a webinar with tips on how to set up your Data Factory. We discussed version control, deployment, naming conventions, parameterization, documentation, and more.

Here’s our agenda from the presentation.

Slide showing top regrets of data factory users: Poor resource organization in Azure
Lack of naming conventions
Inappropriate use of version control
Tedious, manual deployments
No/inconsistent key vault usage
Misunderstanding integration runtimes
Underutilizing parameterization
Lack of comments and documentation
No established pipeline design patterns
List of top regrets from Data Factory users that they wish they had understood from the beginning

If you missed the webinar, you can watch it online now. Just go to the DCAC website, fill in the required fields with your info, and the video will be shown.

If you’d like a copy of the slides, you can download the PDF here. There is a list of helpful links at the end that you may want to check out.

I hope you enjoyed our webinar. Leave me a comment if you have other experiences with ADF where a design or configuration choice you made in the beginning was difficult or tedious to fix later. Help other ADF developers avoid those mistakes.

Azure, Azure Data Factory, Azure SQL DB, Microsoft Technologies, PowerShell

Thoughts on Unique Resource Names in Azure

Each resource type in Azure has a naming scope within which the resource name must be unique. For PaaS resources such as Azure SQL Server (server for Azure SQL DB) and Azure Data Factory, the name must be globally unique within the resource type. This means that you can’t have two data factories with the same name, but you can have a data factory and a SQL server with the same name. Virtual machine names must be unique within the resource group. Azure Storage accounts must be globally unique. Azure SQL Databases should be unique within the server.

Since Azure allows you to create a data factory and a SQL server with the same resource name, you may think this is fine. But you may want to avoid this, especially if you plan on using system-defined managed identities or using Azure PowerShell/CLI. And if you aren’t planning on using these things, you might want to reconsider.

I ran into this issue of resources with the same name in a client environment and then recreated it in my Azure subscription to better understand it.

I already had a data factory named adf-deploydemo-dev so I made an Azure SQL server named adf-deploydemo-dev and added a database with the same name.

A data factory named adf-deploymentdemo-dev, a SQL Server named adf-deploymentdemo-dev, and a database named adf-deploymentdemo-dev
A data factory, a SQL Database, and a SQL Server all with the same name in the same region and same resource group

Azure Data Factory should automatically create its system-assigned managed identity. It will use the resource name for the name of the service principal. When you go to create a linked service in Azure Data Factory Studio and choose to use Managed Identity as the authentication method, you will see the name and object ID of the managed identity.

Managed identity name: adf-deploymentdemo-dev. Managed identity object ID: 575e8c6e-dfe6-4b5f-91be-40b0f0b9643b
Information shown in my data factory when creating a linked service for a storage account.

For the Azure SQL Server, we can create a managed identity using PowerShell. The Set-AzSqlServer cmdlet has an -AssignIdentity parameter, which creates the system-assigned managed identity.

Executing PowerShell command: Set-AzSqlServer -AssignIdentity -ResourceGroupName 'ADFDemployDemoDev' -ServerName 'adf-deploydemo-dev'
Executing the PowerShell command to create a managed identity

If you use Get-AzSqlServer to retrieve the information and assign the Identity property to a variable, you can then see the system-assigned managed identity and its application ID.

Executing PowerShell command: $S = Get-AzSqlServer -ResourceGroupName 'ADFDemployDemoDev' -ServerName 'adf-deploydemo-dev'
$S.Identity
The results show principalID, Type, and TenantID
Verifying the managed identity is in place for an Azure SQL server.

Now when I look in Active Directory, I can see both managed identities have the same name but different application IDs and object IDs.

Two managed identities in AAD, both called adf-deploymentdeo-dev.
Two managed service principals used for managed identities that have the same name but different IDs

Everything is technically working right now, but I have introduced some needless ambiguity that can cause misunderstandings and issues.

Let’s say that I want to grant the Storage Blob Data Reader role to my data factory. I go to the storage account, choose to add a role assignment, select the role, and then go to add members. This is what I see:

The user interface to select members to add to a role assignment shows users and service principals by name, so ti contains two objects named adf-deploydemo-dev
Which managed identity belongs to the data factory?

Or let’s say that I use PowerShell to get lists of resources by name. I may be locating resources to add tags, add a resource lock, or move the resource to another region or resource group.

Executing PowerShell command Get-AzResource - Name 'adf-deploydemo-dev' | ft
Getting resources by name returns all three resources

If I don’t specify the resource type, I will get my data factory, my database, and my server in the results. You may be saying “Well, I would always specify the type.” Even if that is true, are you sure all coworkers and consultants touching your Azure resources would do the same?

Why introduce this ambiguity when there is no need to do so?

There are some good tips in the Cloud Adoption Framework in Microsoft Docs about naming conventions. Your organization probably wants to decide up front what names are acceptable and then use Azure Policy as well as good processes to ensure adherence to your defined conventions. And if I were the consultant advising you, I would suggest that resources within your tenant be unique across resource types. The suggestion in Docs is to use a resource type abbreviation at the beginning of your resource name. That would avoid the issue I have demonstrated above. Naming conventions should be adjusted to your organization’s needs, but the ones suggested in Docs are a good place to start if you need some help. It is beneficial to have some kind of resource naming convention beyond just whatever is allowed by Azure.

Azure, Azure Data Factory, Microsoft Technologies

Control Flow Limitations in Data Factory

Control Flow activities in Data Factory involve orchestration of pipeline activities including chaining activities in a sequence, branching, defining parameters at the pipeline level, and passing arguments while invoking the pipeline. They also include custom-state passing and looping containers.

The activities list in the ADF Author & Manage app, showing Lookup, Set variable, Filter, For Each, Switch, and more.
Control Flow activities in the Data Factory user interface

If you’ve been using Azure Data Factory for a while, you might have hit some limitations that don’t exist in tools like SSIS or Databricks. Knowing these limitations up front can help you design better pipelines, so I’m listing a few here of which you’ll want to be aware.

  1. You cannot nest For Each activities.
    Within a pipeline, you cannot place a For Each activity inside of another For Each activity. If you need to iterate through two datasets you have two main options. You can combine the two datasets before you iterate over them. Or you can use a parent/child pipeline design where you move the inner For Each activity into the child pipeline. Fun fact: currently the Data Factory UI won’t stop you from nesting For Each activities. You won’t find out until you try to execute the pipeline.
  2. You cannot put a For Each activity or Switch activity inside of an If activity.
    The Data Factory UI will prevent you from doing this by removing the For Each and Switch from the activity list. You can redesign the pipeline to put the inner activity inside a child pipeline. Also note that you can put an If activity inside of a For Each activity.
  3. You cannot use a Set Variable activity inside a For Each activity that runs in parallel.
    The Data Factory UI won’t stop you, but you’ll quickly learn that the scope of the variable is the pipeline and not the For Each or any other activity. So you’ll just overwrite the value in no particular order as the activities execute in parallel. The workaround for this is specific to your use case. You might try using an existing attribute of the item you are iterating on instead of setting a variable. Append Variable works fine, since each loop could add a value. But again, don’t count on the order being meaningful.
  4. You cannot nest If activities.
    The Data Factory UI will prevent you from nesting the If activities. If you need to have two sets of conditions, you can either combine conditions or move the inner condition to a child pipeline.
  5. You cannot nest Switch activities.
    Similar to the If activity, the Data Factory UI will prevent you from nesting Switch activities. And again, you can either combine conditions or move the inner condition to a child pipeline.
  6. You cannot put a For Each or If activity inside a Switch activity.
    The Data Factory UI will prevent you from doing this. You can move the inner activity to a child pipeline if needed.
  7. You cannot use an expression to populate the pipeline in an Execute Pipeline activity.
    It would be great to design a truly dynamic pipeline where you could have a dataset that defines which pipelines to execute, but you can’t do that natively in the Data Factory UI. The Invoked Pipeline property doesn’t allow dynamic expressions. If you need to dynamically execute pipelines, you can use Logic Apps or Azure Functions to execute the pipeline.
  8. You cannot dynamically populate the variable name in Set Variable and Append Variable activities.
    The Data Factory UI only allows you to choose from a list of existing variables. As a workaround, you could use an If activity to determine which variable you will populate.
  9. The Lookup activity has a maximum of 5,000 rows and a maximum size of 4 MB.
    If you need to iterate over more than 5000 rows, you’ll need to split your list between a child and parent pipeline.

In addition to the items mentioned above, also note these resource limits listed in Microsoft Docs. Limits like 40 activities per pipeline (including inner activities for containers) can bite you if you aren’t careful about implementing a modular design. And if you do have a modular design with lots of pipelines calling other pipelines, be aware that you are limited to 100 queued runs per pipeline and 1,000 concurrent pipeline activity runs per subscription per Azure Integration Runtime region. I don’t hit these limits too often, but I have hit them.

This is not to say you can’t create good solutions in Azure Data Factory—you absolutely can. But Data Factory has some limitations that you might not expect if you have experience working with other data integration/orchestration tools.

Have you hit any other limits that caused you to design your pipelines differently that you would like to share with others? If so, leave me a comment.

Azure, Azure Data Factory, Microsoft Technologies

Azure Data Factory Activity Failures and Pipeline Outcomes

Question: When an activity in a Data Factory pipeline fails, does the entire pipeline fail?
Answer: It depends

In Azure Data Factory, a pipeline is a logical grouping of activities that together perform a task. It is the unit of execution – you schedule and execute a pipeline. Activities in a pipeline define actions to perform on your data. Activities can be categorized as data movement, data transformation, or control activities.

In many instances, when an activity fails during a pipeline run, the pipeline run will report failure as well. But this is not always the case.

There are two main scenarios where an activity would report failure, but the pipeline would report success:

  • The maximum number of retry attempts is greater than 0, and the initial activity execution fails but the second attempt succeeds
  • The failed activity has a failure path or a completion path to a subsequent activity and no success path

Retry Attempts

In the General settings of any activity is a property called Retry. This is the number of times Data Factory can try to execute the activity again if the initial execution fails. The default number of retries is 0. If we execute a pipeline containing one activity with the default Retry setting, the failure of the activity would cause the pipeline to fail.

Data Factory Web UI  showing the General settings of an activity with the Retry property
Data Factory Activity General settings showing the Retry Property

I often set retries to a non-zero number in copy activities, lookups, and data flows in case there are transient issues that would cause a failure that might not be present if we waited 30 seconds and tried the activity again.

Data Factory Monitoring activity runs within a pipeline. An activity failed the first time, was rerun, and succeeded the second time
Output of a Data Factory activity that was executed and initially failed. Since it was set to have 1 retry, it executed again and succeeded. If nothing else in the pipeline failed, the pipeline would report success.

Dependency with a Failure Condition

Activities are linked together via dependencies. A dependency has a condition of one of the following: Succeeded, Failed, Skipped, or Completed. If we have a pipeline containing Activity1 and Activity2, and Activity2 has a success dependency on Activity1, it will only execute if Activity1 is successful. In this scenario, if Activity1 fails, the pipeline will fail.

Activity1 has a success path to Activity2. Activity1 failed so Activity2 did not execute.
Because Activity1 failed, Activity2 is not executed and the pipeline fails.

But if we have a pipeline with two activities where Activity2 has a failure dependency on Activity1, the pipeline will not fail just because Activity1 failed. If Activity1 fails and Activity2 succeeds, the pipeline will succeed. This scenario is treated as a try-catch block by Data Factory.

Activity1 has a failure path to Activity2. Activity1 failed and Activity2 succeeded.
The failure dependency means this pipeline reports success.

Now let’s say we have a pipeline with 3 activities, where Activity1 has a success path to Activity2 and a failure path to Activity3. If Activity1 fails and Activity3 succeeds, the pipeline will fail. The presence of the success path alongside the failure path changes the outcome reported by the pipeline, even though the activity executions from the pipeline are the same as the previous scenario.

Activity1 has a success path to Activity2 and a failure path to Activity3. Activity1 failed, Activity2 was skipped, and Activity3 succeeded.
Activity1 fails, Activity2 is skipped, and Activity3 succeeds. The pipeline reports failure.

What This Means for Monitoring

This difference between pipeline and activity status has a few implications of which we should be aware as we monitor our data factories.

If we are using Azure Monitor alerts, we need to understand that setting an alert for pipeline failures doesn’t catch all activity failures. If there is a retry of an activity and the second attempt is successful, there would be an activity failure but no pipeline failure.

Conversely, if we set an alert to notify us of activity failures, and we have a pipeline designed with the try-catch pattern, we might get an alert about an activity failure, but the pipeline would still show success. You would need to look at the status of the activities within the pipeline execution to see the failure of which you were alerted.

For many of my implementations, just setting an alert to notify me when any activity failure occurs is fine. For others, I really only care if the pipeline fails. Sometimes I need to set more specific alerts where I choose only certain activities to monitor for failure.

You could also use the Data Factory SDK to roll your own monitoring solution. If you write PowerShell, C#, or Python, you can retrieve the status of any pipeline or activity run and take subsequent actions based upon the results.

What This Means for Pipeline Design

You may need to add activities to your pipelines to support your monitoring scenarios if you need something more customized than what is offered from Azure Monitor and don’t want to use the SDK.

If you have notification needs that Azure Monitor can’t accommodate, you could add an activity in your pipelines to send an email based upon your desired activity outcomes. You can cause that activity to execute using an activity dependency alone, or by combining it with a variable and an If Condition activity.

There are times where we may need a pipeline to fail even though we are using the try-catch pattern that results in pipeline success. In that case, I add an additional web activity to the end of my pipeline failure path that hits an invalid url like http://throwanerror.  The failure of this activity will cause the pipeline to fail. Keep monitoring and notifications in mind as you design your pipelines so you are alerted as appropriate.

Azure Data Factory Activity and Pipeline Outcomes

To help clarify these concepts I made the below guide to Data Factory activity and pipeline outcomes. Feel free to share it with others. You can download it directly from this link. A text version that should be friendlier for screen readers can be found on this page.

Azure, Azure Data Factory, KQL, Microsoft Technologies

Retrieving Log Analytics Data with Data Factory

I’ve been working on a project where I use Azure Data Factory to retrieve data from the Azure Log Analytics API. The query language used by Log Analytics is Kusto Query Language (KQL). If you know T-SQL, a lot of the concepts translate to KQL. Here’s an example T-SQL query and what it might look like in KQL.

--T-SQL: 
SELECT * FROM dbo.AzureDiagnostics 
WHERE TimeGenerated BETWEEN '2020-12-15 AND '2020-12-16'
AND database_name_s = 'mydatabasename'
//KQL: 
AzureDiagnostics 
| where TimeGenerated between(datetime('2020-12-15') .. datetime('2020-12-16')) 
| where database_name_s == 'mydatabasename'

For this project, we have several Azure SQL Databases configured to send logs and metrics to a Log Analytics workspace. You can execute KQL queries against the workspace in the Log Analytics user interface in the Azure Portal, a notebook in Azure Data Studio, or directly through the API. The resulting format of the data downloaded from the API leaves something to be desired (it’s like someone shoved a CSV inside a JSON document), but it’s usable after a bit of parsing based upon column position. Just be sure your KQL query actually states the columns and their order (this can be done using the Project operator).

You can use an Azure Data Factory copy activity to retrieve the results of a KQL query and land them in an Azure Storage account. You must first execute a web activity to get a bearer token, which gives you the authorization to execute the query.

Data Factory pipeline containing a web activity to get a bearer token and a copy activity to copy data from the Log Analytics API.
Data Factory pipeline that retrieves data from the Log Analytics API.

I had to create an app registration in Azure Active Directory for the web activity to get the bearer token. The web activity should perform a POST to the following url (with your domain populated and without the quotes): "https://login.microsoftonline.com/[your domain]/oauth2/token"

Make sure you have added the appropriate header of Content-Type: application/x-www-form-urlencoded. The body should contain your service principal information and identify the resource as "resource=https://api.loganalytics.io". For more information about this step, see the API documentation.

Data Factory Copy Activity

The source of the copy activity uses the REST connector. The base url is set to "https://api.loganalytics.io/v1/workspaces/[workspace ID]/" (with your workspace ID populated and without the quotes). Authentication is set to Anonymous. Below is my source dataset for the copy activity. Notice that the relative url is set to “query”.

Connection properties of a dataset in Azure Data Factory. The base url points to https://api.loganalytics.io/v1/workspaces/[workspaceid] with the workspace ID not shown. The relative url contains the string "query".
ADF Dataset referencing a REST linked service pointing to the Log Analytics API

The Source properties of the copy activity should reference this REST dataset. The request method should be POST, and the KQL query should be placed in the request body (more on this below).

Two additional headers need to be added in the Source properties.

Additional Headers section of a Data Factory copy activity. Two headers are shown. 1) content-type: application/json; charset=utf-8 2) Authorization: @concat('Bearer ', activity('Get Bearer Token').output.access_token)
Additional headers in the Source properties of the ADF copy activity

The Authorization header should pass a string formatted as “Bearer [Auth Token]” (with a space between the string “Bearer” and the token). The example above retrieves the token from the web activity that executes before the copy activity in the pipeline. Make sure you are securing your inputs and outputs so your secrets and tokens are not being logged in Data Factory. This option is currently found on the General properties of each activity.

Embedding a KQL Query in the Copy Activity

You must pass the KQL query to the API as a JSON string. But this string is already inside the JSON created by Data Factory. Data Factory is a bit picky in how you enter the query. Here is an example of how to populate the request body in the copy activity.

{
"query": "AzureDiagnostics | where TimeGenerated between(datetime('2020-12-15') .. datetime('2020-12-16')) | where database_name_s == 'mydatabasename'" 
}

Note that the curly braces are on separate lines, but the query must be on one line. So where I had my query spread across 3 lines in the Log Analytics user interface as shown at the beginning of this post, I have to delete the line breaks for the query to work in Data Factory.

The other thing to note is that I am using single quotes to contain string literals. KQL supports either single or double quotes to encode string literals. But using double quotes in your KQL and then putting that inside the double quotes in the request body in ADF leads to errors and frustration (ask me how I know). So make it easy on yourself and use single quotes for any string literals in your KQL query.

In my project, we were looping through multiple databases for customized time frames, so my request body is dynamically populated. Below is a request body similar to what I use for my copy activity that retrieves Azure Metrics such as CPU percent and data storage percent. The values come from a lookup activity. In this case, the SQL stored procedure that is executed by the lookup puts the single quotes around the database name so it is returned as ‘mydatabasename’.

{
"query": "AzureMetrics | where TimeGenerated between (datetime(@{item().TimeStart}) .. datetime(@{item().TimeEnd})) | where Resource == @{item().DatabaseName} | project SourceSystem , TimeGenerated , Resource, ResourceGroup , ResourceProvider , SubscriptionId , MetricName , Total , Count , Maximum , Minimum , TimeGrain , UnitName , Type, ResourceId"
}

With dynamically populated queries like the above, string interpolation is your friend. Paul Andrew’s post on variable string interpolation in a REST API body helped me understand this and get my API request to produce the required results.

You can do similar things with Data Factory to query the Application Insights API. In fact, this blog post on the subject helped me figure out how to get the Log Analytics data I needed.

Be Aware of API Limits

There are limits to the frequency and amount of data you can pull from the Log Analytics API. As noted in the API documentation:

  • Queries cannot return more than 500,000 rows
  • Queries cannot return more than 64,000,000 bytes (~61 MiB total data)
  • Queries cannot run longer than 10 minutes (3 minutes by default)

If there is a risk that you may hit the limit on rows or bytes, you need to be aware that the Log Analytics API does not return an error in this case. It will return the results up to the limit and then note the “partial query failure” in the result set. As far as I can tell, there is no option for pagination, so you will need to adjust your query to keep it under the limits. My current process uses a Get Metadata activity after the copy activity to check file sizes for anything close to the limit and then breaks that query into smaller chunks and re-executes it.

It’s All in the Details

I had a lot of trial and error as I worked my way through populating the request body in the API call and dealing with API limits. I hope this helps you avoid some of the pitfalls.

Azure, Azure Data Factory, Microsoft Technologies, Power BI

Refreshing a Power BI Dataset in Azure Data Factory

I recently needed to ensure that a Power BI imported dataset would be refreshed after populating data in my data mart. I was already using Azure Data Factory to populate the data mart, so the most efficient thing to do was to call a pipeline at the end of my data load process to refresh the Power BI dataset.

Power BI offers REST APIs to programmatically refresh your data. For Data Factory to use them, you need to register an app (service principal) in AAD and give it the appropriate permissions in Power BI and to an Azure key vault.

I’m not the first to tackle this subject. Dave Ruijter has a great blog post with code and a step-by-step explanation of how to use Data Factory to refresh a Power BI dataset. I started with his code and added onto it. Before I jump into explaining my additions, let’s walk through the initial activities in the pipeline.

ADF pipeline that uses web activities to gets secrets from AKV, get an AAD auth token, and call the Power BI API to refresh a dataset. Then and Until activity and an If activity are executed.
Refresh Power BI Dataset Pipeline in Data Factory

Before you can use this pipeline, you must have:

  • an app registration in Azure AD with a secret
  • a key vault that contains the Tenant ID, Client ID of your app registration, and the secret from your app registration as separate secrets.
  • granted the data factory managed identity access to the keys in the key vault
  • allowed service principals to use the Power BI REST APIs in in the Power BI tenant settings
  • granted the service principal admin access to the workspace containing your dataset

For more information on these setup steps, read Dave’s post.

The pipeline contains several parameters that need to be populated for execution.

ADF pipeline parameters

The first seven parameters are related to the key vault. The last two are related to Power BI. You need to provide the name and version of each of the three secrets in the key vault. The KeyVaultDNSName should be https://mykeyvaultname.vault.azure.net/ (replace mykeyvaultname with the actual name of your key vault). You can get your Power BI workspace ID and dataset ID from the url when you navigate to your dataset settings.

The “Get TenantId from AKV” activity retrieves the tenant ID from the key vault. The “Get ClientId from AKV” retrieves the Client ID from the key vault. The “Get Secret from AKV” activity retrieves the app registration secret from the key vault. Once all three of these activities have completed, Data Factory executes the “Get AAD Token” activity, which retrieves an auth token so we can make a call to the Power BI API.

One thing to note is that this pipeline relies on a specified version of each key vault secret. If you always want to use the current version, you can delete the SecretVersion_TenantID, SecretVersion_SPClientID, and SecretVersion_SPSecret parameters. Then change the expression used in the URL property in each of the three web activities .

For example, the URL to get the tenant ID is currently:

@concat(pipeline().parameters.KeyVaultDNSName,'secrets/',pipeline().parameters.SecretName_TenantId,'/',pipeline().parameters.SecretVersion_TenantId,'?api-version=7.0')

To always refer to the current version, remove the slash and the reference to the SecretVersion_TenantID parameter so it looks like this:

@concat(pipeline().parameters.KeyVaultDNSName,'secrets/',pipeline().parameters.SecretName_TenantId,'?api-version=7.0')

The “Call Dataset Refresh” activity is where we make the call to the Power BI API. It is doing a POST to https://api.powerbi.com/v1.0/myorg/groups/{groupId}/datasets/{datasetId}/refreshes and passes the previously obtained auth token in the header.

This is where the original pipeline ends and my additions begin.

Getting the Refresh Status

When you call the Power BI API to execute the data refresh, it is an asynchronous call. This means that the ADF activity will show success if the call is made successfully rather than waiting for the refresh to complete successfully.

We have to add a polling pattern to periodically check on the status of the refresh until it is complete.

We start with an until activity. In the settings of the until loop, we set the expression so that the loop executes until the RefreshStatus variable is not equal to “Unknown”. (I added the RefreshStatus variable in my version of the pipeline with a default value of “Unknown”.) When a dataset is refreshing, “Unknown” is the status returned until it completes or fails.

ADF Until activity settings

Inside of the “Until Refresh Complete” activity are three inner activities.

ADF Until activity contents

The “Wait1” activity gives the dataset refresh a chance to execute before we check the status. I have it configured to 30 seconds, but you can change that to suit your needs. Next we get the status of the refresh.

This web activity does a GET to the same url we used to start the dataset refresh, but it adds a parameter on the end.

https://docs.microsoft.com/en-us/resGET https://api.powerbi.com/v1.0/myorg/groups/{groupId}/datasets/{datasetId}/refreshes?$top={$top}

The API doesn’t accept a request ID for the newly initiated refresh, so we get the last initiated refresh by setting top equal to 1 and assume that is the refresh for which we want the status.

The API provides a JSON response containing an array called value with a property called status.

In the “Set RefreshStatus” activity, we retrieve the status value from the previous activity and set the value of the RefreshStatus variable to that value.

Setting the value of the RefreshStatus variable in the ADF pipeline

We want the status value in the first object in the value array.

The until activity then checks the value of the RefreshStatus variable. If your dataset refresh is complete, it will have a status of “Completed”. If it failed, the status returned will be “Failed”.

The If activity checks the refresh status.

If activity expression in the ADF pipeline

If the refresh status is “Completed”, the pipeline execution is finished. If the pipeline activity isn’t “Completed”, then we can assume the refresh has failed. If the dataset refresh fails, we want the pipeline to fail.

There isn’t a built-in way to cause the pipeline to fail so we use a web activity to throw a bad request.

We do a POST to an invalid URL. This causes the activity to fail, which then causes the pipeline to fail.

Since this pipeline has no dependencies on datasets or linked services, you can just grab my code from GitHub and use it in your data factory.

Azure, Azure Data Factory, Logic Apps, Microsoft Technologies

Using Logic Apps in a Data Factory Execution Framework – Part 1

Data Factory allows parameterization in many parts of our solutions. We can parameterize things such as connection information in linked services as well as blob storage containers and files in datasets. We can also parameterize certain properties in activities. For instance, we can write an expression to determine the stored procedure to be executed in a Stored Procedure Activity or the filename in the sink (destination) of a Copy Activity.

But we cannot parameterize the invoked pipeline in an Execute Pipeline Activity. This means we need to find workarounds in order to have a metadata-driven execution framework. What I mean by metadata-driven execution framework is that data is stored in a datastore (in my case, a SQL Database) and used to determine what pipelines and activities get executed. With this type of framework, if I don’t want a specific pipeline to execute, I would just update my data in the datastore rather than delete the pipeline execution from the parent pipeline. We’ve been doing this type of development in SSIS for years, and Biml has played a big part in that. But SSIS allows us to parameterize the Execute Package Task.

Since we can’t implement this parameterized execution of pipelines natively, we need to look for something that Data Factory can call to accomplish the task. Paul Andrew has a nice framework that uses Azure Functions. I was working on a Data Factory solution for a client who doesn’t have C# or PowerShell developers on hand to help with the ELT process, so we needed to explore a low-code solution.

While there is no Logic App activity in Data Factory, we can use a Web Activity to call the Logic App. I might have a pipeline that looks something like what is pictured below.

Data Factory pipeline that uses a Stored Procedure to capture the start of the pipeline, a Lookup to get the list of files to be copied, a ForEach loop to copy each of the files, and a Stored Procedure to mark the end of the pipeline.
Staging pipeline that copies files from Azure Data Lake Storage to Azure SQL Database

Within the ForEach loop is a single Web Activity.

Data Factory Pipeline Web Activity calling a Logic App. An expression populates the url, and a Get m
Web Activity that calls a Logic App

I used some variables and parameters in an expression to populate the URL so it would be dynamic. I used a GET method in the call.

My initial version of my Logic App is shown below.

Logic App workflow with an HTTP request trigger. 1) Create a pipeline run. 2) Initialize Variable. 3) Until loop. 4) HTTP Response.
Logic App that executes a Data Factory pipeline and waits for it to complete before returning a response

I added path parameters in my HTTP request trigger to allow me to capture the information I need to execute the appropriate pipeline. For me this included the pipeline name, a data source ID, and a country. Your parameters would vary according to your requirements.

HTTP Request trigger in a logic app with 3 path parameters: pipeline, country, Data Source ID
HTTP Request trigger in my Logic App

Logic apps has an action called “Create a pipeline run”. You tell it which data factory, which pipeline, and any parameter values needed for the pipeline execution.

Create a pipeline run action in a logic app. Data Factory Pipeline Name is populated by a parameter. The pipeline parameters are populated by a mix of static JSON and parameters.
Create a pipeline run action in my Logic App

At this point in the workflow, our pipeline would be executing. But now we need to know when it has finished. That’s what the Initialize Variable and Until Loop actions are handling. I created a string variable called Pipeline Status and set the default value to “InProgress”. My Until loop action checks my pipeline execution status. If it’s still running, it waits 5 seconds, gets the new status, and assigns that status to the variable. This repeats until the pipeline execution is no longer in progress.

Here’s the expression I used to check whether the pipeline execution is still running:

@and(not(equals(variables('PipelineStatus'), 'InProgress')),
not(equals(variables('PipelineStatus'), 'Queued')))
Until loop in a logic app. Checks status of pipeline run. 1) Delay action. 2) Get a pipeline run. 3) Set variable.
Until loop in my Logic App to dynamically execute a Data Factory pipeline

Once the pipeline execution is complete, an HTTP response with the pipeline status is sent back to the caller.

HTTP Response action with status code 200 and pipeline status value in the body.
HTTP Response action in my Logic App

This is all great until you find out that Logic Apps will experience an HTTP timeout if the request takes more than 2 minutes.

Do you have any pipelines that take longer than two minutes to execute? If so, you need to change your solution to handle this. Note that you would have the same issue with Azure Functions, although it would give you 230 seconds instead of 120 seconds before it timed out. We need to switch to an asynchronous call to support long running pipelines. Paul has already done this in his framework using Azure Functions. In Logic Apps, we can change our response to an asynchronous response and then implement a polling pattern to check the status. We could alternatively implement a webhook action. I’ll write about updating the solution to handle long running pipelines in a future post.