In early 2022, Microsoft released a new activity in Azure Data Factory (ADF) called the Script activity. The Script activity allows you to execute one or more SQL statements and receive zero, one, or multiple result sets as the output. This is an advantage over the stored procedure activity that was already available in ADF, as the stored procedure activity doesn’t support using the result set returned from a query in a downstream activity.
However, when I went to find examples of how to reference those result sets, the documentation was lacking.
Similar to a Lookup activity, the Script activity can be used to populate the items in a ForEach activity, but the syntax is a bit different.
Let’s say we have a Script activity followed by a ForEach activity. The Script activity has a single result set.
A Script activity in Azure Data Factory with a ForEach activity
When I populate the items property of my ForEach activity, I use the following expression: @activity('SCR_ScriptActivity').output.resultSets[0].rows
It starts out similarly to how we reference output from a Lookup activity: I reference the activity and then its output. But then, instead of value, I use resultSets[0].rows.
This makes sense when you look at the output from the activity.
I want the output from the first (in this case, only) result set, so that’s resultSets[0]. The data returned is in the rows array in that result set. So that’s resultSets[0].rows.
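For reference, the output of a Script activity with a single result set looks roughly like the abbreviated JSON below. The column names in the rows are placeholders; your rows will contain whatever columns your query returns.
{
    "resultSetCount": 1,
    "resultSets": [
        {
            "rowCount": 2,
            "rows": [
                { "Id": 1, "TableName": "Customer" },
                { "Id": 2, "TableName": "Sales" }
            ]
        }
    ]
}
Inside the ForEach activity, each item() is then one object from that rows array.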
How are you liking the script activity in Data Factory? Is there anything else you wish were included in the documentation? Let me know in the comments.
I’ve seen a few people start Azure Data Factory (ADF) projects assuming that we would have one source control repo per environment, meaning that you would attach a Git repo to Dev, and another Git repo to Test and another to Prod.
This is not the recommended approach. As the Microsoft documentation on continuous integration and delivery in Azure Data Factory puts it: “Only the development factory is associated with a git repository. The test and production factories shouldn’t have a git repository associated with them and should only be updated via an Azure DevOps pipeline or via a Resource Management template.”
You’ll find the recommendation of only one repo connected to the development data factory echoed by other ADF practitioners, including myself.
We use source control largely to control code changes and for disaster recovery. I think the desire to use multiple repos is about disaster recovery more than anything. If something bad happens, you want to be able to access and re-deploy your code as quickly as possible. And since we start building in our repo-connected dev environment, some people feel “unprotected” in higher environments.
But Why Not Have All the Repos?
For me, there are two main reasons to have only one repository per solution tied only to the development data factory.
First, having multiple repos adds more complexity with very few benefits.
Deployment Complexity
Having a repo per environment adds extra work to deployment, and for most ADF solutions I see no deployment benefit in return. I won’t say never, but I have yet to encounter a situation where it would help.
When you deploy your data factory, whether you use ARM templates or individual JSON files, you are publishing to the live ADF service. This is what happens when you publish from your collaboration branch in source control to the live version in your development data factory. If you follow normal deployment patterns, you deploy from the main (if you use JSON files) or adf_publish (if you use the ARM template) branch in source control to the live service in Test (or whatever your next environment is). If your Test data factory is connected to a repo, you need to figure out how to get your code into that repo.
Would you copy your code to the other repo before you deploy to the service? What if something fails in your deployment process before deployment to the live service is complete? Would you want to revert your changes in the Git repo?
You could deploy to the live service first and skip that issue. But you still need to decide how to merge your code into the Test repo. You’ll need to handle any merge conflicts. And you’ll likely need to allow unrelated histories for the merge to work, so when you look back in your commit history, it probably won’t make sense.
Comparison of having one repo vs one repo per environment shows the need for extra steps if you add repos
At best, this Test repo becomes an additional place to store the code that is currently in Test. But if you are working through a healthy development process, you already had this code in your Dev repo. You might have even tagged it for release, so it’s easy to find. Your Git repo is likely already highly available, if it is cloud-based. In my mind, this just creates one more copy of your code that can get out of date, and one more deployment step. If you just want a copy of what is in Test or Prod for safe keeping, you can always export the resource’s ARM template. But if I were to do that, I would be inclined to keep it in blob storage or somewhere outside of a repo, since I already have the code in a repo. This would allow me to redeploy if my repo weren’t available.
Then, once you have sufficiently tested your data factory in Test, would you deploy code to Prod from the Test repo or from the Dev repo?
If you have the discipline and DevOps/automation capabilities to support these multiple repos, you likely don’t want to do this, unless you have requirements that mandate it. That brings me to my second reason.
Deviation from Common DevOps Practice
Having a repo per environment is a deviation from common software engineering practices. Most software engineering projects do not have separate repos per environment. They might have separate repos for different projects within a solution, but that is a different discussion.
If you have a separate repo for dev and test, what do you do about history? I think there is also a danger that people would want to make changes in higher environments instead of working through the normal development process because it seems more expedient at the time.
When you hire new data engineers or DevOps engineers (whoever maintains and deploys your data factories), you’ll have to explain this multi-repo process, as it won’t be what they are expecting.
Unless you have some special requirements that dictate this need, I don’t see a good reason to deviate from common practice.
Common Development Process
For a data factory project, we must define a collaboration branch, usually Main. This branch is the only branch that can publish to the live service in your Dev data factory. When you need to update your data factory, you make a (hopefully short-lived) feature branch based off of your collaboration branch. My preference for a medium to large project is to have the Main branch, an Integration branch, and one or more feature branches. The Integration branch brings multiple features together for testing before the final push to Main. On smaller projects with one or two experienced developers, you may not need the integration branch. I find that I like the integration branch when I am working with people who are new to ADF, as it gives me a chance to tweak and execute new pipelines before they get to Main.
Developers work in the feature branches and then merge into the integration branch as they see fit. They resolve any errors and make any final changes in integration and then create a pull request to get their code into Main. Once the code is merged into Main and published to the live service (either manually or programmatically), the feature branches and Integration branch are deleted, preparing you to start the next round of development. Triggering the pipelines in the live service after publishing gives you a more realistic idea of execution times as ForEach activities may not run in parallel when executed in debug mode.
The development starts by creating feature branches and an integration branch from Main. Code is merged into Integration by the developer. Code in integration is moved to Main via pull request. Code from Main is published to the live service.
The code in Main should represent a version of your data factory that is ready to be deployed to Test. Code is deployed from Dev to Test according to your preference—I won’t get into all the options of JSON files vs ARM templates and DevOps pipelines vs PowerShell/custom code in this post.
You perform unit testing, integration testing, and performance testing (and any other type of testing as well, but most people aren’t really doing these three in any sufficient manner) in your Test data factory. If everything looks good, you deploy to Production. If there are issues, you go back to your development data factory, make a new feature branch from Main, and fix the issue.
If you find a bug in production, and you can’t use the current version of code in Main, you might want to create a hotfix/QFE. Your hotfix is still created in your development data factory. The difference is that instead of creating a feature branch from Main, you create the branch from the last commit deployed to production. If you are deploying via ARM templates, you can export the ARM template from that hotfix branch and manually check it in to the adf_publish branch. If you deploy from JSON files, selective deployment is a bit easier. I like to use ADF Tools for deployment, which allows me to specify which files should be deployed, so I can do a special hotfix deployment that doesn’t change any other objects that may have already been updated in Main in anticipation of the next deployment.
In Summary
Having a repo per environment doesn’t technically break anything, but it adds complexity without significant benefits. It adds steps to your deployment process and deviates from industry standards. I won’t go so far as saying “never”, as I haven’t seen every project scenario. If you were considering going this route, I would encourage you to examine the reasons behind it and see if doing so would actually meet your needs and if your team can handle the added complexity.
I’ve been working on a project to populate an Operational Data Store using Azure Data Factory (ADF). We have been seeking to tune our pipelines so we can import data every 15 minutes. After tuning the queries and adding useful indexes to target databases, we turned our attention to the ADF activity durations and queue times.
Data Factory places the pipeline activities into a queue, where they wait until they can be executed. If your queue time is long, it can mean that the Integration Runtime on which the activity is executing is waiting on resources (CPU, memory, networking, or otherwise), or that you need to increase the concurrent job limit.
You can see queue time in the ADF Monitor by looking at the output of an activity.
Output from a stored procedure activity showing 2 seconds of queue time
But what if you want to see activity queue times across activities, across pipelines, or even across data factories? Then you need to output your logs to somewhere that makes this easier.
Send ADF Logs to Log Analytics
You can output your ADF logs to a storage account, to Log Analytics, to an event hub, or to a partner solution. I prefer Log Analytics because it’s easy to query and look for trends using KQL.
To configure the output to Log Analytics, you must create a Log Analytics workspace (if you don’t have an existing one) and add a diagnostic setting to the data factory resource. Once you have data feeding into Log Analytics, you can query it.
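If you define the diagnostic setting in an ARM template, it looks roughly like the sketch below. The resource names here are placeholders, and you should confirm the API version and the log categories you want against the current documentation; the important parts are the log categories and setting logAnalyticsDestinationType to Dedicated so you get the resource-specific tables.
{
    "type": "Microsoft.Insights/diagnosticSettings",
    "apiVersion": "2021-05-01-preview",
    "scope": "[resourceId('Microsoft.DataFactory/factories', 'adf-mydatafactory-dev')]",
    "name": "send-adf-logs-to-log-analytics",
    "properties": {
        "workspaceId": "[resourceId('Microsoft.OperationalInsights/workspaces', 'log-monitoring-dev')]",
        "logAnalyticsDestinationType": "Dedicated",
        "logs": [
            { "category": "ActivityRuns", "enabled": true },
            { "category": "PipelineRuns", "enabled": true },
            { "category": "TriggerRuns", "enabled": true }
        ],
        "metrics": [
            { "category": "AllMetrics", "enabled": true }
        ]
    }
}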
If you choose resource-specific destination tables in the diagnostic setting, you will find a table in Log Analytics called ADFActivityRun. This table contains a column called Output. The Output column contains the JSON we see in the ADF Studio Monitor app.
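The relevant piece of that Output JSON, with everything else omitted, looks something like this:
{
    "durationInQueue": {
        "integrationRuntimeQueue": 2
    }
}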
KQL has functions for parsing JSON and retrieving only the JSON objects I want to include. This means that I could write a query like the following.
ADFActivityRun
| extend queuetime = extractjson('$.durationInQueue.integrationRuntimeQueue', Output, typeof(int))
| where Status == 'Succeeded'
| where queuetime > 0
| project Start, End, PipelineName, ActivityType, ActivityName, dur = datetime_diff('second', End, Start), queuetime, PipelineRunId, ActivityRunId
| sort by queuetime desc
This query gives me a list of successful activity executions with queue times greater than zero. I can include any columns from the ADFActivityRun table, such as the pipeline and activity names, start and end times, activity types, and run IDs. Duration is not an available column, so I calculated it as the difference between the start and end times. The queue time is buried in the JSON in the Output column, so I used the extractjson function to pull out the duration-in-queue value.
Now that I know how to get the queue duration, I can look for trends across various slices of data. A query to get average queue time by activity type might look like the below.
ADFActivityRun
| where Status == 'Succeeded'
| where startofday(Start) == datetime(2022-01-04)
| extend queuetime = extractjson('$.durationInQueue.integrationRuntimeQueue', Output, typeof(int))
| summarize avg_queuetime = avg(queuetime) by ActivityType
| sort by avg_queuetime desc
In this query, I am retrieving activities with successful executions that started on January 4, 2022. I added my calculation to retrieve queue time. Then I calculated average queue time by activity type using the summarize operator and sorted the result descending by queue time.
I could have filtered or grouped on any other available column: pipeline name, activity name, integration runtime, resource ID, and more.
This information is available via the API as well, but a Log Analytics workspace can be spun up and running in minutes without having to write code to connect.
I was working on an Azure Data Factory project for a client who is new to ADF, and there was a miscommunication about the new Git Repo to be used for source control. Someone had created a new project and repo instead of using the existing one created for this purpose. This isn’t a big deal, as it’s easy enough to change in ADF Studio.
The Management Hub in Azure Data Factory Studio contains the Git configuration settings
In the Management Hub, you can change the Git configuration for the data factory, and there is a button near the top to disconnect the repo. You may need to do this if you run into conflicts in the publish branch or when you need to change repos.
It should be as easy as that, but I ran into a situation where it wasn’t.
Disconnect Button Unavailable
When I arrived at the Git configuration page, I found the Disconnect button to be disabled. This was confusing as I am an Owner and Data Factory Contributor on this resource.
I asked my client to disconnect the repo and moved on with the project, but I also logged some feedback for the Data Factory team. You can do this by selecting the Feedback button near the top right of the page in Azure Data Factory Studio. I have done this a few times and always received a prompt response.
You can send feedback from within Azure Data Factory Studio
I sent my feedback and received an email response within a couple of days. It contained a few helpful bits of information.
To disconnect git from the ADF management tab, you must be in git mode and be able to access the repository. If you are not able to access the repository but have permissions to update the factory, you can remove the git configuration via the REST API. Here is documentation on this API call: Factories – Configure Factory Repo – REST API (Azure Data Factory) | Microsoft Docs. In this case, repoConfiguration should be set to null in the request body, and the rest of the PUT body should be the same as the existing settings.
Alternatively, the git connection can be removed from the ADF management tab by either gaining access to the repository or having another ADF user with access remove the configuration. I also understand that this is not an ideal experience, so I have filed a work item to remove the requirement to have access to the repository.
This identified my situation exactly. I did not have access to the new repo that had been connected to the data factory. That caused me to be unable to disconnect it.
Hopefully the work item that removes the requirement for repo access before you can disconnect the data factory will be completed soon, but if you run into this issue, you can resolve it on your own using the API or a colleague with repo access.
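If you do go the API route, the request body for the Configure Factory Repo call looks roughly like the sketch below. The subscription, resource group, and factory names are placeholders, and you should confirm the exact URL, HTTP method, and API version against the documentation linked above; the key point is that repoConfiguration is set to null.
{
    "factoryResourceId": "/subscriptions/{subscription ID}/resourceGroups/{resource group}/providers/Microsoft.DataFactory/factories/{factory name}",
    "repoConfiguration": null
}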
I recently worked on a project where we needed to copy some large files from a specified library in SharePoint Online. In that library, there were several layers of folders and many different types of files. My goal was to copy files that had a certain file extension and a file name that started with a specific string.
I began by comparing the capabilities of Azure Data Factory and Logic Apps. Logic Apps allows me to start the process via a webhook. It also has a SharePoint Online connector that provides easy AAD authentication. And there is a special operation that allows me to retrieve the properties (path, filename) of files without retrieving the files themselves. This makes it easy to loop through a list of files that is already filtered to only the files I want.
Logic App that retrieves file metadata from SharePoint and then performs an action for each file in the list
Azure Data Factory does have a SharePoint Online connector, but you can’t use it to copy files. You must use the HTTP connector to copy files from SharePoint in Data Factory. And that requires you to make a web call first to get a bearer token to be used in the authentication for the source of the copy activity.
While the ADF options work, I thought I would go with the Logic App because it felt cleaner and easier for others to understand. That is until I tried to copy some large files and encountered the following error.
Http request failed as there is an error: 'Cannot write more bytes to the buffer than the configured maximum buffer size: 104857600.'.
It turned out there were some extremely large files to be copied from that SharePoint library. The SharePoint connector has a default buffer size limit of 100 MB, and the Get File Content action doesn’t natively support chunking.
At that point I would have to figure out how to handle the chunking using an HTTP call. That made the Logic App no better than Data Factory for my purposes. And since I already had part of the larger process in Data Factory, I went back there to see how it handled chunking.
Copying a Subset of Files from SharePoint Online Using Data Factory
And it turns out you don’t have to do anything special in Data Factory to handle the chunking of the large files – it does that work for you.
So I set up a similar process to what I had in my Logic App in my ADF pipeline.
ADF pipeline that retrieves a list of files in a SharePoint library and performs an action on each file
First, I used a Lookup activity to get the list of files. When you do a lookup against a SharePoint Online dataset, you can use OData filters to limit the data returned by the lookup. In my case this was necessary because I only wanted 144 of the items out of the list of 4300. Lookups can only return 5,000 rows, so I definitely needed to add filters to my query. Also, folders are considered an item in the data returned, so if you are looking only for files, you need to filter out folders (either by content type or by the name of the file).
Data Factory Lookup against a SharePoint Online list using an OData query with filters
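Your query will depend on how your library is set up, but as a rough sketch (the field names and filter functions here are assumptions; verify what your list supports), the OData query in the Lookup source might look something like:
$select=FileDirRef,FileLeafRef&$filter=FSObjType eq 0 and startswith(FileLeafRef,'Report')
Here FSObjType eq 0 keeps only files (folders have a value of 1), FileDirRef is the folder path, and FileLeafRef is the file name.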
The results of the lookup, returning one item per file, are fed into a For Each activity. Inside the For Each activity, I placed a Web activity to get the necessary bearer token and a Copy activity to copy the file to blob storage.
Inside the For Each activity is a Web activity to retrieve a bearer token, followed by a Copy activity that copies the files to blob storage.
To get a bearer token, you need a service principal with the correct permissions in SharePoint. Then you make a POST to https://accounts.accesscontrol.windows.net/{tenantID}/tokens/OAuth/2.
You need the following header: Content-Type: application/x-www-form-urlencoded.
In the body of the HTTP call, you must include: grant_type=client_credentials&client_id={clientid}@{tenantid}&client_secret={secretvalue}&resource=00000003-0000-0ff1-ce00-000000000000/{SharePoint tenant}.sharepoint.com@{tenantID}.
Make sure to set Secure Output so your credentials aren’t written to the ADF logs. Preferably, you would store the client ID and client secret in Key Vault and use a web activity to look them up before getting the token.
You can then use the output from the web activity in your copy activity.
Source settings for a Copy activity in ADF that copies files from SharePoint
The Additional Headers should be populated with the auth token as shown below:
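The header is just an Authorization header whose value is built from the output of the web activity. Assuming that web activity is named Get Bearer Token (a placeholder name), the expression would look something like:
Authorization: Bearer @{activity('Get Bearer Token').output.access_token}
Note the space between “Bearer” and the token.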
Since I don’t need to read the file contents, just copy the files to blob storage, I used a binary dataset. I parameterized the dataset and linked service so my For Each activity could feed it the path and name of each file. This is because the lookup provides the path separately from the file name (which includes the file extension).
Binary dataset using an HTTP linked service, used to retrieve files from SharePoint Online
The linked service uses the HTTP connector. It has two parameters added: path and name.
HTTP linked service used to retrieve files in a SharePoint Online library in Data Factory
If my file is located at https://mytenant.sharepoint.com/sites/site1/libraryname/folder1/folder2/folder3/myfile.CSV, the URL I need to retrieve the file is https://mytenant.sharepoint.com/sites/site1/_api/web/GetFileByServerRelativeUrl('/sites/site1/libraryname/folder1/folder2/folder3/myfile.CSV')/$value.
So once I add my destination info to the Copy Activity, I can copy all the files retrieved from my lookup to blob storage.
A Few Tricky Things
Everyone’s SharePoint tenant is set up differently. Your url will vary based upon the configuration of your site collections/sites. When I tried this in my tenant, I didn’t have to include the “site/sitename” before “/_api/web…”. But I did in my client’s tenant when I was building this.
We granted permissions to the service principal at the site collection level to get this to work. If there was a way to grant lower permissions, we couldn’t figure it out. The SharePoint admin granted the permissions using the standard app permission request XML, scoped to the site collection.
Your auth token is valid for 1 hour. If you copy a bunch of files sequentially, and it takes longer than that, you might get a timeout error. That is why I moved my web activity inside the For Each activity. If you know it’s only a few files and they are quick, you should just get one bearer token and use it for all the files.
Lastly, be sure to set retries and timeouts appropriately. Remember that the default timeout for an activity is 7 days, and that is probably not desirable.
It took a lot of trial and error for me to get this working, as I couldn’t find many blog posts or docs about this. Special thanks to Bill Fellows for helping me through it. I sincerely hope this blog post helps someone avoid my struggles.
Last week, Kerry and I delivered a webinar with tips on how to set up your Data Factory. We discussed version control, deployment, naming conventions, parameterization, documentation, and more.
Here’s our agenda from the presentation.
List of top regrets from Data Factory users that they wish they had understood from the beginning
If you missed the webinar, you can watch it online now. Just go to the DCAC website, fill in the required fields with your info, and the video will be shown.
If you’d like a copy of the slides, you can download the PDF here. There is a list of helpful links at the end that you may want to check out.
I hope you enjoyed our webinar. Leave me a comment if you have other experiences with ADF where a design or configuration choice you made in the beginning was difficult or tedious to fix later. Help other ADF developers avoid those mistakes.
Each resource type in Azure has a naming scope within which the resource name must be unique. For PaaS resources such as Azure SQL Server (the server for Azure SQL DB) and Azure Data Factory, the name must be globally unique within the resource type. This means that you can’t have two data factories with the same name, but you can have a data factory and a SQL server with the same name. Virtual machine names must be unique within the resource group. Azure Storage account names must be globally unique. Azure SQL Database names must be unique within the server.
Since Azure allows you to create a data factory and a SQL server with the same resource name, you may think this is fine. But you may want to avoid this, especially if you plan on using system-defined managed identities or using Azure PowerShell/CLI. And if you aren’t planning on using these things, you might want to reconsider.
I ran into this issue of resources with the same name in a client environment and then recreated it in my Azure subscription to better understand it.
I already had a data factory named adf-deploydemo-dev so I made an Azure SQL server named adf-deploydemo-dev and added a database with the same name.
A data factory, a SQL Database, and a SQL Server all with the same name in the same region and same resource group
Azure Data Factory should automatically create its system-assigned managed identity. It will use the resource name for the name of the service principal. When you go to create a linked service in Azure Data Factory Studio and choose to use Managed Identity as the authentication method, you will see the name and object ID of the managed identity.
Information shown in my data factory when creating a linked service for a storage account.
For the Azure SQL Server, we can create a managed identity using PowerShell. The Set-AzSqlServer cmdlet has an -AssignIdentity parameter, which creates the system-assigned managed identity.
Executing the PowerShell command to create a managed identity
If you use Get-AzSqlServer to retrieve the information and assign the Identity property to a variable, you can then see the system-assigned managed identity and its application ID.
Verifying the managed identity is in place for an Azure SQL server.
Now when I look in Active Directory, I can see both managed identities have the same name but different application IDs and object IDs.
Two managed service principals used for managed identities that have the same name but different IDs
Everything is technically working right now, but I have introduced some needless ambiguity that can cause misunderstandings and issues.
Let’s say that I want to grant the Storage Blob Data Reader role to my data factory. I go to the storage account, choose to add a role assignment, select the role, and then go to add members. This is what I see:
Which managed identity belongs to the data factory?
Or let’s say that I use PowerShell to get lists of resources by name. I may be locating resources to add tags, add a resource lock, or move the resource to another region or resource group.
Getting resources by name returns all three resources
If I don’t specify the resource type, I will get my data factory, my database, and my server in the results. You may be saying “Well, I would always specify the type.” Even if that is true, are you sure all coworkers and consultants touching your Azure resources would do the same?
Why introduce this ambiguity when there is no need to do so?
There are some good tips in the Cloud Adoption Framework in Microsoft Docs about naming conventions. Your organization probably wants to decide up front what names are acceptable and then use Azure Policy as well as good processes to ensure adherence to your defined conventions. And if I were the consultant advising you, I would suggest that resources within your tenant be unique across resource types. The suggestion in Docs is to use a resource type abbreviation at the beginning of your resource name. That would avoid the issue I have demonstrated above. Naming conventions should be adjusted to your organization’s needs, but the ones suggested in Docs are a good place to start if you need some help. It is beneficial to have some kind of resource naming convention beyond just whatever is allowed by Azure.
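As a quick illustration, using the resource type abbreviations suggested in the Cloud Adoption Framework (adjust to whatever convention your organization adopts), the resources in my example above might instead be named:
adf-deploydemo-dev (data factory)
sql-deploydemo-dev (logical SQL server)
sqldb-deploydemo-dev (SQL database)
With a convention like this, no two resources of different types end up with identical names, and the managed identities stay unambiguous.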
Control Flow activities in Data Factory involve orchestration of pipeline activities including chaining activities in a sequence, branching, defining parameters at the pipeline level, and passing arguments while invoking the pipeline. They also include custom-state passing and looping containers.
Control Flow activities in the Data Factory user interface
If you’ve been using Azure Data Factory for a while, you might have hit some limitations that don’t exist in tools like SSIS or Databricks. Knowing these limitations up front can help you design better pipelines, so I’m listing a few here of which you’ll want to be aware.
You cannot nest For Each activities. Within a pipeline, you cannot place a For Each activity inside of another For Each activity. If you need to iterate through two datasets, you have two main options. You can combine the two datasets before you iterate over them. Or you can use a parent/child pipeline design where you move the inner For Each activity into the child pipeline (see the Execute Pipeline sketch after this list). Fun fact: currently the Data Factory UI won’t stop you from nesting For Each activities; you won’t find out it isn’t allowed until you try to execute the pipeline.
You cannot put a For Each activity or Switch activity inside of an If activity. The Data Factory UI will prevent you from doing this by removing the For Each and Switch from the activity list. You can redesign the pipeline to put the inner activity inside a child pipeline. Also note that you can put an If activity inside of a For Each activity.
You cannot use a Set Variable activity inside a For Each activity that runs in parallel. The Data Factory UI won’t stop you, but you’ll quickly learn that the scope of the variable is the pipeline and not the For Each or any other activity. So you’ll just overwrite the value in no particular order as the activities execute in parallel. The workaround for this is specific to your use case. You might try using an existing attribute of the item you are iterating on instead of setting a variable. Append Variable works fine, since each loop could add a value. But again, don’t count on the order being meaningful.
You cannot nest If activities. The Data Factory UI will prevent you from nesting the If activities. If you need to have two sets of conditions, you can either combine conditions or move the inner condition to a child pipeline.
You cannot nest Switch activities. Similar to the If activity, the Data Factory UI will prevent you from nesting Switch activities. And again, you can either combine conditions or move the inner condition to a child pipeline.
You cannot put a For Each or If activity inside a Switch activity. The Data Factory UI will prevent you from doing this. You can move the inner activity to a child pipeline if needed.
You cannot use an expression to populate the pipeline in an Execute Pipeline activity. It would be great to design a truly dynamic pipeline where you could have a dataset that defines which pipelines to execute, but you can’t do that natively in the Data Factory UI. The Invoked Pipeline property doesn’t allow dynamic expressions. If you need to dynamically execute pipelines, you can use Logic Apps or Azure Functions to execute the pipeline.
You cannot dynamically populate the variable name in Set Variable and Append Variable activities. The Data Factory UI only allows you to choose from a list of existing variables. As a workaround, you could use an If activity to determine which variable you will populate.
The Lookup activity has a maximum of 5,000 rows and a maximum size of 4 MB. If you need to iterate over more than 5,000 rows, you’ll need to split your list between a parent and child pipeline.
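Several of the workarounds above rely on the parent/child pipeline pattern. As a rough sketch (the pipeline, activity, and parameter names are placeholders), the parent pipeline’s For Each activity would contain an Execute Pipeline activity that passes the current item to the child pipeline, something like:
{
    "name": "Execute Child Pipeline",
    "type": "ExecutePipeline",
    "typeProperties": {
        "pipeline": {
            "referenceName": "PL_Child_InnerLoop",
            "type": "PipelineReference"
        },
        "waitOnCompletion": true,
        "parameters": {
            "currentItem": {
                "value": "@item()",
                "type": "Expression"
            }
        }
    }
}
The child pipeline then defines a currentItem parameter and runs its own For Each (or If, or Switch) against it.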
In addition to the items mentioned above, also note these resource limits listed in Microsoft Docs. Limits like 40 activities per pipeline (including inner activities for containers) can bite you if you aren’t careful about implementing a modular design. And if you do have a modular design with lots of pipelines calling other pipelines, be aware that you are limited to 100 queued runs per pipeline and 1,000 concurrent pipeline activity runs per subscription per Azure Integration Runtime region. I don’t hit these limits too often, but I have hit them.
This is not to say you can’t create good solutions in Azure Data Factory—you absolutely can. But Data Factory has some limitations that you might not expect if you have experience working with other data integration/orchestration tools.
Have you hit any other limits that caused you to design your pipelines differently that you would like to share with others? If so, leave me a comment.
Question: When an activity in a Data Factory pipeline fails, does the entire pipeline fail? Answer: It depends
In Azure Data Factory, a pipeline is a logical grouping of activities that together perform a task. It is the unit of execution – you schedule and execute a pipeline. Activities in a pipeline define actions to perform on your data. Activities can be categorized as data movement, data transformation, or control activities.
In many instances, when an activity fails during a pipeline run, the pipeline run will report failure as well. But this is not always the case.
There are two main scenarios where an activity would report failure, but the pipeline would report success:
The maximum number of retry attempts is greater than 0, and the initial activity execution fails but a retry attempt succeeds
The failed activity has a failure path or a completion path to a subsequent activity and no success path
Retry Attempts
In the General settings of any activity is a property called Retry. This is the number of times Data Factory can try to execute the activity again if the initial execution fails. The default number of retries is 0. If we execute a pipeline containing one activity with the default Retry setting, the failure of the activity would cause the pipeline to fail.
Data Factory Activity General settings showing the Retry Property
I often set retries to a non-zero number in copy activities, lookups, and data flows in case there are transient issues that cause a failure which might not occur if we wait 30 seconds and try the activity again.
Output of a Data Factory activity that was executed and initially failed. Since it was set to have 1 retry, it executed again and succeeded. If nothing else in the pipeline failed, the pipeline would report success.
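In the underlying activity JSON, the Retry setting lives in the activity’s policy block, along with the timeout and the secure input/output flags. A sketch with one retry might look like this (the 30-second retry interval and 7-day timeout shown here are the defaults):
"policy": {
    "timeout": "7.00:00:00",
    "retry": 1,
    "retryIntervalInSeconds": 30,
    "secureOutput": false,
    "secureInput": false
}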
Dependency with a Failure Condition
Activities are linked together via dependencies. A dependency has a condition of one of the following: Succeeded, Failed, Skipped, or Completed. If we have a pipeline containing Activity1 and Activity2, and Activity2 has a success dependency on Activity1, it will only execute if Activity1 is successful. In this scenario, if Activity1 fails, the pipeline will fail.
Because Activity1 failed, Activity2 is not executed and the pipeline fails.
But if we have a pipeline with two activities where Activity2 has a failure dependency on Activity1, the pipeline will not fail just because Activity1 failed. If Activity1 fails and Activity2 succeeds, the pipeline will succeed. This scenario is treated as a try-catch block by Data Factory.
The failure dependency means this pipeline reports success.
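In the pipeline JSON, the failure path is just a dependency condition on Activity2. The relevant fragment of Activity2’s definition would look something like:
"dependsOn": [
    {
        "activity": "Activity1",
        "dependencyConditions": [ "Failed" ]
    }
]
Swap "Failed" for "Completed" and Activity2 runs whether Activity1 succeeds or fails.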
Now let’s say we have a pipeline with 3 activities, where Activity1 has a success path to Activity2 and a failure path to Activity3. If Activity1 fails and Activity3 succeeds, the pipeline will fail. The presence of the success path alongside the failure path changes the outcome reported by the pipeline, even though the activities that executed are the same as in the previous scenario.
Activity1 fails, Activity2 is skipped, and Activity3 succeeds. The pipeline reports failure.
What This Means for Monitoring
This difference between pipeline and activity status has a few implications of which we should be aware as we monitor our data factories.
If we are using Azure Monitor alerts, we need to understand that setting an alert for pipeline failures doesn’t catch all activity failures. If there is a retry of an activity and the second attempt is successful, there would be an activity failure but no pipeline failure.
Conversely, if we set an alert to notify us of activity failures, and we have a pipeline designed with the try-catch pattern, we might get an alert about an activity failure, but the pipeline would still show success. You would need to look at the status of the activities within the pipeline execution to see the failure of which you were alerted.
For many of my implementations, just setting an alert to notify me when any activity failure occurs is fine. For others, I really only care if the pipeline fails. Sometimes I need to set more specific alerts where I choose only certain activities to monitor for failure.
You could also use the Data Factory SDK to roll your own monitoring solution. If you write PowerShell, C#, or Python, you can retrieve the status of any pipeline or activity run and take subsequent actions based upon the results.
What This Means for Pipeline Design
You may need to add activities to your pipelines to support your monitoring scenarios if you need something more customized than what is offered from Azure Monitor and don’t want to use the SDK.
If you have notification needs that Azure Monitor can’t accommodate, you could add an activity in your pipelines to send an email based upon your desired activity outcomes. You can cause that activity to execute using an activity dependency alone, or by combining it with a variable and an If Condition activity.
There are times where we may need a pipeline to fail even though we are using the try-catch pattern that results in pipeline success. In that case, I add an additional web activity to the end of my pipeline failure path that hits an invalid url like http://throwanerror. The failure of this activity will cause the pipeline to fail. Keep monitoring and notifications in mind as you design your pipelines so you are alerted as appropriate.
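A sketch of that “force a failure” web activity is below. The activity names are placeholders, and the preceding activity on the failure path (here called Send Error Notification) is whatever error handling you already do; the only important parts are the invalid URL and the fact that this activity sits at the end of the failure path.
{
    "name": "Force Pipeline Failure",
    "type": "WebActivity",
    "dependsOn": [
        {
            "activity": "Send Error Notification",
            "dependencyConditions": [ "Succeeded" ]
        }
    ],
    "typeProperties": {
        "url": "http://throwanerror",
        "method": "GET"
    }
}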
Azure Data Factory Activity and Pipeline Outcomes
To help clarify these concepts I made the below guide to Data Factory activity and pipeline outcomes. Feel free to share it with others. You can download it directly from this link. A text version that should be friendlier for screen readers can be found on this page.
I’ve been working on a project where I use Azure Data Factory to retrieve data from the Azure Log Analytics API. The query language used by Log Analytics is Kusto Query Language (KQL). If you know T-SQL, a lot of the concepts translate to KQL. Here’s an example T-SQL query and what it might look like in KQL.
--T-SQL:
SELECT * FROM dbo.AzureDiagnostics
WHERE TimeGenerated BETWEEN '2020-12-15' AND '2020-12-16'
AND database_name_s = 'mydatabasename'
//KQL:
AzureDiagnostics
| where TimeGenerated between(datetime('2020-12-15') .. datetime('2020-12-16'))
| where database_name_s == 'mydatabasename'
For this project, we have several Azure SQL Databases configured to send logs and metrics to a Log Analytics workspace. You can execute KQL queries against the workspace in the Log Analytics user interface in the Azure Portal, a notebook in Azure Data Studio, or directly through the API. The resulting format of the data downloaded from the API leaves something to be desired (it’s like someone shoved a CSV inside a JSON document), but it’s usable after a bit of parsing based upon column position. Just be sure your KQL query actually states the columns and their order (this can be done using the Project operator).
You can use an Azure Data Factory copy activity to retrieve the results of a KQL query and land them in an Azure Storage account. You must first execute a web activity to get a bearer token, which gives you the authorization to execute the query.
Data Factory pipeline that retrieves data from the Log Analytics API.
I had to create an app registration in Azure Active Directory for the web activity to get the bearer token. The web activity should perform a POST to the following url (with your domain populated and without the quotes): "https://login.microsoftonline.com/[your domain]/oauth2/token"
Make sure you have added the appropriate header of Content-Type: application/x-www-form-urlencoded. The body should contain your service principal information and identify the resource as "resource=https://api.loganalytics.io". For more information about this step, see the API documentation.
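Assuming a client-credentials flow with your app registration (the client ID and secret here are placeholders you would supply, preferably retrieved from Key Vault), the request body would look something like:
grant_type=client_credentials&client_id={application (client) ID}&client_secret={client secret}&resource=https://api.loganalytics.io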
Data Factory Copy Activity
The source of the copy activity uses the REST connector. The base url is set to "https://api.loganalytics.io/v1/workspaces/[workspace ID]/" (with your workspace ID populated and without the quotes). Authentication is set to Anonymous. Below is my source dataset for the copy activity. Notice that the relative url is set to “query”.
ADF Dataset referencing a REST linked service pointing to the Log Analytics API
The Source properties of the copy activity should reference this REST dataset. The request method should be POST, and the KQL query should be placed in the request body (more on this below).
Two additional headers need to be added in the Source properties.
Additional headers in the Source properties of the ADF copy activity
The Authorization header should pass a string formatted as “Bearer [Auth Token]” (with a space between the string “Bearer” and the token). The example above retrieves the token from the web activity that executes before the copy activity in the pipeline. Make sure you are securing your inputs and outputs so your secrets and tokens are not being logged in Data Factory. This option is currently found on the General properties of each activity.
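Assuming the preceding web activity is named Get Log Analytics Token (a placeholder name), the Authorization header value would be an expression like:
Bearer @{activity('Get Log Analytics Token').output.access_token}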
Embedding a KQL Query in the Copy Activity
You must pass the KQL query to the API as a JSON string. But this string is already inside the JSON created by Data Factory. Data Factory is a bit picky in how you enter the query. Here is an example of how to populate the request body in the copy activity.
{
"query": "AzureDiagnostics | where TimeGenerated between(datetime('2020-12-15') .. datetime('2020-12-16')) | where database_name_s == 'mydatabasename'"
}
Note that the curly braces are on separate lines, but the query must be on one line. So where I had my query spread across 3 lines in the Log Analytics user interface as shown at the beginning of this post, I have to delete the line breaks for the query to work in Data Factory.
The other thing to note is that I am using single quotes to contain string literals. KQL supports either single or double quotes to encode string literals. But using double quotes in your KQL and then putting that inside the double quotes in the request body in ADF leads to errors and frustration (ask me how I know). So make it easy on yourself and use single quotes for any string literals in your KQL query.
In my project, we were looping through multiple databases for customized time frames, so my request body is dynamically populated. Below is a request body similar to what I use for my copy activity that retrieves Azure Metrics such as CPU percent and data storage percent. The values come from a lookup activity. In this case, the SQL stored procedure that is executed by the lookup puts the single quotes around the database name so it is returned as ‘mydatabasename’.
{
"query": "AzureMetrics | where TimeGenerated between (datetime(@{item().TimeStart}) .. datetime(@{item().TimeEnd})) | where Resource == @{item().DatabaseName} | project SourceSystem , TimeGenerated , Resource, ResourceGroup , ResourceProvider , SubscriptionId , MetricName , Total , Count , Maximum , Minimum , TimeGrain , UnitName , Type, ResourceId"
}
With dynamically populated queries like the above, string interpolation is your friend. Paul Andrew’s post on variable string interpolation in a REST API body helped me understand this and get my API request to produce the required results.
You can do similar things with Data Factory to query the Application Insights API. In fact, this blog post on the subject helped me figure out how to get the Log Analytics data I needed.
Be Aware of API Limits
There are limits to the frequency and amount of data you can pull from the Log Analytics API. As noted in the API documentation:
Queries cannot return more than 500,000 rows
Queries cannot return more than 64,000,000 bytes (~61 MiB total data)
Queries cannot run longer than 10 minutes (3 minutes by default)
If there is a risk that you may hit the limit on rows or bytes, you need to be aware that the Log Analytics API does not return an error in this case. It will return the results up to the limit and then note the “partial query failure” in the result set. As far as I can tell, there is no option for pagination, so you will need to adjust your query to keep it under the limits. My current process uses a Get Metadata activity after the copy activity to check file sizes for anything close to the limit and then breaks that query into smaller chunks and re-executes it.
It’s All in the Details
I had a lot of trial and error as I worked my way through populating the request body in the API call and dealing with API limits. I hope this helps you avoid some of the pitfalls.