October 2021 is mapping month over at Workout Wednesday for Power BI. As part of our challenges, we build a sample report and use the Publish to Web functionality to share it on the website. While this has worked well all year, there are some visuals, including maps, that do not support or require a different license for use with Publish to Web.
It’s frustrating to build a Power BI report that you plan to share, only to find that you can’t share it. So I thought it would be helpful to consolidate what I have found about the various map visuals and their support of Publish to Web.
Disclaimer: This information is correct as of October 14, 2021. This could change over time. This is not an exhaustive list of all the map visuals available for Power BI.
Power BI offers the ability for users who have access to a dataset in the Power BI service (PowerBI.com) to connect to the dataset using Excel. Normally, this feature is referred to as Analyze in Excel. Once you connect Excel to your dataset, you can create Pivot Table reports or use Cube Functions.
There are currently limitations that mean this functionality isn’t supported for B2B (external) users. An external user is an Azure AD user that is based in another tenant and has been guested into the local AAD tenant. If you go to your dataset in PowerBI.com. choose Analyze in Excel, and then try to open the downloaded file and connect to the dataset, you will be met with connection errors.
But if you have your dataset in a workspace backed by Premium Per User or Premium capacity, you can use the XMLA endpoint to connect, even if you are using a B2B user!
In Excel, locate the Get Data button. Select From Database and then From Analysis Services.
Open a browser window and go to the settings of the Power BI workspace that contains the dataset to which you want to connect.
If your workspace is backed by Premium capacity, you will be able to see this in the settings and the workspace connection will be available for you to copy. If you are a member user (not external) you could copy this info into the Server Name box of the Data Connection Wizard and go on your way.
If you are a B2B user, you need to make an adjustment as noted in Microsoft Docs. You need to replace “myorg” in the workspace connection with your primary domain name. If you have access to the Azure portal, you can find the primary domain name on the overview page for the Azure Active Directory.
So if the workspace connection from the Power BI service is: powerbi://api.powerbi.com/v1.0/myorg/Demo%20Reports
And your primary domain is: mysupercooldomain.com
Then you would change the workspace connection to:
Once you have populated the server name with the workspace connection string, change the logon credentials to “Use the following user name and password” but leave the credentials blank. Once you select the Next button, you will be prompted for your Azure credentials.
Then you will be able to select the desired dataset from the workspace and be on your way to making connected Excel reports.
Last week, Kerry and I delivered a webinar with tips on how to set up your Data Factory. We discussed version control, deployment, naming conventions, parameterization, documentation, and more.
Here’s our agenda from the presentation.
If you missed the webinar, you can watch it online now. Just go to the DCAC website, fill in the required fields with your info, and the video will be shown.
If you’d like a copy of the slides, you can download the PDF here. There is a list of helpful links at the end that you may want to check out.
I hope you enjoyed our webinar. Leave me a comment if you have other experiences with ADF where a design or configuration choice you made in the beginning was difficult or tedious to fix later. Help other ADF developers avoid those mistakes.
As I was browsing Twitter today, I noticed a tweet from the Colorado Department of Transportation about their anti-DUI campaign. Shown below, it contains a bar chart that appears to have been presented in PowerPoint.
There are some easy opportunities to improve the readability of this chart, so I thought I would use it as an example of how small improvements can have a big impact on a fairly simple chart. I recreated the chart (as best I could) in Power BI and then made two revised versions.
Especially when making data visualizations for the general public —and especially when you want to get people’s engagement on social media— you need to reduce perceived cognitive load. Otherwise, people won’t even bother to read your chart. If your chart feels too busy or too complicated, many people in your intended audience will feel it is not worth the effort to even try to read it and will move on down their Twitter feed to the next Anakin and Padme meme.
Each resource type in Azure has a naming scope within which the resource name must be unique. For PaaS resources such as Azure SQL Server (server for Azure SQL DB) and Azure Data Factory, the name must be globally unique within the resource type. This means that you can’t have two data factories with the same name, but you can have a data factory and a SQL server with the same name. Virtual machine names must be unique within the resource group. Azure Storage accounts must be globally unique. Azure SQL Databases should be unique within the server.
Since Azure allows you to create a data factory and a SQL server with the same resource name, you may think this is fine. But you may want to avoid this, especially if you plan on using system-defined managed identities or using Azure PowerShell/CLI. And if you aren’t planning on using these things, you might want to reconsider.
I ran into this issue of resources with the same name in a client environment and then recreated it in my Azure subscription to better understand it.
I already had a data factory named adf-deploydemo-dev so I made an Azure SQL server named adf-deploydemo-dev and added a database with the same name.
Azure Data Factory should automatically create its system-assigned managed identity. It will use the resource name for the name of the service principal. When you go to create a linked service in Azure Data Factory Studio and choose to use Managed Identity as the authentication method, you will see the name and object ID of the managed identity.
For the Azure SQL Server, we can create a managed identity using PowerShell. The Set-AzSqlServer cmdlet has an -AssignIdentity parameter, which creates the system-assigned managed identity.
If you use Get-AzSqlServer to retrieve the information and assign the Identity property to a variable, you can then see the system-assigned managed identity and its application ID.
Now when I look in Active Directory, I can see both managed identities have the same name but different application IDs and object IDs.
Everything is technically working right now, but I have introduced some needless ambiguity that can cause misunderstandings and issues.
Let’s say that I want to grant the Storage Blob Data Reader role to my data factory. I go to the storage account, choose to add a role assignment, select the role, and then go to add members. This is what I see:
Or let’s say that I use PowerShell to get lists of resources by name. I may be locating resources to add tags, add a resource lock, or move the resource to another region or resource group.
If I don’t specify the resource type, I will get my data factory, my database, and my server in the results. You may be saying “Well, I would always specify the type.” Even if that is true, are you sure all coworkers and consultants touching your Azure resources would do the same?
Why introduce this ambiguity when there is no need to do so?
There are some good tips in the Cloud Adoption Framework in Microsoft Docs about naming conventions. Your organization probably wants to decide up front what names are acceptable and then use Azure Policy as well as good processes to ensure adherence to your defined conventions. And if I were the consultant advising you, I would suggest that resources within your tenant be unique across resource types. The suggestion in Docs is to use a resource type abbreviation at the beginning of your resource name. That would avoid the issue I have demonstrated above. Naming conventions should be adjusted to your organization’s needs, but the ones suggested in Docs are a good place to start if you need some help. It is beneficial to have some kind of resource naming convention beyond just whatever is allowed by Azure.
In week 26 of Workout Wednesday for Power BI, I asked people to calculate the age of Nobel laureates at the time they received the award. I provided some logic, but I didn’t prescribe how to create the age calculation. This inspired a couple of questions and a round of data validation as calculating age may be trickier than you think. In this post, I’ll explore some of the ways people have calculated age in Power BI and the edge cases where those calculations may not work.
In my solution video for Workout Wednesday, I used Power Query to calculate age. This was inspired by several blog posts and videos I had seen previously. There is an Age menu option in the Power Query editor under Date.
When you select a date column and use that Age option, it calculates the duration between the selected date and the current date in days. You must then replace the current date with the second date column. Next you can choose Total Years under Duration, which divides the days by 365. Finally, you must round that number down to the next integer to get years.
If you follow Ruth’s video, you can do all of that in one step that creates a custom column with the final age value.
Age Quotient (DAX):
Age Quotient =
VAR Birthdate = [Date1]
VAR ThisDay = [Date2]
VAR IntBirthdate = YEAR ( Birthdate ) * 10000 + MONTH ( Birthdate ) * 100 + DAY ( Birthdate )
VAR IntThisDay = YEAR ( ThisDay ) * 10000 + MONTH ( ThisDay ) * 100 + DAY ( ThisDay )
VAR Age = QUOTIENT ( IntThisDay - IntBirthdate, 10000 )
VAR CheckedAge = DIVIDE ( Age, NOT ISBLANK ( Birthdate ) )
As Marco points out, many people were using YEARFRAC, but there is a bug in the DAX implementation that causes it to occasionally return an incorrect answer for this purpose.
Checking the Numbers
I created a Power BI file to demonstrate the differences in these four calculations. You can download the file here. The image below displays the results in several tests. For each row, I’m using Date1 as the birthdate and Date2 as the “as of” date. You’ll notice that I focused on leap years for a few cases.
There are six of ten date ranges that have different results across the different calculation methods.
In the second row, the Power Query age calculation says that Feb 29 to Feb 28 in the following year is a full year. This may or may not be what you want depending on your requirements. I’m noting the difference so you can be aware. A similar thing occurs in the fifth row going from Feb 29, 2016 to Feb 28, 2020, and again on the 9th row going from March 1, 2019 to Feb 29, 2020.
On the third row, notice that the DAX DATEDIFF function calculates Feb 29 to Feb 27 of the following year to be a full year, despite it being a day or two short. Depending on what you do with leap years, you might consider Feb 29 to Feb 28 in the following year to be a full year, but that third row result means DATEDIFF is probably not the calculation I want. We see a similar result going from March 1 to Feb 28 of the following year.
YEARFRAC calculates that Feb 29 to Feb 28 in the following year is not a full year, which may be desirable. But it counts Feb 29, 2016 to Feb 29, 2020 as only three years. And we see that March 1, 2000 to March 1, 2021 is only counted as 20 years. So even without starting on a leap year, we get some incorrect results. Small numbers seem to be correct until it gets to about 13 years.
Using the QUOTIENT function provides what I consider to be the most correct results. It calculates Feb 29 to Feb 28 of the following year to be less than a year. It calculates Feb 29, 2016 to Feb 28, 2016 to be three years and not four. And it calculates March 1 to Feb 29 of the following year to be less than a year.
Which to use?
The QUOTIENT formula produces the most accurate results if you don’t want Feb 29 to Feb 28 the next year to be counted as a year. DATEDIFF and YEARFRAC produce too many incorrect results for me to ever suggest using them. Since there is a DAX option that produces more correct answers, I would just go for QUOTIENT instead of either of these two.
UPDATE: There is a better alternative! Imke Feldmann reminded me that there is an Number.IntegerDivide function in Power Query. So let’s take the logic from Marco’s DAX calculation and move it to Power Query:
(BirthDate as date, EndDate as date) => let BirthDateInt = Date.Year(BirthDate)10000 + Date.Month(BirthDate)100 + Date.Day(BirthDate), EndDateInt = Date.Year(EndDate)10000 + Date.Month(EndDate)100 + Date.Day(EndDate), Age = Number.IntegerDivide((EndDateInt - BirthDateInt),10000) in Age
System-versioned temporal tables were introduced in SQL Server 2016. They provide information about data stored in the table at any point in time by storing an effective dated version of each row rather than only the data that is correct at the current time
You can alter a temporal table to add or change columns, but you must first turn off system versioning. Let’s look at an example.
CREATE TABLE [dbo].[DatabaseSize](
[DatabaseID] [varchar](200) NOT NULL
,[ServerName] [varchar](100) NOT NULL
,[DatabaseName] [varchar](100) NOT NULL
,[SizeBytes] [bigint] NULL
,[SizeMB] AS ([SizeBytes]/(1048576))
,[ValidFrom] [datetime2](7) GENERATED ALWAYS AS ROW START NOT NULL
,[ValidTo] [datetime2](7) GENERATED ALWAYS AS ROW END NOT NULL
,PERIOD FOR SYSTEM_TIME (ValidFrom, ValidTo)
,CONSTRAINT PK_DatabaseSize_DatabaseID PRIMARY KEY CLUSTERED (DatabaseID)
) WITH (SYSTEM_VERSIONING = ON (HISTORY_TABLE = [dbo].[DatabaseSizeHistory]));
Temporal tables must have a primary key defined. They also must contain two datetime2 columns, declared as GENERATED ALWAYS AS ROW START / END. The statement above creates both the current table and a history table.
The history table has the same schema as the current table, with one difference: the SizeMB column in the history table is not a computed column.
When I initially created the table, I typoed the formula in the computed column. You can’t alter a computed column — you must drop and recreate the column. This is no problem, just turn off system versioning and alter your table, and turn system versioning back on.
But if you try this without specifying your history table, you will find that it stops using the history table created earlier and makes a new history table.
If you specify your history table while turning system versioning back on, you will encounter an error:
Setting SYSTEM_VERSIONING to ON failed because column 'SizeMB' at ordinal 5 in history table 'Test.dbo.DatabaseSizeHistory' has a different name than the column 'ValidFrom' at the same ordinal in table 'Test.dbo.databasesize'.
Temporal tables match the columns between the current table and history table not only by name and data type but by the column’s ordinal position. Dropping and adding the computed column changed its order as it was added to the end of the table.
You can change the column order of a table in the SQL Server Management Studio UI by right-clicking on the table, selecting Design, and then dragging the column to the correct position. Note that you cannot do this on the system-versioned table while system versioning is on. You can either change the column order on the history table, or turn system versioning off and then change the current table.
Once the column orders match, you can turn system versioning back on and specify the the history table.
ALTER TABLE [dbo].[DatabaseSize]
SET (SYSTEM_VERSIONING = ON (HISTORY_TABLE = [dbo].[DatabaseSizeHistory]));
This time, the command should complete successfully. You’ll want to drop any unused auto-created history tables before you’re finished.
Control Flow activities in Data Factory involve orchestration of pipeline activities including chaining activities in a sequence, branching, defining parameters at the pipeline level, and passing arguments while invoking the pipeline. They also include custom-state passing and looping containers.
If you’ve been using Azure Data Factory for a while, you might have hit some limitations that don’t exist in tools like SSIS or Databricks. Knowing these limitations up front can help you design better pipelines, so I’m listing a few here of which you’ll want to be aware.
You cannot nest For Each activities. Within a pipeline, you cannot place a For Each activity inside of another For Each activity. If you need to iterate through two datasets you have two main options. You can combine the two datasets before you iterate over them. Or you can use a parent/child pipeline design where you move the inner For Each activity into the child pipeline. Fun fact: currently the Data Factory UI won’t stop you from nesting For Each activities. You won’t find out until you try to execute the pipeline.
You cannot put a For Each activity or Switch activity inside of an If activity. The Data Factory UI will prevent you from doing this by removing the For Each and Switch from the activity list. You can redesign the pipeline to put the inner activity inside a child pipeline. Also note that you can put an If activity inside of a For Each activity.
You cannot use a Set Variable activity inside a For Each activity that runs in parallel. The Data Factory UI won’t stop you, but you’ll quickly learn that the scope of the variable is the pipeline and not the For Each or any other activity. So you’ll just overwrite the value in no particular order as the activities execute in parallel. The workaround for this is specific to your use case. You might try using an existing attribute of the item you are iterating on instead of setting a variable. Append Variable works fine, since each loop could add a value. But again, don’t count on the order being meaningful.
You cannot nest If activities. The Data Factory UI will prevent you from nesting the If activities. If you need to have two sets of conditions, you can either combine conditions or move the inner condition to a child pipeline.
You cannot nest Switch activities. Similar to the If activity, the Data Factory UI will prevent you from nesting Switch activities. And again, you can either combine conditions or move the inner condition to a child pipeline.
You cannot put a For Each or If activity inside a Switch activity. The Data Factory UI will prevent you from doing this. You can move the inner activity to a child pipeline if needed.
You cannot use an expression to populate the pipeline in an Execute Pipeline activity. It would be great to design a truly dynamic pipeline where you could have a dataset that defines which pipelines to execute, but you can’t do that natively in the Data Factory UI. The Invoked Pipeline property doesn’t allow dynamic expressions. If you need to dynamically execute pipelines, you can use Logic Apps or Azure Functions to execute the pipeline.
You cannot dynamically populate the variable name in Set Variable and Append Variable activities. The Data Factory UI only allows you to choose from a list of existing variables. As a workaround, you could use an If activity to determine which variable you will populate.
The Lookup activity has a maximum of 5,000 rows and a maximum size of 4 MB. If you need to iterate over more than 5000 rows, you’ll need to split your list between a child and parent pipeline.
In addition to the items mentioned above, also note these resource limits listed in Microsoft Docs. Limits like 40 activities per pipeline (including inner activities for containers) can bite you if you aren’t careful about implementing a modular design. And if you do have a modular design with lots of pipelines calling other pipelines, be aware that you are limited to 100 queued runs per pipeline and 1,000 concurrent pipeline activity runs per subscription per Azure Integration Runtime region. I don’t hit these limits too often, but I have hit them.
This is not to say you can’t create good solutions in Azure Data Factory—you absolutely can. But Data Factory has some limitations that you might not expect if you have experience working with other data integration/orchestration tools.
Have you hit any other limits that caused you to design your pipelines differently that you would like to share with others? If so, leave me a comment.
I’ve been working on a project for the last few months with a client who has chosen to implement Dremio in Azure. Dremio is a data lake engine that creates a semantic layer and supports interactive queries.
It uses Apache Arrow, Gandiva, and Parquet files under the hood. It runs on either Linux VMs or Kubernetes containers. Like most big data systems, there is at least one coordinator node and one or more executor nodes. These nodes communicate and are managed using Apache Zookeeper. Client applications connect to Dremio via ODBC, JDBC, REST APIs, or Arrow Flight. Dremio can read from storage accounts, external databases, and a few other sources.
Dremio stores data in the following places:
Metadata is stored in a RocksDB database on the coordinator node.
Frequently read data is cached on the executor node.
Memory-intensive query operations may cause an executor node to spill Arrow buffers from RAM to disk.
Reflections, user uploads, and query results are stored in the data lake.
Dremio is organized into spaces, which can contain folders and datasets. The key objects in Dremio are:
Data source – connection strings to data that should be accessed via Dremio
Physical Dataset – an HDFS directory or a database table
Virtual Dataset – a view of sorts, created using the Dremio UI or by writing SQL, that references one or more physical or virtual datasets and also provides lineage to its sources
Reflection – a materialized view that is transparent to users and is used to improve query performance, which seems to be implemented as Dremio querying data from the source and storing it as a parquet file for quicker access.
Space – a shared location for virtual datasets, a way to group related datasets and provide user access
Once you have your spaces and virtual datasets set up, it feels kind of like a database. If you connect with Power BI, virtual datasets appear as views and physical datasets appear as tables. Dremio metadata (catalogs, schemas, physical datasets, virtual datasets and columns) can be accessed using INFORMATION_SCHEMA queries, which is conveniently familiar if you are used to working with SQL Server.
Some nice features found in Dremio on Azure
Dremio allows Single Sign-On with AAD credentials. Permissions can be granted to individual users or AAD groups.
Dremio can be implemented in a virtual network in Azure. The executor nodes can use Private Link to access ADLS (Azure Data Lake Storage Gen 2) over a private endpoint.
Changes to virtual datasets are tracked in Dremio. It’s easy to revert to a previous version at any time.
Dremio gives you visibility to the jobs running queries, both for ad hoc queries from client tools and for refreshing reflections.
Administrators can create rules to assign queries to different queues in order to provide workload isolation and predictability for users.
When reviewing jobs, you can see a sort of query plan as well as which jobs were able to use a reflection to accelerate a query.
The lineage view for a virtual dataset is nice for understanding dependencies.
You can trigger refreshes of metadata or reflections via the Rest API, which is handy if you have ETL processes adding new data to your data lake, and you want to refreshes to occur at the end of the ETL process.
Some rough edges on Dremio in Azure
Dremio was initially built for AWS, not Azure. This is evident in the training materials, the product roadmap, and the knowledge of the Dremio implementation specialists. This is not to say it doesn’t work on Azure, just that the implementation is a bit rougher (e.g., no Azure templates made for you), and a couple of features are unavailable.
Dremio doesn’t integrate with Azure Key Vault. You store the service principal secret or storage account access key in a configuration file on the Linux VM. I’ve been told this is on the roadmap, but I didn’t hear a date when it would be available.
You can enable integration points on the Dremio website where you can click a button to open a connection to a virtual dataset in a BI tool such as Power BI or Tableau. For Power BI, this downloads a PBIDS file with a connection to that specific virtual dataset. This would be fine if everything you need is in this one dataset, but if you need to reference multiple virtual datasets, this is a bit annoying. Think of it like connecting to a specific database table instead of to the database in general. You might want to use that table, but you might also want to find other useful tables to combine in your Power BI model. You can open Power BI and connect to Dremio in general and navigate from there with no problems. I’m just pointing out that the buttons in the UI don’t seem that useful.
Dremio doesn’t support passthrough authentication on ADLS. All queries to the data lake are made in the context of the Dremio application, not the individual user. This means that you may need to set permissions twice for your data lake if you have other tools directly accessing the data lake instead of using Dremio. The idea is that most tools will connect through Dremio to take advantage of the semantic layer. But it would be nice to have, just to simplify security.
Advice we received in training
Unlike with nesting views in SQL Server, it’s ok to create multiple layers of virtual datasets. You want to design the semantic layer (the virtual datasets) to reuse common logic instead of repeating it across multiple views.
The standard design pattern for the semantic layer is to have a layer of “staging views” that have a 1-to-1 mapping to physical datasets and very little transformation outside of fixing data types and light cleansing. On top of the Staging layer is the Business layer, which includes virtual datasets containing business logic. The Business layer should handle most of the query workload. On top of the Business layer is the Application Layer. This includes virtual datasets that are purpose-built to support specific applications or reports.
Star schemas are not optimal in Dremio. You likely want to denormalize even more than that. This is because it is more expensive to perform a join than to search through a large number of values in a column.
When creating a reflection, setting the sort column is somewhat like creating an index in a SQL database. It helps prune data when applying a query filter or performing a join.
Reflections can be used to partition data. If you find you have a single large file, you can use a Reflection to split it by a low cardinality value to improve query performance. When you do this, it creates a parquet file per partition.
Reflections can be set to use an incremental refresh, but only if the data is additive and existing data is not updated.
You don’t need a reflection for everything. Make them as small and reusable a possible.
Try to avoid thousands of tiny files, and aim for a few medium to large files (MBs to GBs). This is common for most data lake engines as there is an overhead cost for file enumeration.
Some other thoughts
Dremio advertises that you don’t need data integration processes like you would for a data warehouse. I find this to be somewhat inaccurate for two main reasons. First, if you need to acquire data from APIs or other applications to which Dremio can’t connect, you will still need to copy data to your data lake. Second, when you use a Reflection to speed up a query, you are creating a copy of the data in your data lake stored as one or more Parquet files. Data virtualization technology hasn’t actually matured to the point of not needing ETL at all. I can see how Dremio would lessen the need for ETL, but let’s recognize that you’ll probably still need some and that Dremio is doing a bit of data loading of it’s own. So the question becomes where — and with what tools — you would like to do this. You can have Dremio do your transforming and loading in the form of reflections, or you can load your own data already transformed to the data lake. You will likely end up with a bit of both over time.
Consider the skillsets of the people who will manage the system, as well as those who will build and query the datasets. If you have a team of admins who only know Windows, they are going to need to skill up on Linux. If your BI team or analysts don’t know SQL, they will probably struggle to build the virtual datasets.
This system can get pretty expensive pretty fast (which is true of most big data systems). You’ll want to be sure to automate the shutdown of the nodes in dev and test environments when they are not in use, so you can save a bit of money. And remember that you can size up your nodes later if you find you don’t have adequate performance. Oversizing at the outset will waste money.
Dremio is a (well-funded) startup with a product that is built on several open source technologies, and they don’t seem to have a public roadmap. In my experience, they have been good about taking feedback to add to the roadmap and with sharing what is soon to be released. But if you are building your company’s BI strategy with Dremio as a key tool, you probably want more than that. It sounds like they share more with paying customers. I would want that information before making a purchasing decision.
Overall, I can see why Dremio has been adopted by several large companies. And I have enjoyed setting up the Azure architecture around it and building virtual datasets. I wish they would add some Azure-specific features to optimize things and make security easy, but it’s a promising platform.
More Information about Dremio
If Dremio sounds interesting to you, here are a few helpful links
Question: When an activity in a Data Factory pipeline fails, does the entire pipeline fail? Answer: It depends
In Azure Data Factory, a pipeline is a logical grouping of activities that together perform a task. It is the unit of execution – you schedule and execute a pipeline. Activities in a pipeline define actions to perform on your data. Activities can be categorized as data movement, data transformation, or control activities.
In many instances, when an activity fails during a pipeline run, the pipeline run will report failure as well. But this is not always the case.
There are two main scenarios where an activity would report failure, but the pipeline would report success:
The maximum number of retry attempts is greater than 0, and the initial activity execution fails but the second attempt succeeds
The failed activity has a failure path or a completion path to a subsequent activity and no success path
In the General settings of any activity is a property called Retry. This is the number of times Data Factory can try to execute the activity again if the initial execution fails. The default number of retries is 0. If we execute a pipeline containing one activity with the default Retry setting, the failure of the activity would cause the pipeline to fail.
I often set retries to a non-zero number in copy activities, lookups, and data flows in case there are transient issues that would cause a failure that might not be present if we waited 30 seconds and tried the activity again.
Dependency with a Failure Condition
Activities are linked together via dependencies. A dependency has a condition of one of the following: Succeeded, Failed, Skipped, or Completed. If we have a pipeline containing Activity1 and Activity2, and Activity2 has a success dependency on Activity1, it will only execute if Activity1 is successful. In this scenario, if Activity1 fails, the pipeline will fail.
But if we have a pipeline with two activities where Activity2 has a failure dependency on Activity1, the pipeline will not fail just because Activity1 failed. If Activity1 fails and Activity2 succeeds, the pipeline will succeed. This scenario is treated as a try-catch block by Data Factory.
Now let’s say we have a pipeline with 3 activities, where Activity1 has a success path to Activity2 and a failure path to Activity3. If Activity1 fails and Activity3 succeeds, the pipeline will fail. The presence of the success path alongside the failure path changes the outcome reported by the pipeline, even though the activity executions from the pipeline are the same as the previous scenario.
What This Means for Monitoring
This difference between pipeline and activity status has a few implications of which we should be aware as we monitor our data factories.
If we are using Azure Monitor alerts, we need to understand that setting an alert for pipeline failures doesn’t catch all activity failures. If there is a retry of an activity and the second attempt is successful, there would be an activity failure but no pipeline failure.
Conversely, if we set an alert to notify us of activity failures, and we have a pipeline designed with the try-catch pattern, we might get an alert about an activity failure, but the pipeline would still show success. You would need to look at the status of the activities within the pipeline execution to see the failure of which you were alerted.
For many of my implementations, just setting an alert to notify me when any activity failure occurs is fine. For others, I really only care if the pipeline fails. Sometimes I need to set more specific alerts where I choose only certain activities to monitor for failure.
You could also use the Data Factory SDK to roll your own monitoring solution. If you write PowerShell, C#, or Python, you can retrieve the status of any pipeline or activity run and take subsequent actions based upon the results.
What This Means for Pipeline Design
You may need to add activities to your pipelines to support your monitoring scenarios if you need something more customized than what is offered from Azure Monitor and don’t want to use the SDK.
If you have notification needs that Azure Monitor can’t accommodate, you could add an activity in your pipelines to send an email based upon your desired activity outcomes. You can cause that activity to execute using an activity dependency alone, or by combining it with a variable and an If Condition activity.
There are times where we may need a pipeline to fail even though we are using the try-catch pattern that results in pipeline success. In that case, I add an additional web activity to the end of my pipeline failure path that hits an invalid url like http://throwanerror. The failure of this activity will cause the pipeline to fail. Keep monitoring and notifications in mind as you design your pipelines so you are alerted as appropriate.
Azure Data Factory Activity and Pipeline Outcomes
To help clarify these concepts I made the below guide to Data Factory activity and pipeline outcomes. Feel free to share it with others. You can download it directly from this link. A text version that should be friendlier for screen readers can be found on this page.