Data Factory can be a great tool for cloud and hybrid data integration. But since its inception, it has been less than straightforward to move data, meaning to copy it to another location and then delete the original copy.
It is a common practice to load data to blob storage or data lake storage before loading to a database, especially if your data is coming from outside of Azure. We often create a staging area in our data lakes to hold data until it has been loaded to its next destination. Then we delete the data in the staging area once our subsequent load is successful. But before February 2019, there was no Delete activity. We had to write an Azure Function or use a Logic App called by a Web Activity in order to delete a file. I imagine every person who started working with Data Factory had to go and look this up.
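To give a sense of the old workaround, here is a rough sketch of the kind of deletion helper people wrapped in an Azure Function and called from a Web Activity. This is illustrative only: it uses today’s azure-storage-blob Python package (not the SDK available at the time), and the connection string, container, and blob names are placeholders.

```python
# A rough sketch of the pre-Delete-activity workaround: delete a staged blob
# yourself, e.g., from an Azure Function invoked by a Data Factory Web Activity.
# Assumes the azure-storage-blob (v12) package; all names are placeholders.
from azure.storage.blob import BlobServiceClient

def delete_staged_blob(conn_str: str, container: str, blob_name: str) -> None:
    """Delete a single staged file after the downstream load succeeds."""
    service = BlobServiceClient.from_connection_string(conn_str)
    service.get_container_client(container).delete_blob(blob_name)

delete_staged_blob(
    conn_str="<storage-account-connection-string>",
    container="staging",
    blob_name="sales/2019/01/sales.csv",
)
```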
But now Data Factory V2 has a Delete activity.
How the Delete Activity Works
The Delete activity can delete from the following data stores:
Azure blob storage
ADLS Gen 1
ADLS Gen 2
You can delete files or folders. You can also specify whether you want to delete recursively (delete the specified folder’s contents, including all subfolders). If you want to delete files/folders from a file system on a private network (e.g., on premises), you will need to use a self-hosted integration runtime running version 3.14 or higher. Data Factory will need write access to your data store in order to perform the delete.
You can log the deleted file names as part of the Delete activity. It requires you to provide a blob storage or ADLS Gen 1 or 2 account as a place to write the logs.
You can parameterize the following properties in the Delete activity itself:
Delete file recursively
Max concurrent connections
Logging folder path
You can also parameterize your dataset as usual.
All that’s required in the Delete activity is an activity name and dataset. The other properties are optional. Just be sure you have specified the appropriate file path. Maybe try this out in dev before you accidentally delete your way through prod.
To delete all contents of a folder (including subfolders), specify the folder path in your dataset and leave the file name blank, then check the box for “Delete file recursively”.
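If you deploy pipelines from code rather than through the ADF authoring UI, the same settings can be expressed programmatically. Below is a minimal sketch assuming a recent version of the azure-mgmt-datafactory Python SDK; the dataset, linked service, resource group, and factory names are hypothetical, and the properties mirror the Delete activity settings described above (recursive deletion, max concurrent connections, and optional logging of deleted file names).

```python
# A minimal sketch of defining a Delete activity in code, assuming a recent
# azure-mgmt-datafactory SDK that accepts azure-identity credentials.
# All resource names below are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    DatasetReference,
    DeleteActivity,
    LinkedServiceReference,
    LogStorageSettings,
    PipelineResource,
)

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

delete_staging = DeleteActivity(
    name="DeleteStagingFiles",
    # The folder path (and optionally file name) lives in the dataset definition.
    dataset=DatasetReference(reference_name="StagingFolderDataset"),
    recursive=True,                # delete the folder contents, including subfolders
    max_concurrent_connections=1,
    enable_logging=True,           # log deleted file names...
    log_storage_settings=LogStorageSettings(  # ...to a blob storage linked service
        linked_service_name=LinkedServiceReference(reference_name="LogBlobStorage"),
        path="deletelogs/staging",
    ),
)

pipeline = PipelineResource(activities=[delete_staging])
adf_client.pipelines.create_or_update(
    "<resource-group>", "<factory-name>", "CleanUpStaging", pipeline
)
```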
As you build your Power BI reports, you may want to use maps and custom visuals. Have you thought about data privacy and what data is getting shared by those visuals? If you have sensitive data in your reports, you will probably want to look into this.
Most built-in visuals do not share data externally. But the default map visuals in Power BI need to share data with Bing Maps in order to geocode data points. Microsoft has documented that what is shared depends on the map type and the type of location data used.
For bubble maps, no data is sent to Bing if you are using only longitude and latitude. Otherwise, any data in the Location and filter buckets is sent to Bing.
For filled maps, data in the Location, Longitude, and Latitude buckets is shared with Bing.
For ArcGIS maps, Esri staff have said “Only the data needed to geocode the address (i.e., fields placed in the Location field well) are passed to Esri servers. These data are only used to generate the information used to place the locations on the map and they are not stored by Esri servers.”
Custom visuals are created by developers using the custom visuals SDK. There are 3 ways to deploy custom visuals for use by report builders:
Sharing a .pbiviz file
Adding to the organizational visuals tenant repository
Having users download visuals from the marketplace (AppSource)
When you receive and use a .pbiviz file, you are taking responsibility for assessing data security. When your Power BI admin deploys a custom visual to the organizational visuals repository, they are approving the visual for use inside your organization.
If you are using visuals from the marketplace, you will need to check the information provided about data privacy, and it’s not all that straightforward at the moment.
One thing that makes understanding data privacy in custom visuals easier is the designation of a certified custom visual. One of the requirements for certification is "Does not access external services or resources, including but not limited to, no HTTP/S or WebSocket requests go out of Power BI to any services."
You can find the list of currently certified custom visuals in the Microsoft documentation. Certified custom visuals are also identified in the marketplace by a blue star with a check mark.
Uncertified visuals are not necessarily less secure than certified visuals, but they have not been tested by Microsoft to confirm security. Any random person can create a custom visual, which is pretty cool and also potentially dangerous for data security.
Microsoft has tried to remind you of this in AppSource. On each visual that is not certified, you will see a notice to that effect.
This is helpful, but there are a couple of problems.
This information is at the bottom of the visual description. Once you select a visual from the list, you most likely need to scroll down to see this note.
This is generic, boilerplate language added by AppSource. They are basically saying that it is possible that the visual might send data over the internet. They are not telling you that it definitely does.
What Have We Learned?
Determining what data is sent externally by a custom visual is not simple. While many visuals are sandboxed and do not communicate externally, some of them do, and any uncertified custom visual might.
In case you aren’t familiar, I would like to introduce you to the violin plot.
A violin plot is a nifty chart that shows both the distribution and density of data. It’s essentially a box plot with a density plot on each side. Box plots are a common way to show variation in data, but their limitation is that you can’t see the frequency of values. In other words, you can see statistics such as min, max, median, mean, or quartiles, but you can’t see the individual values or how often they occurred.
The violin plot overcomes this limitation (by adding the density plot) without taking up much more room on the canvas.
In Power BI, you can quickly make a violin plot using a custom visual. The Violin Plot custom visual (created by Daniel Marsh-Patrick) has many useful formatting options. For instance, you can choose to turn off the box plot and just show the density plot, or you can trade the box plot for a barcode plot.
Formatting the Violin Plot
There are several sections of formatting for this visual. I’ll call out a few important options here. First, the Violin Options allow you to change the following settings related to the density plot portion of the violin plot.
Inner padding controls the space between each violin. Stroke width changes the width of the outline of the density plot. The sampling resolution controls the detail in the outline of the density plot. Check out Wikipedia to learn more about the kernel density estimation options.
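If you want intuition for what those kernel density settings are doing, the sketch below approximates one violin outline in Python. This is purely illustrative (the custom visual computes its own density inside Power BI), using scipy’s gaussian_kde on made-up temperature data.

```python
# Illustrative only: approximate the outline of one "violin" with a kernel
# density estimate. Requires numpy and scipy; the data here is made up.
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(42)
temps = rng.normal(loc=55, scale=12, size=365)  # e.g., a year of daily temperatures

# The bandwidth plays the role of the visual's kernel options: smaller values
# trace more detail in the outline, larger values smooth it out.
kde = gaussian_kde(temps, bw_method=0.3)

# More evaluation points = higher "sampling resolution" for the outline.
xs = np.linspace(temps.min(), temps.max(), 200)
density = kde(xs)

# Mirroring the density around a center line produces the violin's
# symmetric left and right halves.
left, right = -density, density
```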
The Sorting section allows you to choose the display order of the plots. In the example above, the sort order is set to sort by category. You can then choose whether the items should be sorted ascending or descending.
Next you can choose the color and transparency of your density plot. You have the ability to choose a different color for each plot (violin), but please don’t unless you have a good reason to do so.
The Combo Plot section controls the look of the barcode plot or box plot. Inner padding determines the width of the plot. Stroke width controls the width of the individual lines in the barcode plot, or the outline and whiskers in the box plot. You can change the box or bar color in this section. For the barcode plot, you can choose whether to show the first and third quartiles and the median, as well as the color, line thickness, and line style of their markers.
Also make sure to check out the Tooltip section. It allows you to show various statistics in the tooltip without having to calculate them in DAX or show them elsewhere in the visual.
Violin Plot Custom Visual Issues & Limitations
This is a well-designed custom visual, but there are a couple of small things I hope will be enhanced in the future.
The mean and standard deviation in the tooltip are not rounded to a reasonable number of digits after the decimal.
The visual does not seem to respond to the Show Data keyboard command that places data in a screen reader friendly table.
As always, make sure to read the fine print about what each custom visual is allowed to do. Make sure you understand the permissions you are granting and that you and your organization are ok with them. For example, I used public weather data in my violin plot, so I had no concerns about sending the data over the internet. I would be more cautious if I were dealing with something more sensitive like patient data in a hospital.
Introducing the Violin Plot to Your Users
I think violin plots (especially the flavor with the barcode plot) are fairly easy to read once you have seen one, but many people may not be familiar with them. In my weather example above, I made an extra legend to help explain what the various colors of lines mean.
Another thing you might consider is adding an explainer on how to read the chart. I used a violin plot with a coworker who does not nerd out on data viz to show query costs from queries executed in SQL Server, and I added an image that explains how to read the chart.
After all, we use data visualization to analyze and present data effectively. If our users don’t understand it, we aren’t doing our job well.
Have you used the violin plot in Power BI? Leave me a comment about what kind of data you used it with and how you liked the resulting visual.
It’s not always obvious when you need a data gateway in Azure, and not all gateways are labeled as such. So I thought I would walk through various applications that act as a data gateway and discuss when, where, and how many are needed.
Note: I’m ignoring VPN gateways and application gateways for the rest of this post. You could set up a VPN gateway to get rid of some of the data gateways, but I’m assuming your networking/VPN situation is fixed at this point and working from there. This is a common case in my experience, as many organizations are not ready to jump into ExpressRoute when first planning their BI architecture.
Let’s start with what services may require you to use a data gateway.
You will need a data gateway when you are using Power BI, Azure Analysis Services, PowerApps, Microsoft Flow, Azure Logic Apps, Azure Data Factory, or Azure ML with a data source/destination that is in a private network that isn’t connected to your Azure subscription with a VPN gateway. Note that a private network includes on-premises data sources and Azure Virtual Machines as well as Azure SQL Databases and Azure SQL Data Warehouses that require use of VNet service endpoints rather than public endpoints.
On-Premises Data Gateway
If you are using Power BI, Azure Analysis Services, PowerApps, Microsoft Flow, or Azure Logic Apps and you have a data source in a private network, you need at least one gateway. But there are a few considerations that might cause you to set up more gateways.
Your services must be in the same region to use the same gateway. This means that your Power BI/Office 365 region and Azure region for your Azure Analysis Services resource must match for them to all use one gateway. If you have resources in different regions, you will need one gateway per region.
You may want high availability for your gateway. You can create high availability clusters so when one gateway is down, traffic is rerouted to another available gateway in the cluster.
You may want to segment traffic to ensure the necessary resources for certain ad hoc live/direct queries or scheduled refreshes. If your usage and refresh patterns warrant it, you may want to set up one gateway for scheduled refreshes and one gateway for live/direct queries back to any on-premises data sources. Or you might make sure live/direct queries for two different high-traffic models go through different gateways so as not to block each other. This isn’t always warranted, but it can be a good strategy.
Data Factory Self-hosted Integration Runtime
If you are using Azure Data Factory (V1 or V2) or Azure ML with a data source in a private network, you will need at least one gateway. But that gateway is called a Self-hosted Integration Runtime (IR).
Self-hosted IRs can be shared across data factories in the same Azure Active Directory tenant. They can be associated with up to four machines to scale out or provide higher availability. So while you may only need one node, you might want a second so that your IR is not the single point of failure.
Or you may want multiple IRs to boost throughput of copy activities. For instance, copying from an on-premises file server with one IR node is about 195 Megabytes per second (MB/s). But with 4 IR nodes, it can be as fast as 505 MB/s.
Factors that Affect the Number of Data Gateways Needed
The main factors determining the number of gateways you need are:
Number of data sources in private networks (including Azure VNets)
Location of services in Azure and O365 (number of regions and tenants)
Desire for high availability
Desire for increased throughput or segmented traffic
If you are importing your data to Azure and using an Azure SQL DB with no VNet as the source for your Power BI model, you won’t need an On-Premises Data Gateway. If you used Data Factory to copy your data from an on-premises SQL Server to Azure Data Lake and then Azure SQL DB, you need a Self-hosted Integration Runtime.
If all your source data is already in Azure, and your source for Power BI or Azure Analysis Services is Azure SQL DW on a VNet, you will need at least one On-Premises Data Gateway.
If you import a lot of data to Azure every day using Data Factory, and you land that data to Azure SQL DW on a VNet, then use Azure Analysis Services as the data source for Power BI reports, you might want a self-hosted integration runtime with a few nodes and a couple of on-premises gateways clustered for high availability.
Have a Plan For Your Gateways
The gateways/integration runtimes are not hard to install. They are just often not considered, and projects get stalled waiting until a machine is provisioned to install them on. And many people forget to plan for high availability in their gateways. Make sure you have the right number of gateways and IR nodes to get your desired features and connectivity. You can add gateways/nodes later, but you don’t want to get caught with no high availability when it really matters.
I’m honored to have one of my PASS Summit sessions chosen to be part of the PASS Data Expert Series on February 7. PASS has curated the top-rated, most impactful sessions from PASS Summit 2018 for a day of solutions and best practices to help keep you at the top of your field. There are three tracks: Analytics, Data Management, and Architecture. My session is in the Analytics track along with some other great sessions from Alberto Ferrari, Jen Underwood, Carlos Bossy, Matt How, and Richard Campbell.
The video for my session, titled “Do Your Data Visualizations Need a Makeover?”, starts at 16:00 UTC (9 AM MT). I’ll be online in the webinar for live Q&A and chat related to the session.
I hope you’ll register and chat with me about data visualizations in need of a makeover on February 7.
I started this blog in 2013 during my first consulting job. In 2014, I joined BlueGranite, which is where I have been for the last 4 years. The knowledge gained while working with them has been the inspiration for a lot of the content on this blog. It has truly been a pleasure to work at BlueGranite. Work is fun when your leadership team is trustworthy and transparent and open to feedback, and your coworkers are smart and motivated and want to help you learn. I can’t express how much I appreciate the professional relationships and friendships formed during my time with them.
But an interesting opportunity presented itself to me, and I have decided to take it.
Today is my first day working for Denny Cherry & Associates! I’m excited to be working with Denny (B|T), Joey (B|T), Kerry (B|T), Monica (B|T), John (B|T), and Peter (T). I’ve known the consultants at Denny Cherry & Associates for years through the SQL community. My chats with them at conferences and on Twitter demonstrated their incredible SQL and Azure knowledge, and I can’t wait to learn from them. I might even teach them a thing or two about BI/analytics. I feel very fortunate to work for a company that values and encourages participation in the technical community. Five of us are Microsoft MVPs (including me), and all six consultants are speakers and bloggers. DCAC does a healthy mix of implementations, migrations, health checks, support, and training. They have completed many interesting and impactful Azure projects, which was a big draw for me.
Otherwise, things should be business as usual for me: work from home with the bulldog and a little travel as needed. I’ll continue to blog here in my free time with new coworkers and projects to serve as my inspiration.
When we talk about Azure architectures for data warehousing or analytics, we usually show a diagram like the one below.
This diagram is a great start to explain what services will be used in Azure to build out a solution or platform. But many times, we add the specific resource names and stop there. If you have built several projects in Azure, you will know there are some other things for which you will need to plan. So what’s missing?
Azure Active Directory
Let’s start with Azure Active Directory (AAD). In order to provision the resources in the diagram, your Azure subscription must already be associated with an Active Directory. AAD is Microsoft’s cloud-based identity and access management service. Members of an organization have a user account that can sign in to various services. AAD is used to access Office 365, Power BI, and Dynamics 365, as well as the Azure portal. It can also be used to grant access and permissions to specific Azure resources.
For instance, users who will participate in Data Factory pipeline development must be assigned to the Data Factory Contributor role in Azure. Users can authenticate via AAD to log in to Azure SQL DW. You can use AD Groups to grant permissions to users interacting with various Azure resources such as Azure SQL DW or SQL DB as well as Azure Analysis Services and Power BI. It’s more efficient to manage permissions for groups than to do it for each user. You’ll need a plan to manage group membership.
You may also need to register applications in Azure Active Directory to allow Azure services to authenticate and interact with each other. While the guidance for many services is now to use managed identities, this may not be available for every situation just yet.
Virtual Networks (or VNets) allow many types of Azure resources to securely communicate with each other, the internet, and on-premises networks. You can have multiple virtual networks in an Azure subscription. Each virtual network is isolated from other VNets, unless you set up VNet peering.
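As a sketch of what standing up a VNet looks like when automated, the snippet below uses the azure-mgmt-network Python package (track 2 SDK); the resource group, VNet name, and address space are hypothetical.

```python
# A minimal sketch of creating a VNet with one subnet, assuming the
# azure-identity and azure-mgmt-network packages. Names are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.network import NetworkManagementClient

network_client = NetworkManagementClient(DefaultAzureCredential(), "<subscription-id>")

poller = network_client.virtual_networks.begin_create_or_update(
    "bi-prod-rg",
    "bi-vnet",
    {
        "location": "eastus",
        "address_space": {"address_prefixes": ["10.0.0.0/16"]},
        "subnets": [{"name": "data", "address_prefix": "10.0.1.0/24"}],
    },
)
vnet = poller.result()  # block until provisioning completes
```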
Some PaaS services such as Azure Storage Accounts, Azure SQL DW, and Azure Analysis Services support Virtual Network Service Endpoints. A common usage scenario is to set up VNets and VNet Service Endpoints to connect resources to on-premises networks. Some organizations prefer VNet Service Endpoints over public service endpoints so that a resource can only be accessed from within the organization’s local network.
In order to connect a VNet to an on-premises network or another VNet (outside of peering), you’ll need a VPN Gateway. You’ll need to identify the most appropriate type of connection: Point-to-Site, Site-to-Site, or ExpressRoute. These offerings differ based on bandwidth, protocols supported, routing, connection resiliency, and SLAs. Pricing can vary greatly based upon your gateway type.
While VNets and VPN Gateways are probably the most common networking resources in Azure, there are many other networking services and related design decisions to consider as you plan an Azure deployment.
Your BI solution may be entirely in Azure, but if you need to retrieve data from data sources in a private network (on premises or on a VNet), you’ll need a gateway. If you are using Azure Data Factory, you’ll need a Self-hosted Integration Runtime (IR). If the source for your Power BI or Azure Analysis Services model is on a private network, you’ll need an On-Premises Data Gateway. You can use the same gateway to connect to Analysis Services, Power BI, Logic Apps, PowerApps, and Flow. If you will have a mix of Analysis Services and Power BI models sharing the same gateway, you’ll need to make sure that your Power BI region and your AAS region match.
These gateways require that you have a machine inside the private network on which to install them. And if you want to scale out or have failover capabilities, you may need multiple gateways and multiple VMs. So while you may be building a solution in Azure, you might end up with a few on-premises VMs to be able to securely move source data.
Dev and Test Environments
Our nice and tidy diagram above is only showing production. We also need at least a development environment and maybe one or more test environments. You’ll need to decide how to design your dev/test environments. You may want duplicate resources in a separate resource group; e.g., a dev resource group that contains a dev Data Factory, a dev Azure SQL DW, a dev Azure Analysis Services, etc. While separating environments by resource group is common, it’s not your only option. You will need to decide if you prefer to separate environments by resource group, subscription, directory, or some combination of the three.
You’ll also want to investigate ways to keep the costs of non-prod environments down via smaller-sized resources or pausing or deleting resources where applicable.
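As a sketch of the resource-group-per-environment approach, the snippet below creates one tagged resource group per environment with the azure-mgmt-resource Python package; the names, location, and tags are hypothetical.

```python
# A minimal sketch of separating environments by resource group, assuming the
# azure-identity and azure-mgmt-resource packages. Names are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.resource import ResourceManagementClient

client = ResourceManagementClient(DefaultAzureCredential(), "<subscription-id>")

for env in ["dev", "test", "prod"]:
    client.resource_groups.create_or_update(
        f"bi-{env}-rg",
        {
            "location": "eastus",
            # Tags make it easier to report on cost per environment later.
            "tags": {"environment": env, "workload": "bi-platform"},
        },
    )
```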
Plan For, Don’t Worry About, the Extras
There are several other ancillary Azure services that could/should be part of your solution.
For source control, GitHub and Azure Repos have the easiest integration, especially with Azure Data Factory. You’ll not only want source control for things like database projects and Data Factory pipelines, but also possibly for ARM templates and PowerShell scripts used to deploy resources (think: infrastructure as code).
You’ll want to set up monitoring in Azure, including alerts to let you know when processes and services are not running as intended.
If you want more cost management support than just setting a spending limit on a subscription (if it is available for your subscription type), it may be helpful to set budgets in Azure so you can be notified when you hit certain percentages or amounts.
Be sure to think through the entire data/solution management lifecycle. You may want to include extra services for cataloging, governing, and archiving your data.
This may sound like a complex list, but these resources and services are fairly easy to work with. Azure Active Directory has a user-friendly GUI in the portal, and commands can be automated with PowerShell, requiring relatively little effort to manage users and groups to support your BI environment. VNets and VPN Gateways are a little more complex, but there are step-by-step instructions available for many tasks in the Microsoft Docs. The Power BI Gateway and ADF IR have quick and easy GUI installers that just take a few minutes. You can automate Azure deployments with Azure Pipelines or PowerShell scripts.
None of these things are really that awful to implement or manage in Azure, unless you weren’t aware of them and your project is paused until you can get them set up.
Is there anything else you find is commonly left out when planning for BI solutions in Azure? Leave me a comment if you would like to add to the list.
Update (1/26/19): Helpful readers have commented on other aspects of your Azure BI architecture they felt were often overlooked:
Make sure you have a plan for how to process your Azure Analysis Services model. Will you use Azure Automation? Call an Azure Function from Data Factory?
Be sure to organize and tag your resources appropriately to help you understand and control costs.
Don’t forget Azure Key Vault. This will help you keep keys and passwords secure (see the sketch after this list).
(Thanks to Chad Toney, Shannon Holck, and Santiago Cepas for these suggestions.)
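On the Key Vault point above, retrieving a secret at deployment or run time takes only a few lines. Here is a minimal sketch with the azure-keyvault-secrets Python package; the vault URL and secret name are placeholders.

```python
# A minimal sketch of reading a secret from Azure Key Vault, assuming the
# azure-identity and azure-keyvault-secrets packages. Names are placeholders.
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

client = SecretClient(
    vault_url="https://<your-vault>.vault.azure.net",
    credential=DefaultAzureCredential(),
)

# e.g., a SQL admin password used by a deployment script, kept out of source control
sql_password = client.get_secret("sql-admin-password").value
```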