Over the last few years, I’ve had a few people ask me why I don’t create two Twitter accounts so I can separate work and personal things.
I choose to use one account because I am more than a content feed, and I want to encourage others to be their whole selves. I want to acknowledge that we don’t work in isolation from what’s going on in the greater world.
I love learning things from people on Twitter. I find a lot of quick tips and SQL and Power BI blog posts from Twitter. I love seeing all the interesting data visualizations shared on Twitter.
I often start my mornings with caffeine and Twitter.
But I also love learning personal things about people. I love to see photos of your dogs and views while hiking. I like knowing that my friend Justin is really into wrenches. And my coworker Joey (who I followed on Twitter long before I worked with him) is really into cycling. And my friend Matt is into racing. I followed all these people before I met them in person. Many of my Twitter friends became “IRL” friends after we met at conferences.
I definitely lurked around the SQL community (Power BI didn't exist yet) for a while before I ever worked up the courage to say anything to anyone. And I started out with mostly data/work-related tweets. But as time went on, I realized how much I appreciated when others shared personal info about themselves; it helped me get to know them better. And I became more comfortable sharing more about me. So now I'm sort of giving back, with both professional and personal information.
Note: I do this personal/professional crossover only on Twitter, because I have deemed that to be an appropriate audience for both. I don’t share many personal things on LinkedIn. And I don’t share professional things on Instagram or Facebook. That’s my personal preference.
Are there risks to this? Absolutely.
People might stop following me because they don’t care for the personal content. My political opinions or obsession with my dog might turn some people off. Someone might be offended by my tattoos or my occasional use of profanity (which I never direct at someone and use more to express frustration/excitement/surprise). Or they may tire of my frequent use of gifs.
I know that potential and current clients see what I tweet. And it can and does affect their opinion of me. But when you hire me as your consultant, you get my personality and personal experiences as well as my tech knowledge. So, I don’t see it as being so very different from meeting me in real life at a conference or at another social event.
So far, having IRL conversations with clients about something I tweeted that they found helpful or entertaining has been an overall positive experience. I make a point not to name specific clients or projects unless I have their permission to do so (sometimes there are legitimate exceptions). I respect client privacy and confidentiality online and in person.
Before I could get to this place, I had to be secure in myself and secure in my employment and professional network. I recognize that not everyone will like me. That is true in person and on Twitter. And that’s ok. If you want to unfollow me, I’m ok with that. If you want to engage in conversations only about work stuff, that’s great. Feel free to mute words to tune out the rest of my tweets (your best bets are “Colorado”, “Denver”, “dog”, “kayak”, and “hike”). If you want to talk only about dogs and nature and my adorable and efficient camper, that’s also fine.
If you dig deep enough, I’m sure you can find some tweet that wasn’t my best moment. I’m sure I’ve said something regrettable in the 14 years since I joined Twitter. But I’m going to extend myself a little grace and remember that I’m human. And I’ll accept feedback if you think something I’ve said was unkind or made you feel unwelcome.
There is also a risk that someone can use this personal info against me, for harassment or identity theft. That is a risk for anyone sharing anything personal online. For now, I have assessed the risks and the rewards, and I find the rewards to outweigh the risks. I may decide differently later.
Do I recommend this for you?
Here’s the thing: I don’t think there are absolutes when it comes to how people use social media. If it makes you happy and it’s not hurting anyone, it’s probably fine.
It’s important to me that we remember that the people who teach us Azure and SQL and Power BI are people with non-work interests and personal struggles and interesting life experiences. And their more personal content gives me ways to relate to them outside of technology. Sometimes I learn really useful things from them, like the right type of lubricant to fix a squeaky door. Sometimes I notice a current event in their life that I can use to start a conversation at a conference.
Basically, I choose to use Twitter in a more human way. It’s working pretty well for me so far. You can decide if you have the desire and ability to do that. When this way of interacting with people stops being rewarding for me, I’ll reassess and take a different approach.
I’ve now helped people with this issue a few times, so I thought I should blog it for anyone else that runs into the “mystery”.
Here's the scenario: You are using Python, perhaps in Azure Databricks, to manipulate data before inserting it into a SQL Database. Your source data is a flattened data extract, and you need to create a unique list of values for an entity found in the data. For example, you have a dataset containing sales for the last month and you want a list of the unique products that have been sold. Then you insert the unique product values into a SQL table with a unique constraint on product name, and the insert fails with a unique key violation.
Given a data frame with one column containing product names, you might write some Python like df.dropDuplicates(["ProductName"]).
This would give you a list of distinct product names. But string comparisons in Python and Spark are case sensitive. So if you have product "abc123" and product "ABC123", those are considered distinct.
Case sensitivity in SQL Server
SQL Server is case insensitive by default. While you can use a case sensitive collation, most databases don’t.
Let’s say I create a table to contain products and add a unique constraint on product name. Adding a unique constraint means that I cannot insert two values into that column that are the same.
CREATE TABLE [dbo].[Products] (
[ProductName] nvarchar(100) NOT NULL
)
GO
ALTER TABLE [dbo].[Products]
ADD CONSTRAINT UQ_Product_ProductName UNIQUE (ProductName)
I can insert my first row.
Insert into dbo.Products (ProductName)
values ('abc123')
Now I try to insert another row where the letters are capitalized.
Insert into dbo.Products (ProductName)
values ('ABC123')
This fails with the following error:
Violation of UNIQUE KEY constraint 'UQ_Product_ProductName'. Cannot insert duplicate key in object 'dbo.Products'. The duplicate key value is (ABC123).
This is because my database collation is set to SQL_Latin1_General_CP1_CI_AS (the CI means case insensitive).
Potential solution
If you need to get to a list of case insensitive distinct values, you can do the following:
Create a column that contains the values of the ProductName converted to lower case.
Run the dropDuplicates on the lowercase column.
Drop the lowercase column.
You can see code on this Stack Overflow solution as an example (as always, please read it and use at your own risk).
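As a rough sketch of those three steps in PySpark (assuming a Spark DataFrame named df with a ProductName column; all names are illustrative):

from pyspark.sql import functions as F

# 1. Add a lowercase copy of the product name
df_lower = df.withColumn("ProductNameLower", F.lower(F.col("ProductName")))
# 2. Deduplicate on the lowercase helper column
df_dedup = df_lower.dropDuplicates(["ProductNameLower"])
# 3. Drop the helper column before writing to SQL
df_distinct = df_dedup.drop("ProductNameLower")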
Of course, you need to be sure that you don't need to keep both (differently cased) versions of the value as distinct before you follow the above steps. That's a business/data decision that must be considered. If you are fine with keeping only one value regardless of case, this approach should work. Be careful if you use this data further in Python. pandas.DataFrame.merge (similar to a SQL join) is case sensitive, as are most Python functions. Make sure you are handling your data correctly there, or just do your joins before you deduplicate.
For more formal enterprise Power BI development, many people have a checklist to ensure data acquisition and data modeling quality and performance. Fewer people have a checklist for their data visualization. I’d like to offer some ideas for quality checks on the visual design of your Power BI report. I’ll update this list as I get feedback or new ideas.
Checklist illustration by Manypixels Gallery on IconScout
The goal of my data visualization quality checklist is to ensure my report matches my intended message, audience, and navigation.
There are currently 4 sections to my PBI data viz quality check:
Message check
Squint test
Visual components check
Accessibility check
Message check
Can you explain the purpose/message of your report in a single sentence?
Can you explain how each visual on the page supports that purpose/message?
I use the term purpose more often when I have a report that supports more exploratory data viz, allowing users to filter and navigate to find their own meaning in their own decision contexts. Message is much easier to define in explanatory data viz, where I intend to communicate a (set of) conclusion(s). My purpose or message statement often involves defining my intended audience.
If you cannot define the purpose/message of your report page, your report may be unclear or unfocused. If you can’t identify how a visual supports the purpose/message, you may consider removing or changing the visual to improve clarity and usefulness.
Squint test
You can perform a squint test by taking a step back and squinting while viewing your report page. Alternatively, you can use a desktop application or browser add-in that blurs the page, simulating farsightedness. Looking at the blurry report page helps you evaluate the visual hierarchy.
What elements on the page stand out? Should they? Areas of high contrast in color or size stand out. People read larger things first.
Does the page seem balanced? Is there significantly more white space or more bright color on one side of the page?
Does the page background stand out more than the foreground?
When visually scanning an image-heavy page (which follows a Z-pattern in Western cultures), does the order of items on the page make sense? Did you position explanatory text before the chart that needs the explanation? If there are slicers, buttons, or other items that require interaction before reading the rest of the page, are they placed near the top left?
Is there enough space between the items on the page to keep the page from feeling overly busy? Is it easy to tell where one visual ends and another begins?
Visual components check
I have two levels of visual components checks: reviewing individual visuals and reviewing the visuals across a report page.
Individual visuals components check:
Do charts have descriptive, purposeful titles? If possible, state a conclusion in the title. Otherwise, make it very clear what people should expect to find in your charts so they can decide if it’s worth the effort to further analyze them.
Are chart backgrounds transparent or using low-saturation colors? We don’t want a background color standing out more than the data points in the chart.
Are bright colors reserved for highlighting items that need attention?
Are visual borders too dark or intense? If every chart has a border that contrasts highly from the background, it can take the focus away from the chart and impede the visual flow. We often don’t need to use borders at all because we can use whitespace for visual separation.
Does the chart use jargon or acronyms that are unfamiliar to your intended audience? Try to spell out words, add explanatory text, and/or include navigation links/buttons to a glossary to reduce the amount of effort it takes to understand the report.
Are colors used within and across the page easily distinguishable?
Are fonts used consistently, with only purposeful deviations?
If charts should be compared, are the axis scales set to facilitate a reasonable comparison?
Does the interactivity between elements on the page provide useful information?
Are visuals appropriately aligned? Misalignment can be distracting.
Accessibility Check
I have a more comprehensive accessibility checklist on my blog that I keep updated as new accessibility features and tools are released. Below are some important things you can check to ensure your report can be read by those with different visual, motor, and cognitive conditions.
Do text and visual components have sufficient color contrast (generally, 4.5:1 for text and 3:1 for graphical components)?
Is color used as the only means of conveying information?
Is tab order set on all non-decorative visuals on each page? Decorative items should be hidden in tab order.
Has alt text been added to all non-decorative items on the page?
Is key information only accessible through an interaction? If so, you may consider rearranging your visuals so they are pre-filtered to make the important conclusion more obvious.
If you try navigating your report with a keyboard, is the experience acceptable for keyboard-only users? Does accessing important information require too many key presses? Are there interactive actions that can’t be performed using a keyboard?
Do you have any video, audio, or animation that auto-plays or cannot be controlled by the report user?
You may have used the UNICHAR() function in DAX to return Unicode characters in DAX measures. If you haven’t yet read Chris Webb’s blog post on the topic, I recommend you do. But did you know there is a Power Query function that can return Unicode characters? This can be useful in cases when you want to assign a Unicode character to a categorical value.
In Power Query, you can use Character.FromNumber to return a Unicode character based upon the specified number.
I recently used this function in a Workout Wednesday for Power BI exercise to visualize music notes. I made a scatterplot that used eighth notes and eighth rests as the markers.
To create an eighth note (𝅘𝅥𝅮), I added Character.FromNumber(9834) to my Power Query expression. The eighth rest is Character.FromNumber(119102). There are many music-related Unicode characters available.
With the Character.FromNumber function, we can make a custom column that uses an If statement to determine the correct character to use for each row in a table.
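For example, a custom column expression along these lines would do it (the NoteType column and its values are illustrative):

if [NoteType] = "Note" then Character.FromNumber(9834) else Character.FromNumber(119102)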
One thing to note is that the query preview in the Power Query Editor doesn’t always render the Unicode values correctly. This doesn’t mean it won’t work in your report, though.
For instance, Unicode character 119102 doesn’t look like the eighth rest when viewed in the Power Query preview.
The question mark in a box represents a unicode character that could not be properly rendered
But after applying changes and sending the data into the dataset, you’ll see your expected output in the Data view in Power BI Desktop.
Output of the custom column containing Unicode characters in the Data view in Power BI Desktop
The Unicode character does show correctly in Power Query Online when creating a dataflow.
While DAX is the answer for generating music notes or emojis based upon aggregated values, Power Query can help you generate a Unicode value per row in a table by using the Character.FromNumber function.
I noticed Adam Saxton post a tip on the Guy in a Cube YouTube channel about publishing reports from Power BI Desktop for external users. According to Microsoft Docs (as of June 21, 2022), you can’t publish directly from Power BI Desktop to an external tenant. But Adam shows how that is now possible thanks to an update in Azure Active Directory.
To sign into another tenant in Power BI Desktop:
Click the Sign in button.
Enter your email address (i.e., username@domain) and click Continue.
When the sign-in dialog appears, select Use another account.
Select sign-in options.
Select Sign in to an organization.
Enter the organization’s domain (e.g., mycompany.org) and select Next.
You will then see a list of accounts, including your user signed in to the external tenant. Select that user.
Then you will be signed in with your B2B user to the external tenant. This allows you to build “thin” reports off of shared datasets as well as publish to PowerBI.com from Power BI Desktop!
It would still be better for usability if Power BI had a tenant switcher similar to Azure Data Factory or the Azure Portal, but this works perfectly fine in the meantime.
A couple of minor gaps
As Adam notes in his video, sensitivity labels don’t get applied if you publish from Power BI Desktop using an external user.
Also, once you publish, the link in the success dialog won’t work because it doesn’t contain the CTID in the URL. If I click on the link labeled “Open ‘WoW2022 Week 24.pbix’ in Power BI” in the screenshot below, it will try to open the link as if the report were in my tenant.
Success dialog seen after successfully publishing from Power BI Desktop
That won’t work since the report is in another tenant, so next I’ll get the error “Sorry, we couldn’t find that report”. You can go to the external tenant (by including the CTID in your url) and see your report is published, so don’t worry too much about this.
This tip makes life a bit easier for me when I'm publishing multiple times in a row from Power BI Desktop. It's fewer clicks than using the Upload a file dialog in PowerBI.com. I hope Microsoft will round out the B2B features so all links generated by Power BI include the CTID in the url, and maybe add the tenant switcher, too.
I needed to pull some user data for an app that uses Intercom. While I will probably import the data using Data Factory or a function in the long term, I needed to pull some quick data in a refreshable manner to combine with other data already available in Power BI.
I faced two challenges in getting this code to work:
Intercom’s API uses cursor-based pagination when retrieving contacts
I needed this query to be refreshable in PowerBI.com so I could schedule a daily refresh.
If you have worked with the Web.Contents function in Power Query, you may be familiar with all the various ways you can use it that aren’t supported in a refresh on PowerBI.com. Chris Webb’s blog is a great source for this info, if you find yourself stuck with a query that works in Power BI Desktop but not in the service.
The Intercom API
In version 2.4 of the Intercom API, users are a type of contact. You must make an HTTP GET call to https://api.intercom.io/contacts and pass an access token and Accept:application/json in the headers.
In the query string, you can specify the number of contacts per page by specifying per_page=x where x is a number less than or equal to 150. If you have more than 150 contacts, you will need to handle the cursor-based pagination.
When you make the initial call, the API will return a JSON object that contains a total count of contacts and a record for pages. Expanding the pages record shows the current page, the number of contacts per page, and the total number of pages.
Data returned in the Power Query Editor when retrieving contacts from the Intercom API
Expanding the next record gives you page 2 with a starting_after ID.
The cursor used for pagination in the Intercom API as retrieved using Power Query
To get the next page of contacts, the API call would be https://api.intercom.io/contacts?per_page=150&starting_after=[the starting_after ID listed above].
The Power Query Queries
This blog post from Gil Raviv gave me some ideas where to start, but the code in that blog will not refresh in PowerBI.com. You cannot put the iterations and the Web.Contents call and the generated list all in one query if you want to use scheduled refresh.
I ended up creating one query and two functions to accomplish my goal. The second function is optional, but you may find it useful as the time values in the API response are listed as Unix timestamps and you probably want to convert them to datetime values.
The first function contains the API call with an input parameter for the starting_after ID.
//Web API call function
(startafter) as record =>
let
    // Call the Intercom contacts endpoint; the page size and starting_after cursor go in the query string
    Source = Json.Document(Web.Contents("https://api.intercom.io/contacts",[Query=[per_page="150",starting_after=startafter],Headers=[Accept="application/json", Authorization="Bearer <your access token goes here>"]])),
    data = try Source[data] otherwise null,
    pages = Source[pages],
    ttlpages = pages[total_pages],
    // The next record only exists when there is another page to retrieve, so wrap it in a try
    nextkey = pages[next][starting_after],
    next = try nextkey otherwise null,
    res = [Data = data, Next = next, TotalPages = ttlpages]
in
res
It's important that the url in the Web.Contents function is a static string. If it is concatenated or dynamic in any way, the query will not be able to refresh in the Power BI service. All the query string parameters can go in the Query option of the Web.Contents call. If you have multiple query string parameters, you can put them in brackets with a comma separating them. You can do the same with multiple headers.
This function attempts the API call and returns null data if it encounters an error. Once the data is returned, it retrieves the specific JSON objects needed. The next two lines retrieve the starting_after value. The function returns the contact data, the starting_after value, and the total_pages value.
The List.Generate call generates the rows I need to call my first function. It sets an iterator variable to 0 and then calls the function, which returns the first page of data along with the total pages and the starting_after ID. As long as data is returned, it makes the API call again with the previously returned starting_after ID. This creates a list of lists that can be converted into a table of records. Then the records can be expanded to fields.
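As a sketch of that pattern (assuming the paging function above is named fnGetContactsPage and tolerates an empty starting_after value on the first call; names are illustrative), the query looks roughly like this:

let
    // Keep calling the paging function until there is no next cursor or no data is returned
    Pages = List.Generate(
        () => [i = 0, res = fnGetContactsPage("")],
        each _[res] <> null and _[res][Data] <> null,
        each [i = _[i] + 1, res = if _[res][Next] = null then null else fnGetContactsPage(_[res][Next])],
        each _[res][Data]
    ),
    // Turn the list of pages into a table, then expand to one row per contact record
    AsTable = Table.FromList(Pages, Splitter.SplitByNothing(), {"Data"}),
    Contacts = Table.ExpandListColumn(AsTable, "Data")
in
    Contacts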
I expanded several columns out into multiple columns. Then I adjusted the data types of my columns to the correct types (they came back as Any data type).
There were several columns that contained Unix timestamps. I used a custom function (shown below) to return the datetime version of those values. I needed to handle null values in the timestamp conversion, so I replaced all the null timestamps with 0, converted them, and then converted the datetime value of 1-Jan-1970 back to null. I did the replace for all 7 columns in one step. Then I removed the original columns that contained the Unix timestamps, as they were not analytically relevant for me.
Below is my Unix timestamp to datetime conversion function.
(UnixTime as number) as datetime=>
let
DT = #datetime(1970, 1, 1, 0, 0, 0) + #duration(0, 0, 0, UnixTime)
in DT
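Invoking it is then just a matter of adding a custom column per timestamp column. A rough sketch, where fnUnixToDateTime is the function above and the step and column names are illustrative:

AddCreatedDT = Table.AddColumn(PreviousStep, "created_at_dt", each fnUnixToDateTime([created_at]), type datetime)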
Advice and warnings
When using an API key that must be passed in the headers, it is safest to use a custom connector. Otherwise, you have to embed your API key in the M code, as shown above. When the query is sent to Intercom, it is encrypted using HTTPS. But anyone that opens your PBIX file would have access to it. Your key will be captured in the Power BI logs. And anyone that can manage to intercept your web request and decrypt your traffic would have access to it. This is not ideal. But creating a custom connector requires more advanced code and a gateway to make it usable in the Power BI service. With either option, you will choose Anonymous authentication for the data source.
Be sure to use the RelativePath and Query options in the Web.Contents call. This is necessary to make the query refreshable in the service. The value passed to the first parameter of Web.Contents must be a static string and must be valid in itself (no errors returned).
After publishing your report, you’ll need to set the credentials for the dataset before you can refresh it. Be sure to check the Skip test connection box. Otherwise, your credentials update will fail.
The Skip test connection option for the web data source in the dataset settings in PowerBI.com
Even though we are using anonymous authentication, you can still choose an Organizational privacy level.
If you want only app users from your Intercom contacts list, you need to filter the results on role = "user", since contacts also include leads. The Role attribute was in the custom attributes returned by the API.
It took a bit of experimentation to get this working, so I thought I would share what I found to work. Power BI keeps changing, and a better way to do this may be available in the future.
If you would like this development experience to improve, there are several Power BI Ideas you can vote for.
I have been in Azure DevOps pipelines a lot recently, helping clients set up automated releases. Many of my clients are not in a place where automated build and deploy of their SQL databases makes sense, so they deploy using change scripts that are reviewed before deployment.
We chose release pipelines over the YAML pipelines because it was easier to manage for the client and pretty quick to set up. While I had done this before, I had a couple of new challenges:
I was deploying to an Azure SQL managed instance that had no public endpoint.
There were multiple databases for which there may or may not be a change script to execute in each release.
This took a bit longer than I expected, and I enlisted my coworker Bill to help me work through the final details.
Setup
There are a few pre-requisites to make this work.
Self-hosted Agent
First, you need to run the job on a self-hosted agent that is located on a virtual machine that has access to the managed instance. This requires downloading and installing the agent on the VM. We ended up installing this on the same VM where we have our self-hosted integration runtime for Azure Data Factory, at least for deployment from dev to test. This was ok because the VM had enough resources to support both, and deployments are not long or frequent.
When you create a new self-hosted agent in Azure DevOps, a modal dialog appears, which provides you with a link to download the agent and a command line script to install and configure the agent.
Dialog showing the instructions to download and install the Azure Pipelines Self-hosted agent
We configured the agent to run as a service, allowing it to log on as Network Service. We then could validate it was running by opening the Services window and looking for Azure Pipelines Agent.
You must add an inbound firewall rule for the VM to allow the traffic from the IP addresses related to dev.azure.com. This can be done in the Azure portal on the Networking settings for the VM.
Then you must install the SqlServer module for PowerShell on the VM. And you must also install sqlpackage on the VM. We also updated the System Path variable to include the path to sqlpackage (which currently gets installed at C:\Program Files\Microsoft SQL Server\160\DAC\bin).
DevOps Permissions
Azure DevOps has a build agent with a service principal for each project. That service principal must have Contribute permissions on the repo in order to follow my steps.
Database Permissions
If you want the Azure DevOps build agent to execute a SQL script, the service principal must have appropriate permissions on the database(s). The service principal will likely need to create objects, execute stored procedures, write data, and read data. But this depends on the content of the change scripts you want to execute.
The Pipeline
In this project we had one pipeline to release some analytics databases and Azure Data Factory assets from dev to test.
Therefore, we included the latest version of the repos containing our database projects and data factory resources as artifacts in our pipeline. While the database objects and data factory resources were related, we decided they weren’t dependent upon each other. It was ok that one might fail and one succeed.
Side note: I love the ADF Tools extension for Azure DevOps from SQL Player. It is currently my preferred way to deploy Azure Data Factory. I find it is a much better experience than ARM template deployment. So that is what I used here in this ADF Deploy job.
Repo Setup
Getting back to databases, we had 3 databases to which we might want to deploy changes. We decided that we would set a standard of having one change script per database. We have an Azure DevOps repo that contains one solution with 3 database projects. These databases are highly related to each other, so the decision was made to keep them all in one repo. I added a folder to the repo called ChangeScripts. ChangeScripts contains two subfolders: Current and Previous. When we are ready to deploy database changes, we add a .sql file with the name of the database into the Current folder. For any deployment, there may be 1, 2, or 3 change scripts (1 for each database that has changes). When the release pipeline runs, the change scripts in the Current folder are executed and the files are renamed and moved into the Previous folder. This is why my DevOps service principal needed Contribute permissions on the repo.
Here are the tasks in my Database Deployment job:
Check if SQL Scripts exist in the expected location.
If a script exists for Database1, execute the script.
If a script exists for Database2, execute the script.
If a script exists for Database 3, execute the script.
Rename the change scripts to fit the pattern [DatabaseName][yyyyMMddHHmm].sql and move them to the Previous folder.
Configuring the Tasks
The first thing we must do is set the agent job to run using the agent pool containing the self-hosted agent. There is another important setting under Additional options: Allow scripts to access the OAuth token. This allows the service principal to authenticate and execute the SQL scripts, so make sure that is checked.
My first task, called Check for SQL Scripts, is a PowerShell task. I used an inline script, but you can also reference a file path that is accessible by the agent. My script looks for the three files and then sets a pipeline variable to true or false to indicate the presence of each file. These variables are used later in the process.
My task is configured with the working directory set to $(System.DefaultWorkingDirectory)/_Databases/ChangeScripts/Current. There are no environment variables or output variables.
We got stuck on the syntax for setting variables for a while. Most of the things we found in Microsoft Docs and on blogs were close, but didn’t work. Finally, we found this blog, which showed us the syntax that worked – be careful with the quotes, semi-colon, and parentheses.
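For reference, here is a rough sketch of such a script (the file checks rely on the working directory set for the task, described above; DB1Script matches the variable used in the task conditions below, and DB2Script/DB3Script are assumed to follow the same pattern):

# Check for each change script and surface the result as a pipeline variable
# using the Azure DevOps logging command syntax
$db1 = Test-Path ".\Database1.sql"
$db2 = Test-Path ".\Database2.sql"
$db3 = Test-Path ".\Database3.sql"
Write-Host "##vso[task.setvariable variable=DB1Script]$db1"
Write-Host "##vso[task.setvariable variable=DB2Script]$db2"
Write-Host "##vso[task.setvariable variable=DB3Script]$db3"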
My second task executes the script for Database1. It uses the Azure SQL Database deployment task. When you use this task, you can specify whether it deploys a dacpac or executes a SQL script. We chose to execute a SQL script that we stored in the same repo as the database projects. For this task, you must specify the Azure Subscription, the target server, and the target database. I set my authentication type to Service Principal, since I had granted access to the Azure DevOps service principal during setup. My Deploy type was set to SQL Script File. And my SQL Script location was set to $(System.DefaultWorkingDirectory)/_Databases/ChangeScripts/Current/Database1.sql. An important configuration on this task is to set Run this task to Custom Conditions. I specified my custom condition as eq(variables['DB1Script'], 'True'). This tells the task to execute only if the variable we set in the previous task is equal to True. It seems data types don’t really matter here – I set the variable in the PowerShell script as a boolean, but it was converted to a string in Azure DevOps.
My third and fourth tasks do the same activities for Database2 and Database3.
My last task, called Move Change Scripts to Previous Folder, is another inline PowerShell script. The script uses the variables set in the first task to determine if the file exists and needs to be moved. Because the script is working on a local copy on the VM, it must commit the changes back to the git repo after renaming and moving the file(s). The Working Directory for this task is set to $(System.DefaultWorkingDirectory)/_Databases/ChangeScripts.
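The exact script will depend on how the repo is laid out on the agent, but a rough sketch looks like this (the branch name, commit identity, and git plumbing are assumptions rather than our exact script, and it assumes the agent's copy is a git working copy):

# Rename executed scripts to [DatabaseName][yyyyMMddHHmm].sql and move them to Previous
$stamp = Get-Date -Format "yyyyMMddHHmm"
if ($env:DB1SCRIPT -eq "True") { git mv "Current\Database1.sql" "Previous\Database1$stamp.sql" }
if ($env:DB2SCRIPT -eq "True") { git mv "Current\Database2.sql" "Previous\Database2$stamp.sql" }
if ($env:DB3SCRIPT -eq "True") { git mv "Current\Database3.sql" "Previous\Database3$stamp.sql" }
# Commit and push back to the repo; the OAuth token enabled on the agent job authenticates the push
git config user.email "buildagent@example.com"
git config user.name "Azure Pipelines"
git commit -m "Move executed change scripts to Previous"
git -c http.extraheader="AUTHORIZATION: bearer $env:SYSTEM_ACCESSTOKEN" push origin HEAD:main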
Anything you output from your PowerShell script using Write-Host is saved in the release logs. This is helpful for troubleshooting if something goes wrong.
You can execute git commands from PowerShell in these inline scripts. This Stack Overflow question helped me understand the syntax.
And with those five steps, I’m able to execute change scripts for any or all of my three databases. We don’t anticipate adding many more databases, so this configuration is easily manageable for the situation. I wouldn’t do this (one step per database) for a situation in which I have dozens or hundreds of databases that may be created or deleted at any time.
I’ve been working on a project to populate an Operational Data Store using Azure Data Factory (ADF). We have been seeking to tune our pipelines so we can import data every 15 minutes. After tuning the queries and adding useful indexes to target databases, we turned our attention to the ADF activity durations and queue times.
Data Factory places the pipeline activities into a queue, where they wait until they can be executed. If your queue time is long, it can mean that the Integration Runtime on which the activity is executing is waiting on resources (CPU, memory, networking, or otherwise), or that you need to increase the concurrent job limit.
You can see queue time in the ADF Monitor by looking at the output of an activity.
Output from a stored procedure activity showing 2 seconds of queue time
But what if you want to see activity queue times across activities, across pipelines, or even across data factories? Then you need to output your logs to somewhere that makes this easier.
Send ADF Logs to Log Analytics
You can output your ADF logs to a storage account, to Log Analytics, to an event hub, or to a partner solution. I prefer Log Analytics because it’s easy to query and look for trends using KQL.
To configure the output to Log Analytics, you must create a Log Analytics workspace (if you don’t have an existing one) and add a diagnostic setting to the data factory resource. Once you have data feeding into Log Analytics, you can query it.
If you choose resource-specific destination tables in the diagnostic setting, you will find a table in Log Analytics called ADFActivityRun. This table contains a column called Output. The Output column contains the JSON we see in the ADF Studio Monitor app.
KQL has functions for parsing JSON and retrieving only the JSON objects I want to include. This means that I could write a query like the following.
ADFActivityRun
| extend queuetime = extractjson('$.durationInQueue.integrationRuntimeQueue', Output, typeof(int))
| where Status == 'Succeeded'
| where queuetime > 0
| project Start, End, PipelineName, ActivityType, ActivityName, dur = datetime_diff('second', End, Start), queuetime, PipelineRunId, ActivityRunId
| sort by queuetime desc
This query gives me a list of activities with successful executions that have queue times greater than zero. I can choose from any columns in the ADFActivityRun table, including the pipeline and activity names, start and end times, activity types, and run IDs. Duration is not an available column, so I calculated it as the difference between the start and end times. The queue time is buried in the JSON in the Output column, so I used the extractjson function to get the durationInQueue value.
Now that I know how to get the queue duration, I can look for trends across various slices of data. A query to get average queue time by activity type might look like the below.
ADFActivityRun
| where Status == 'Succeeded'
| where startofday(Start) == datetime(2022-01-04)
| extend queuetime = extractjson('$.durationInQueue.integrationRuntimeQueue', Output, typeof(int))
| summarize avg_queuetime = avg(queuetime) by ActivityType
| sort by avg_queuetime desc
In this query, I am retrieving activities with successful executions that started on January 4, 2022. I added my calculation to retrieve queue time. Then I calculated average queue time by activity type using the summarize operator and sorted the result descending by queue time.
I could have filtered or grouped on any other available column: pipeline name, activity name, integration runtime, resource ID, and more.
This information is available via the API as well, but a Log Analytics workspace can be spun up and running in minutes without having to write code to connect.
I was working on an Azure Data Factory project for a client who is new to ADF, and there was a miscommunication about the new Git Repo to be used for source control. Someone had created a new project and repo instead of using the existing one created for this purpose. This isn’t a big deal, as it’s easy enough to change in ADF Studio.
The Management Hub in Azure Data Factory Studio contains the Git configuration settings
In the Management Hub, you can change the Git configuration for the data factory, and there is a button near the top to disconnect the repo. You may need to do this if you run into conflicts in the publish branch or when you need to change repos.
It should be as easy as that, but I ran into a situation where it wasn’t.
Disconnect Button Unavailable
When I arrived at the Git configuration page, I found the Disconnect button to be disabled. This was confusing as I am an Owner and Data Factory Contributor on this resource.
I asked my client to disconnect the repo and moved on with the project, but I also logged some feedback for the Data Factory team. You can do this by selecting the Feedback button near the top right of the page in Azure Data Factory Studio. I have done this a few times and always received a prompt response.
You can send feedback from within Azure Data Factory Studio
I sent my feedback and received an email response within a couple of days. It contained a few helpful bits of information.
To disconnect git from the ADF management tab, you must be in git mode and be able to access the repository. If you are not able to access the repository but have permissions to update the factory, you can remove the git configuration via the REST API. Here is documentation on this API call: Factories – Configure Factory Repo – REST API (Azure Data Factory) | Microsoft Docs. In this case, repoConfiguration should be set to null in the request body, and the rest of the PUT body should be the same as the existing settings.
Alternatively, the git connection can be removed from the ADF management tab by either gaining access to the repository or having another ADF user with access remove the configuration. I also understand that this is not an ideal experience, so I have filed a work item to remove the requirement to have access to the repository.
This identified my situation exactly. I did not have access to the new repo that had been connected to the data factory. That caused me to be unable to disconnect it.
Hopefully the work item that removes the requirement for repo access before you can disconnect the data factory will be completed soon. But if you run into this issue, you can resolve it on your own using the API or by asking a colleague with repo access.
I recently worked on a project where we needed to copy some large files from a specified library in SharePoint Online. In that library, there were several layers of folders and many different types of files. My goal was to copy files that had a certain file extension and a file name that started with a specific string.
I began by comparing the capabilities between Azure Data Factory and Logic Apps. Logic Apps allows me to start the process via a webhook. It also has a SharePoint online connector that provides easy AAD authentication. And there is a special operation that allows me to retrieve the properties (path, filename) of files without retrieving the files. This makes it easy to loop through the list of files that is already filtered to only the files I want.
Logic App that retrieves file metadata from SharePoint and then performs an action for each file in the list
Azure Data Factory does have a SharePoint Online connector, but you can’t use it to copy files. You must use the HTTP connector to copy files from SharePoint in Data Factory. And that requires you to make a web call first to get a bearer token to be used in the authentication for the source of the copy activity.
While the ADF options work, I thought I would go with the Logic App because it felt cleaner and easier for others to understand. That is until I tried to copy some large files and encountered the following error.
Http request failed as there is an error: 'Cannot write more bytes to the buffer than the configured maximum buffer size: 104857600.'.
It turned out there were some extremely large files to be copied from that SharePoint library. SharePoint has a default limit of 100 MB buffer size, and the Get File Content action doesn’t natively support chunking.
At that point I would have to figure out how to handle the chunking using an HTTP call. That made the Logic App no better than Data Factory for my purposes. And since I already had part of the larger process in Data Factory, I went back there to see how it handled chunking.
Copying a Subset of Files from SharePoint Online Using Data Factory
And it turns out you don’t have to do anything special in Data Factory to handle the chunking of the large files – it does that work for you.
So I set up a similar process to what I had in my Logic App in my ADF pipeline.
ADF pipeline that retrieves a list of files in a SharePoint library and performs an action on each file
First, I used a Lookup activity to get the list of files. When you do a lookup against a SharePoint Online dataset, you can use OData filters to limit the data returned by the lookup. In my case this was necessary because I only wanted 144 of the items out of the list of 4300. Lookups can only return 5,000 rows, so I definitely needed to add filters to my query. Also, folders are considered an item in the data returned, so if you are looking only for files, you need to filter out folders (either by content type or by the name of the file).
Data Factory Lookup against a SharePoint Online list using an OData query with filters
The results of the lookup, returning one item per file, are fed into a For Each activity. Inside the For Each activity, I placed a Web activity to get the necessary bearer token and a Copy activity to copy the file to blob storage.
Inside the For Each activity is a Web activity to retrieve a bearer token, followed by a Copy activity that copies the files to blob storage.
To get a bearer token, you need a service principal with the correct permissions in SharePoint. Then you make a POST to https://accounts.accesscontrol.windows.net/{tenantID}/tokens/OAuth/2.
You need the following header: Content-Type: application/x-www-form-urlencoded.
In the body of the HTTP call, you must include: grant_type=client_credentials&client_id={clientid}@{tenantid}&client_secret={secretvalue}&resource=00000003-0000-0ff1-ce00-000000000000/{SharePoint tenant}.sharepoint.com@{tenantID}.
Make sure to set Secure Output so your credentials aren’t written to the ADF logs. Preferably, you would store the client ID and client secret in Key Vault and use a web activity to look them up before getting the token.
You can then use the output from the web activity in your copy activity.
Source settings for a Copy activity in ADF that copies files from SharePoint
The Additional Headers property should be populated with the auth token returned by the web activity.
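Assuming the web activity that retrieves the token is named Get Token (use your own activity name), the dynamic content for that header looks something like this:

Authorization: Bearer @{activity('Get Token').output.access_token}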
Since I don’t need to read the file contents, just copy it to blob storage, I used a binary dataset. I parameterized the dataset and linked service so my For Each activity could feed it the path and name of each file. This is because the lookup provides the path separately from the file name (which includes the file extension).
Binary dataset using an HTTP linked service, used to retrieve files from SharePoint Online
The linked service uses the HTTP connector. It has two parameters added: path and name.
HTTP linked service used to retrieve files in a SharePoint Online library in Data Factory
If my file is located at https://mytenant.sharepoint.com/sites/site1/libraryname/folder1/folder2/folder3/myfile.CSV, the URL I need to retrieve the file is https://mytenant.sharepoint.com/sites/site1/_api/web/GetFileByServerRelativeUrl('/sites/site1/libraryname/folder1/folder2/folder3/myfile.CSV')/$value.
So once I add my destination info to the Copy Activity, I can copy all the files retrieved from my lookup to blob storage.
A Few Tricky Things
Everyone’s SharePoint tenant is set up differently. Your url will vary based upon the configuration of your site collections/sites. When I tried this in my tenant, I didn’t have to include the “site/sitename” before “/_api/web…”. But I did in my client’s tenant when I was building this.
We granted permissions to the service principal at the site collection level to get this to work. If there was a way to grant lower permissions, we couldn't figure it out. The SharePoint admin granted access using permission request XML scoped to the site collection.
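A site-collection-scoped permission request generally takes a shape like the following (the Right value here is only an example; grant the least privilege your copy process needs):

<AppPermissionRequests AllowAppOnlyPolicy="true">
  <AppPermissionRequest Scope="http://sharepoint/content/sitecollection" Right="FullControl" />
</AppPermissionRequests>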
Your auth token is valid for 1 hour. If you copy a bunch of files sequentially, and it takes longer than that, you might get a timeout error. That is why I moved my web activity inside the For Each activity. If you know it’s only a few files and they are quick, you should just get one bearer token and use it for all the files.
Lastly, be sure to set retries and timeouts appropriately. Remember that the default timeout for an activity is 7 days, and that is probably not desirable.
It took a lot of trial and error for me to get this working, as I couldn’t find many blog posts or docs about this. Special thanks to Bill Fellows for helping me through it. I sincerely hope this blog post helps someone avoid my struggles.