Azure, Databricks, Unity Catalog

External tables and views in Azure Databricks Unity Catalog

I’ve been busy defining objects in my Unity Catalog metastore to create a secure exploratory environment for analysts and data scientists. I’ve found a lack of examples for doing this in Azure with file types other than delta (maybe you’re reading this in the future and this is no longer a problem, but it was when I wrote this). So I wanted to get some more examples out there in case it helps others.

I’m not storing any data in Databricks – I’m leaving my data in the data lake and using Unity Catalog to put a tabular schema on top of it (hence the use of external tables vs managed tables. In order to reference an ADLS account, you need to define a storage credential and an external location.

External tables

External tables in Databricks are similar to external tables in SQL Server. We can use them to reference a file or folder that contains files with similar schemas. External tables can use the following file formats:

  • delta
  • csv
  • json
  • avro
  • parquet
  • orc
  • text

If you don’t specify the file format in the USING clause of your DDL statement, it will use the default of delta.

Below is an example of creating an external table from a single CSV file.

CREATE TABLE mycatalog.myschema.external_table1
USING CSV
OPTIONS (header "true", inferSchema "true")
LOCATION 'abfss://containername@storageaccountname.dfs.core.windows.net/TopFolder/SecondFolder/myfile.csv';

Because I have used the LOCATION clause, this is an external table that stores just metadata. This SQL locates the specified file in my data lake and has Databricks create the schema based upon that file instead of me defining each column. Notice that I have specified in the options on the third line that there is a header row in my file and that Databricks should figure out the schema of the table.

Alternatively, I could explicitly define the columns for my external table. You can find the list of supported data types here.

CREATE TABLE mycatalog.myschema.external_table1 
(
  colA  INT,
  colB  STRING,
  colC  STRING
)
USING CSV 
OPTIONS (header "true") 
LOCATION 'abfss://containername@storageaccountname.dfs.core.windows.net/TopFolder/SecondFolder/myfile.csv';

External views

I had some JSON data in my lake that Databricks couldn’t automatically convert to a table so I created some external views. My data had a format similar to the below, with each document containing a single array that contained multiple objects, some of which were nested.

{
    "mystuff": [
        {
            "cola": "1",
            "colb": "2",
            "colc": "abc",
            "nestedthing": {
                "id": 1,
                "name": "thing1"
            }
        },
        {
            "cola": "2",
            "colb": "4",
            "colc": "def",
            "nestedthing": {
                "id": 22,
                "name": "thing22"
            }
        },
        {
            "cola": "3",
            "colb": "6",
            "colc": "ghi"
        }
    ]
}

The example view below directly queries a file in the data lake.

CREATE VIEW mycatalog.myschema.external_view1
select
  src.cola,
  src.colb,
  src.colc,
  src.nestedthing.id,
  src.nestedthing.name
FROM
  (
    select
      explode(mystuff) src
    FROM
      json.`abfss://containername@storageaccountname.dfs.core.windows.net/TopFolder/SecondFolder/myfile2.json`
  ) x

To reference the file in the data lake in the FROM clause of the query, we specify the file format first (JSON) followed by a dot and then the file path surround by backticks (not single quotes). If we needed to reference a folder instead we would just end the path at the folder name (no trailing slash is necessary).

The explode() function is great for turning objects in an array into columns in a tabular dataset. To access nested objects, you can use dot notation. If you need to parse more complex JSON, this is a helpful resource.

The query from the view above creates the following output.

A table containing 5 columns: cola, colb, colc, id, name.

I’m not sure yet if there are any consequences (performance? security?) of defining a view like this rather than first creating an external table. I couldn’t get the external table created without modifying the JSON files, which I was trying to avoid. I do the view produces the correct results. If you have experimented with this, let me know what you learned.

Azure, Azure Data Factory, Microsoft Technologies

Use the output of a Script activity as the items in a ForEach activity in Data Factory

In early 2022, Microsoft released a new activity in Azure Data Factory (ADF) called the Script activity. The Script activity allows you to execute one or more SQL statements and receive zero, one, or multiple result sets as the output. This is an advantage over the stored procedure activity that was already available in ADF, as the stored procedure activity doesn’t support using the result set returned from a query in a downstream activity.

However, when I went to find examples of how to reference those result sets, the documentation was lacking. It currently just says:

“For consuming activity output resultSets in down stream activity please refer to the Lookup activity result documentation.”

Microsoft Learn documentation on the Script activity

Populate the items in the ForEach activity

Similar to a Lookup activity, the Script activity can be used to populate the items in a ForEach activity, but the syntax is a bit different.

Let’s say we have a Script activity followed by a ForEach activity. The Script activity has a single result set.

A Script activity in Azure Data Factory with a ForEach activity

When I populate the items property of my ForEach activity, I use the following expression:
@activity('SCR_ScriptActivity').output.resultSets[0].rows

It starts similar to how we reference output from a Lookup activity. I reference the activity and then the output. But then instead of values I use resultSets[0].rows.

This makes sense when you look at the output from the activity.

{
   "resultSetCount":1,
   "recordsAffected":0,
   "resultSets":[
      {
         "rowCount":4,
         "rows":[
            {
               "colA":1
            },
            {
               "colA":2
            },
            {
               "colA":3
            },
            {
               "colA":4
            }
         ]
      }
   ],
   "outputParameters":{
      
   },
   "outputLogs":"",
   "outputLogsLocation":"",
   "outputTruncated":false,
   "effectiveIntegrationRuntime":"AutoResolveIntegrationRuntime (East US)",
   "executionDuration":1,
   "durationInQueue":{
      "integrationRuntimeQueue":1
   },
   "billingReference":{
      "activityType":"PipelineActivity",
      "billableDuration":[
         {
            "meterType":"AzureIR",
            "duration":0.016666666666666666,
            "unit":"Hours"
         }
      ]
   }
}

I want the output from the first (in this case, only) result set, so that’s resultSets[0]. The data returned is in the rows array in that result set. So that’s resultSets[0].rows.

How are you liking the script activity in Data Factory? Is there anything else you wish were included in the documentation? Let me know in the comments.

Microsoft Technologies, Power BI, Workout Wednesday

Clickable SVG images in Power BI using the HTML Content custom visual

People have done creative things with SVG measures in Power BI, ranging from KPI cards to infographics to fun games.

For my latest Workout Wednesday challenge, I used SVG measures to make holiday cards that open on a specified date.

Power BI report with 1 open Christmas card and 3 closed cards.
In the Power BI report for Workout Wednesday 2022 Week 48, the holiday cards are populated using SVG measures

When you click on one of the holiday cards, it navigates to a specified url. This was made possible by using the HTML Content custom visual.

The navigation to the URL is achieved by modifying the SVG code to include an href attribute. Depending on the placement of the href attribute, you can make one part of the SVG image or the entire image navigate to a URL when clicked.

Step by Step

To make a clickable SVG image for Power BI, there are 7 steps:

  1. Open the url in a text editor or html editor
  2. Replace all double quotes with single quotes
  3. Add href attribute around the content you want to be clickable
  4. Create a measure in Power BI and paste the contents of the SVG
  5. Add the HTML Content visual to a report page
  6. Populate the values of the visual with the measure
  7. In the format pane for the visual, set Allow Opening URLS to On.

For example, I have an SVG of a coffee cup.

coffee cup with steam coming out of it

If I open it in Notepad++ (you can also use Visual Studio code or another editor), it looks like this.

HTML for an SVG image opened in Notepad++

Because we are putting the contents in a DAX measure, we need to replace the double quotes with single quotes.

The Find and Replace dialog in Notepad++ set to find double quotes  and replace it with single quotes.

Then I add the href attribute. I want my entire image to navigate to my website (DataSavvy.me) when it is clicked. So I add <a href='https://datasavvy.me'> just after the opening <svg> tag, and I add a closing </a> at the end. Remember that the URL should be surrounded by single quotes rather than double quotes.

HTML code for an SVG image with an href attribute added.

Then I create a measure called SVG. I enter double quotes, paste the content from Notepad++, and add closing quotes as the end. Because I’m using the HTML content visual, I don’t have to add "data:image/svg+xml;utf8," at the beginning of my measure as I would if I were using this in a table visual.

Now I add the HTML Content visual and put my SVG measure in the Values field well.

Power BI Desktop showing a coffee cup image on a report page. The coffee cup visual is selected. The measure named SVG is placed in Values.

With the visual selected, I go to the formatting pane, expand the Content Formatting section, and turn Allow Opening URLs to On.

The format pane showing the Allow Opening URls option is set to on for the HTML Content visual.

When I hover over the image, the cursor changes, indicating the image is clickable.

A screenshot from Power BI Desktop with a cursor hovering over the image, indicating the image is clickable

When I click the image, I get a prompt to allow navigation to the url I put in the SVG.

A dialog in Power BI that says "You are about to navigate to: https://datasavvy.me. The options available are "OK" and "Cancel".

New possibilities unlocked

While static clickable SVGs are pretty cool, the potential is really in the fact that we can dynamically populate the SVG based upon data in our dataset. You can change the entire image or an attribute of the image (color, size, URL, etc.) based upon a slicer selection.

Now that you can make dynamic clickable images in Power BI, how do you plan to use them?

Consulting, DCAC, Microsoft Technologies

Types of data consulting engagements and what you can expect from them

There are multiple ways organizations can engage with a data (DBA/analytics/data architect/ML/etc.) consultant. The type of engagement you choose affects the pace and deliverables of the project, and the response times and availability of the consultant.

Two women are seated at a desk looking at a laptop screen. One woman is pointing at an object on the screen while the other looks on.
Photo from Unsplash

What types of consulting engagements are common?

Consulting engagements can range from a few hours a week to several months of full-time work. The way they are structured and billed affects how consultants work with clients. Here are some types of engagements that you will commonly see in data and IT consulting. .

Office hours/as-needed advising: This engagement type usually involves advising rather than hands-on work. This type of engagement may be used to “try out” a consultant to determine if they have the required knowledge and how the client likes working with them. Or it may be “after-care” once a project has been completed, so a client has access to advice on how to keep a new solution up and running. In this type of engagement, a consultant’s time is either booked in a recurring meeting or as needed.

Time & materials/bucket of hours: In a time and materials (T&M) engagement, a client has purchased a certain number of hours with a consultant that may be used for advising or hands-on work. (The “materials” part is for other purchases necessary such as travel or equipment.) While the client may identify some deliverables or outcomes to work towards, they are paying for an amount of time rather than a specific deliverable or outcome. This bucket of hours is often useful for project after-care where the consultant will be hands-on. It is also useful for a collection of small tasks that a client needs a consultant to complete. Clients commonly use these hours over several partial days or weeks rather than engaging the consultant full-time.

Retainer: When clients engage consultants on a retainer basis, they are paying for them to be available for a certain number of hours per week/month/quarter. This is commonly used for recurring maintenance work. It can also be used when a client knows they will have enough development tasks to engage the consultant each month, but they don’t know exactly what those tasks will be. This type of engagement is usually a 3- to 12-month commitment.

Project-based time & materials: In this engagement type, there is a specific project a client needs a consultant to complete or assist in completing. There is a scope defined at the beginning of the project with some rough requirements and an estimated timeline. If the work takes less time than anticipated, the client only pays for the hours used. If the work takes more time (usually due to unforeseen issues or changing requirements), the client and consultant will have to agree to an extension.

Project-based fixed-fee: In a fixed-fee project, the client and the consultant are agreeing to an amount of money in return for specified deliverables. This involves much more up-front effort in requirements gathering and project estimation than a time & materials engagement. This is because the fee stays the same whether the consultant finishes in the estimated time or not. If the fixed-fee project costs $100,000 and it ends up using $105,000 worth of consulting time & materials, the client does not owe the consultant $5000 (unless they have violated a part of the agreement and agreed to the additional hours before they were worked). In this case, the consultant simply makes less profit.

Staff augmentation: In a staff aug agreement, the consultant and client agree that the consultant will work a set number of hours, usually close to full-time, per week in a specified position. There are no stated deliverables, just expectations of hours worked and skills to be used.

Risks and rewards

Office hours

Office hours is the lowest level of commitment for both clients and consultants as it usually involves a small number of hours. As a client, you aren’t stuck for long if you find the consultant isn’t a good fit.

But if you don’t agree to recurring meetings, you are taking a chance that the consultant will not be available on the day and time you need them. You must understand that the consultant has other clients, many of whom are paying more money for more time with the consultant, who is “squeezing you in” between tasks for other clients. I personally enjoy these types of engagements because they are easy to fit in my schedule and don’t usually require a lot of preparation before meetings.

An office-hours engagement is not appropriate for complex hands-on work. It can be good for design and architecture discussions, or for help solving a specific problem. I’ve had clients successfully use office hours for help with DAX measures in Power BI. I’ve had helpful white-boarding sessions during office hours. But when something looks like it will become a full project, or there are urgent troubleshooting needs with high complexity, I usually suggest that a different type of engagement would be more helpful.

Bucket of hours

The biggest risk with the bucket of hours is scheduling and availability. It can be helpful to agree that the hours must be used by a specific date. This helps the consultant plan for those hours in their schedule and ensure that revenue will be earned in the period expected, rather than leaving the agreement open indefinitely. But clients must manage the hours to ensure they are used before the expiration date.

This type of engagement is best when there is effective communication between the client and consultant and deadlines are somewhat flexible. Since the consultant isn’t engaged full-time, they will have other deadlines for other clients that they must work around. This sometimes requires a bit of patience from clients. Unless it is specified in the agreement, clients usually can’t expect consultants to be immediately available.

I’ve seen two ways this type of engagement is used successfully.

  1. Clients clearly communicate tasks and deadlines each week and consultants deliver them at the end of the week, until the engagement is over.
  2. Clients use the hours for non-urgent support, where work consists mostly of paired programming or troubleshooting a system with which the consultant is familiar. Clients give the consultant at least a day’s notice when scheduling a meeting. The consultant only has tasks that arise as action items from the programming/troubleshooting work sessions.

While there may be a theme to the work, no one has agreed to deliver specific outcomes in this type of engagement. If that is needed, a project-based engagement may be more suitable. Buckets of hours are good for short-term tasks and support.

Retainer

Retainers involve a bit more commitment, as they usually last for multiple months. Often, retainer hours are discounted compared to time & material hours because the stability and long-term relationship are valued by the consultant.

At DCAC, we often use retainers for “DBA-as-a-service” engagements, where clients need someone to perform patching and maintenance, monitor databases, tune queries, and perform backups and restores. They don’t know exactly how many hours each task will require each month, but they are sure they will need a consultant for the agreed upon number of hours.

Retainer hours are often “use them or lose them”. If clients don’t give the consultant work to do, the hours won’t roll over to the next month.

Because retainers usually involve part-time work, it’s important to set expectations about the consultant’s availability. If a client needs the consultant to be available immediately for urgent support matters, that should be written into the agreement (e.g., “The Consultant will respond to all support requests within one hour of receipt.”).

It’s more difficult to do retainer hours for development projects. If the consultant uses all the hours for the month before a project is completed, the client either needs to find more budget for the extra hours or wait until the next month to resume project tasks. If there are real project deadlines, waiting a week or more to reach the start of the next month is probably not feasible. If you need a consultant to focus work on a single project with tight deadlines, it’s better to have a project-based engagement.

If the consultant is assisting other project members, and it is certain that the retainer hours will be used each month, it is possible to have a retainer for development efforts. I have a client who has a retainer agreement right now that has me perform a variety of small tasks each month. Sometimes it’s Azure Data Factory development and support. Some months involve writing PowerShell for automation runbooks. Other months, I help them troubleshoot Power BI models and reports. We meet weekly to discuss the tasks and assistance needed and plan tasks to ensure that we use the allotted hours each month without going over. But this only works because my client trusts me and keeps the communication flowing.

Project-based time & materials

This is the most common type of engagement I see in business intelligence/analytics consulting. In this project type, the agreement specifies expected deliverables and estimated effort for high-level tasks. While detailed requirements might not be determined up front, it is important to specify assumptions along with the scope and deliverables. If something violates an assumption, it will likely affect the time and cost it will take to complete the project.

With any type of project-based work, it might be helpful to include a discovery phase at the beginning of the engagement to better understand project requirements, constraints, and risks. After the discovery phase, the project estimate and timelines can be updated to reflect any new information that was uncovered. While this won’t keep scope and requirements from changing mid-project, it helps people plan a bit more up front instead of “going in blind”.

As with the other time & materials engagements, clients only pay for the hours used. So, if the project was estimated to take 200 hours, but it is completed in 180 hours, the client pays for 180 hours.

Project-based time and materials engagements often have consultants working full-time on the project, but that is not always the case. It’s important to establish expectations. Clients and consultants should discuss and agree to deadlines and availability for meetings and working sessions.

Project-based fixed fee

Fixed fee projects are all about deliverables and outcomes. Because of this, they carry the most risk for consultants. They require the most detailed agreements as far as scope, constraints, and assumptions. It is particularly important to include this information in the agreement and have both parties acknowledge it. Then, if something changes, you can refer to the agreement when discussing scope/cost/timeline changes.

It’s important for clients to read and understand the scope and assumptions. While it may be a less technical executive that actually signs the agreement, a technical person who can competently review the scope and assumptions on behalf of the client should do so before the agreement is signed.

Because it is easy to underestimate the work needed to complete the deliverables, consultants often “pad” their estimate with more hours than what they think is necessary to cover any unexpected complications. Most people underestimate effort, so if the actual hours were to be different, this would usually end up in the client’s favor. But it’s not uncommon to see large amounts of hours in an estimate in order to cover the risk.

It’s important for the person estimating the project to consider time required for software installation and validating system access, project management, learning and implementing unfamiliar technologies, knowledge transfer, design reviews, and other tasks that don’t immediately come to mind when estimating. If there are less experienced people working on the project, that could increase the hours needed.

I have learned that it takes less time to complete a project if it’s mostly me and my team completing the work. If I have client team members working with me, I usually have to increase the hours required, simply because I don’t know their personalities and skills and I’m not used to working with them.

Due to the risk of underestimation, many consultants do not like to undertake large fixed fee projects. Sometimes it’s better to break up a larger fixed-fee project into smaller phases/projects to reduce the risk. This is especially true when a client and consultant have not previously worked together.

Staff Augmentation

I personally do not consider staff aug to be true consulting. Staff aug is basically becoming a non-FTE (full-time employee) team member. It is a valid way to be a DBA/BI/ML practitioner, and many consultants do some staff aug work at some point in their careers. But it doesn’t necessarily require the “consultant” in the relationship to be consultative, and they may or may not have more skills than those present on the client team. Some companies treat staff augmentation as just a “butt in a seat”. But it’s also possible to be a knowledgeable and consultative consultant who happens to be working with a client through a staff augmentation agreement.

For independent consultants, the risk for this type of engagement is that it is likely full-time or close to it, which makes it difficult to maintain business with other clients. Having only one main client can be risky if something goes wrong. For consulting firms, there is an opportunity cost in allocating a consultant full-time to a client, especially if the consultant has skills that other consultants do not have. Depending on the length of the agreement, there is a risk that the consultant will feel like their skills are stagnating or be unhappy until they can work with a different client. For clients, having temporary team members can decrease consistency and institutional knowledge as people are only around for a few months to a year.

Choose an engagement type that matches the work you need done

You are more likely to have a successful consulting engagement if you go into it knowing the common risks and rewards. Many problems I have seen in consulting have been due to poor communication or trying to do work that doesn’t fit the engagement type. Whether you are a consultant or a client, it’s important to speak up if you feel like an engagement is not going well. There is no way to fix it if you can’t have a conversation about it.

Whether you are the client or consultant, you can propose changes to agreements before they are signed. If you find something is missing or concerning, speak up about it so everyone feels good about the agreement that is being signed. Consulting engagements are more successful when clients and consultants can support each other rather than having an adversarial relationship.