
External tables and views in Azure Databricks Unity Catalog

I’ve been busy defining objects in my Unity Catalog metastore to create a secure exploratory environment for analysts and data scientists. I’ve found a lack of examples for doing this in Azure with file types other than delta (maybe you’re reading this in the future and this is no longer a problem, but it was when I wrote this). So I wanted to get some more examples out there in case it helps others.

I’m not storing any data in Databricks – I’m leaving my data in the data lake and using Unity Catalog to put a tabular schema on top of it (hence the use of external tables instead of managed tables). In order to reference an ADLS account, you need to define a storage credential and an external location.
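
As a rough sketch, the external location DDL looks something like the below. The names and path are placeholders, and the storage credential itself is typically created in the Databricks UI, referencing a managed identity that has access to the storage account.

CREATE EXTERNAL LOCATION my_external_location
URL 'abfss://containername@storageaccountname.dfs.core.windows.net/TopFolder'
WITH (STORAGE CREDENTIAL my_adls_credential)
COMMENT 'Exploratory data for analysts and data scientists';

Once the external location exists, external tables and views whose paths fall under that URL can be created by users who have the appropriate privileges on it.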

External tables

External tables in Databricks are similar to external tables in SQL Server. We can use them to reference a file or folder that contains files with similar schemas. External tables can use the following file formats:

  • delta
  • csv
  • json
  • avro
  • parquet
  • orc
  • text

If you don’t specify the file format in the USING clause of your DDL statement, it will use the default of delta.
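
For example, here is a minimal sketch of registering a folder of existing delta files without a USING clause (the DeltaFolder path is hypothetical):

CREATE TABLE mycatalog.myschema.external_delta_table
LOCATION 'abfss://containername@storageaccountname.dfs.core.windows.net/TopFolder/DeltaFolder';

Because no USING clause is specified, Databricks assumes the files at that location are in delta format.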

Below is an example of creating an external table from a single CSV file.

CREATE TABLE mycatalog.myschema.external_table1
USING CSV
OPTIONS (header "true", inferSchema "true")
LOCATION 'abfss://containername@storageaccountname.dfs.core.windows.net/TopFolder/SecondFolder/myfile.csv';

Because I have used the LOCATION clause, this is an external table: Unity Catalog stores only the metadata, and the data stays in the data lake. This SQL points at the specified file and has Databricks infer the schema from that file instead of me defining each column. Notice that I have specified in the options on the third line that there is a header row in my file and that Databricks should figure out the schema of the table.

Alternatively, I could explicitly define the columns for my external table. You can find the list of supported data types in the Databricks documentation.

CREATE TABLE mycatalog.myschema.external_table1 
(
  colA  INT,
  colB  STRING,
  colC  STRING
)
USING CSV 
OPTIONS (header "true") 
LOCATION 'abfss://containername@storageaccountname.dfs.core.windows.net/TopFolder/SecondFolder/myfile.csv';
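
The same pattern works when pointing at a folder rather than a single file. Below is a hypothetical sketch over a folder of parquet files (the folder name is made up); the schema is read from the parquet metadata, so no column list is needed.

CREATE TABLE mycatalog.myschema.external_table2
USING PARQUET
LOCATION 'abfss://containername@storageaccountname.dfs.core.windows.net/TopFolder/ParquetFolder';

Once registered, an external table is queried like any other Unity Catalog table, e.g. SELECT * FROM mycatalog.myschema.external_table1 LIMIT 10;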

External views

I had some JSON data in my lake that Databricks couldn’t automatically convert to a table so I created some external views. My data had a format similar to the below, with each document containing a single array that contained multiple objects, some of which were nested.

{
    "mystuff": [
        {
            "cola": "1",
            "colb": "2",
            "colc": "abc",
            "nestedthing": {
                "id": 1,
                "name": "thing1"
            }
        },
        {
            "cola": "2",
            "colb": "4",
            "colc": "def",
            "nestedthing": {
                "id": 22,
                "name": "thing22"
            }
        },
        {
            "cola": "3",
            "colb": "6",
            "colc": "ghi"
        }
    ]
}

The example view below directly queries a file in the data lake.

CREATE VIEW mycatalog.myschema.external_view1
AS
SELECT
  src.cola,
  src.colb,
  src.colc,
  src.nestedthing.id,
  src.nestedthing.name
FROM
  (
    SELECT
      explode(mystuff) src
    FROM
      json.`abfss://containername@storageaccountname.dfs.core.windows.net/TopFolder/SecondFolder/myfile2.json`
  ) x;

To reference the file in the data lake in the FROM clause of the query, we specify the file format first (json) followed by a dot and then the file path surrounded by backticks (not single quotes). If we needed to reference a folder instead, we would just end the path at the folder name (no trailing slash is necessary).
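
For example, assuming a folder that contains only JSON files with the same layout (the JsonFolder path here is hypothetical), this query reads all of them at once:

SELECT * FROM json.`abfss://containername@storageaccountname.dfs.core.windows.net/TopFolder/JsonFolder`;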

The explode() function is great for turning the objects in an array into rows in a tabular dataset. To access fields of nested objects, you can use dot notation, as in src.nestedthing.id above. If you need to parse more complex JSON, this is a helpful resource.

The query from the view above creates the following output.

A table containing 5 columns: cola, colb, colc, id, name.

I’m not sure yet if there are any consequences (performance? security?) of defining a view like this rather than first creating an external table. I couldn’t get the external table created without modifying the JSON files, which I was trying to avoid. I do know the view produces the correct results. If you have experimented with this, let me know what you learned.


Creating a Unity Catalog in Azure Databricks

Unity Catalog in Databricks provides a single place to create and manage data access policies that apply across all workspaces and users in an organization. It also provides a simple data catalog for users to explore. So when a client wanted to create a place for statisticians and data scientists to explore the data in their data lake using a web interface, I suggested we use Databricks with Unity Catalog.

New account management and roles

There are some Databricks concepts that may be new for you in Azure when you use Unity Catalog. While Databricks workspaces are mostly the same, you now have account (organization)-level roles. This is necessary because Unity Catalog crosses workspaces.

A Databricks account covers all workspaces in your Azure tenant. There are three account-level roles: account admin, metastore admin, and account user. Underneath the account are one or more workspaces, with workspace-level roles of workspace admin and workspace user.
There are two levels of accounts and admins in Azure Databricks

Users and service principals created in a workspace are synced to the account as account-level users and service principals. Workspace-local groups are not synced to the account. There are now account-level groups that can be used in workspaces.

To manage your account, you can go to https://accounts.azuredatabricks.net/. If you log in with an AAD B2B user, you’ll need to open the account portal from within a workspace. To do this, go to your workspace and select your username in the top right of the page to open the menu. Then choose the Manage Account option in the menu. It will open a new browser window.

The user menu in a Databricks workspace with the fourth option, Manage Account emphasized.
To launch the account console, choose Manage Account from the menu under your username in a workspace

Requirements

To create a Unity Catalog metastore you’ll need:

  • a Databricks workspace on the Premium pricing tier
  • a container in an ADLS storage account in the same region, to hold the metastore’s data
  • the account admin role in the Databricks account console

The pricing tier is set on the Basics page when creating the Databricks workspace.

The Basics page of the Create an Azure Databricks workspace workflow in the Azure Portal has a pricing tier option near the bottom. Make sure this is set to Premium (+ Role-based access controls).
If you plan to use Unity Catalog, be sure to select the Premium pricing tier for your Databricks workspace

High-level steps

  1. Create a storage container in your ADLS account.
  2. Create an access connector for Azure Databricks.
  3. Assign the Storage Blob Data contributor role on the storage account to the managed identity for the access connector for Azure Databricks.
  4. Assign yourself account administrator in the Databricks account console.
  5. Create a Unity Catalog metastore.
  6. Assign a workspace to the Unity Catalog metastore.

The storage container holds the metadata and any managed data for your metastore. You’ll likely want to use this Unity Catalog metastore rather than the default hive metastore that comes with your Databricks workspace.
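
The difference shows up in the namespace you use in queries: the legacy workspace metastore is exposed as a catalog named hive_metastore, while objects in your Unity Catalog metastore use three-level catalog.schema.table names under the catalogs you define. The table names below are placeholders.

-- workspace-local table in the legacy hive metastore
SELECT * FROM hive_metastore.default.my_old_table;

-- table governed by Unity Catalog, shared across workspaces
SELECT * FROM mycatalog.myschema.external_table1;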

The access connector will show up as a separate resource in the Azure Portal.

The Azure Portal showing two resources in a list. The type of the first resource is Access Connector for Azure Databricks. The type of the second resource is Azure Databricks Service.

You don’t grant storage account access to end users – you’ll grant user access to data via Unity Catalog. The access connector allows the workspace to access the data in the storage account on behalf of Unity Catalog users. This is why you must assign the Storage Blob Data Contributor role to the access connector’s managed identity.

The Add role assignment page for Access Control on the storage account. The selected role is Storage Blob Data Contributor. Assign access to is set to Managed Identity. The member listed is DBxAccessConnector, which is the access connector for Azure Databricks.
Assigning the Storage Blob Data Contributor role to the managed identity for the Access Connector for Azure Databricks

The confusing part of setup

If you are not a Databricks account administrator, you won’t see the option to create a metastore in the account console. If you aren’t an AAD Global Admin, you need an AAD Global Admin to log into the Databricks account console and assign your user to the account admin role. It’s quite possible that the AAD Global Admin(s) in your tenant don’t know and don’t care what Databricks or Unity Catalog is. And most data engineers are not global admins in their company’s tenant. If you think this requirement should be changed, feel free to vote for my idea here.

Once you have been assigned the account admin role in Databricks, you will see the button to create a metastore in the account console.

The Databricks account console has a large green button labeled Create a metastore, which is only visible to account admins.
The Create a metastore button is only available to Databricks account admins

One or multiple metastores?

The Azure documentation recommends creating only one metastore per region and assigning that metastore to multiple workspaces. The current recommended best practice is to have one metastore that spans environments, business units, and teams. Currently, there is no technical limitation keeping you from creating multiple metastores; the recommendation is pointing you toward a single, centralized place to manage data and permissions.

Within your metastore, you can organize your data into catalogs and schemas. So it could be feasible to use only one metastore if you have all Databricks resources in one region.

In my first metastore, I’m using catalogs to distinguish environments (dev/test/prod) and schemas to distinguish business units. In my scenario, each business unit owns their own data. Within those schemas are external tables and views. Because it is most likely that someone would need to see all data for a specific business unit, this makes it easy to grant user access at the schema level.
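
As a rough sketch of what that looks like (the catalog, schema, and group names below are made up, and the exact grants you need will depend on your requirements):

CREATE CATALOG IF NOT EXISTS dev;
CREATE SCHEMA IF NOT EXISTS dev.sales;

-- let a business unit's group read everything in its schema
GRANT USE CATALOG ON CATALOG dev TO `sales-analysts`;
GRANT USE SCHEMA ON SCHEMA dev.sales TO `sales-analysts`;
GRANT SELECT ON SCHEMA dev.sales TO `sales-analysts`;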

I’ll go into table creation and user access in more detail in another post. This post stops at getting you through all the setup steps to create your Unity Catalog metastore.

If you prefer learning from videos, I’ve found the Unity Catalog videos on the Advancing Spark YouTube channel to be very helpful.