I had to do this for a client the other day, and I realized I hadn’t blogged about it. Let’s say you need to include data in a Power BI model, but the only source of the data is a matrix that is output from another system. And that matrix has multiple fields populating the columns. An example of this is below. The matrix has fiscal year and product category on columns, vertical on rows, and the profit metric populating the values.
You may know about the wonderful unpivot functionality in Power Query, but to handle this matrix, you first need to Transpose.
The steps to turn this matrix into a table are:
Transpose the query.
Remove the last column that contains the vertical totals.
Filter out the “Totals” value in the second column, which contains the product categories.
Use the first row as headers.
Select the Fiscal Year, Product Category, and Metric columns. Select Unpivot Other Columns.
Rename the Attribute column to Verticals.
Transposing a table turns its columns into rows and its rows into columns, so the field values that were spread across column headers become row values we can work with.
We don’t need the totals columns or rows because Power BI can recalculate those, and we don’t want to double-count profits.
In order to unpivot, we need to promote the first row to column headers, so the first column is labeled Fiscal Year and the fourth column is labeled Vertical Z.
The first three columns are as they should be, but the verticals need to change from columns to rows. This is accomplished by unpivoting. But you only want to unpivot the verticals and leave the fiscal year, product category, and metric columns as they are.
Then make sure column names are user-friendly, and it’s finished. You may also opt to remove the Metric column and rename the value column, if there is only one metric.
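The full set of applied steps can be sketched in M like this. This is a sketch only: the source query name, the generated column names (Column2, etc.), and the "Totals" label are assumptions based on the example matrix, so adjust them to match your data.

```m
let
    // Source is assumed to be the matrix output loaded into Power Query
    Source = Excel.CurrentWorkbook(){[Name = "MatrixOutput"]}[Content],
    // Step 1: transpose so the column header fields become rows
    Transposed = Table.Transpose(Source),
    // Step 2: remove the last column, which holds the vertical totals
    RemovedTotals = Table.RemoveColumns(Transposed, List.Last(Table.ColumnNames(Transposed))),
    // Step 3: filter out the "Totals" value in the second column (product categories)
    Filtered = Table.SelectRows(RemovedTotals, each [Column2] <> "Totals"),
    // Step 4: promote the first row to column headers
    Promoted = Table.PromoteHeaders(Filtered, [PromoteAllScalars = true]),
    // Step 5: unpivot everything except Fiscal Year, Product Category, and Metric
    Unpivoted = Table.UnpivotOtherColumns(
        Promoted, {"Fiscal Year", "Product Category", "Metric"}, "Attribute", "Value"),
    // Step 6: rename the Attribute column to Verticals
    Renamed = Table.RenameColumns(Unpivoted, {{"Attribute", "Verticals"}})
in
    Renamed
```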
I’ve been busy defining objects in my Unity Catalog metastore to create a secure exploratory environment for analysts and data scientists. I’ve found a lack of examples for doing this in Azure with file types other than delta (maybe you’re reading this in the future and this is no longer a problem, but it was when I wrote this). So I wanted to get some more examples out there in case it helps others.
I’m not storing any data in Databricks – I’m leaving my data in the data lake and using Unity Catalog to put a tabular schema on top of it (hence the use of external tables vs managed tables). In order to reference an ADLS account, you need to define a storage credential and an external location.
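For example, once a storage credential exists (in Azure it is typically backed by a Databricks access connector with a managed identity), the external location can be defined in SQL. The location name, credential name, and ADLS path below are hypothetical placeholders:

```sql
-- Points Unity Catalog at a container path in ADLS Gen2,
-- using a previously created storage credential
CREATE EXTERNAL LOCATION IF NOT EXISTS my_external_location
URL 'abfss://mycontainer@mystorageaccount.dfs.core.windows.net/data'
WITH (STORAGE CREDENTIAL my_storage_credential);
```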
External tables in Databricks are similar to external tables in SQL Server. We can use them to reference a file or folder that contains files with similar schemas. External tables can use the following file formats:
If you don’t specify the file format in the USING clause of your DDL statement, it will use the default of delta.
Below is an example of creating an external table from a single CSV file.
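A sketch of that statement, with a hypothetical catalog, schema, and file path:

```sql
CREATE TABLE mycatalog.myschema.external_table1
USING CSV
OPTIONS (header "true", inferSchema "true")
LOCATION 'abfss://mycontainer@mystorageaccount.dfs.core.windows.net/data/myfile.csv';
```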
Because I have used the LOCATION clause, this is an external table that stores just metadata. This SQL locates the specified file in my data lake and has Databricks create the schema based upon that file instead of me defining each column. Notice that I have specified in the options on the third line that there is a header row in my file and that Databricks should figure out the schema of the table.
Alternatively, I could explicitly define the columns for my external table. You can find the list of supported data types here.
CREATE TABLE mycatalog.myschema.external_table1 (
  colA STRING, -- hypothetical columns; define these to match your file
  colB INT,
  colC DATE
)
USING CSV
OPTIONS (header "true")
LOCATION 'abfss://mycontainer@mystorageaccount.dfs.core.windows.net/data/myfile.csv';
I had some JSON data in my lake that Databricks couldn’t automatically convert to a table so I created some external views. My data had a format similar to the below, with each document containing a single array that contained multiple objects, some of which were nested.
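A simplified version of that document structure looks like this (the field names are made up for illustration):

```json
{
  "items": [
    {
      "id": 1,
      "name": "first",
      "details": { "category": "A", "active": true }
    },
    {
      "id": 2,
      "name": "second",
      "details": { "category": "B", "active": false }
    }
  ]
}
```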
The example view below directly queries a file in the data lake.
CREATE VIEW mycatalog.myschema.external_view1 AS
SELECT
  item.id,            -- hypothetical field names; match these to your JSON
  item.name,
  item.details.category -- dot notation reaches into nested objects
FROM (
  SELECT explode(items) AS item -- items is the array in each document
  FROM json.`abfss://mycontainer@mystorageaccount.dfs.core.windows.net/data/myfile.json`
);
To reference the file in the data lake in the FROM clause of the query, we specify the file format first (JSON) followed by a dot and then the file path surrounded by backticks (not single quotes). If we needed to reference a folder instead, we would just end the path at the folder name (no trailing slash is necessary).
The explode() function is great for turning objects in an array into columns in a tabular dataset. To access nested objects, you can use dot notation. If you need to parse more complex JSON, this is a helpful resource.
The query from the view above creates the following output.
I’m not sure yet if there are any consequences (performance? security?) of defining a view like this rather than first creating an external table. I couldn’t get the external table created without modifying the JSON files, which I was trying to avoid. I do know the view produces the correct results. If you have experimented with this, let me know what you learned.
In early 2022, Microsoft released a new activity in Azure Data Factory (ADF) called the Script activity. The Script activity allows you to execute one or more SQL statements and receive zero, one, or multiple result sets as the output. This is an advantage over the stored procedure activity that was already available in ADF, as the stored procedure activity doesn’t support using the result set returned from a query in a downstream activity.
However, when I went to find examples of how to reference those result sets, the documentation was lacking. It currently just says:
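For context, the Script activity's output includes a resultSets array, where each result set has a rows array of objects keyed by column name. The shape looks something like this (the values here are illustrative, not from a real run):

```json
{
  "resultSetCount": 1,
  "resultSets": [
    {
      "rowCount": 1,
      "rows": [
        { "RowCount": 42 }
      ]
    }
  ]
}
```

A downstream activity can then reference a value with an expression such as `@activity('Run Script').output.resultSets[0].rows[0].RowCount`, where the activity name and column name are placeholders for your own.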
When you click on one of the holiday cards, it navigates to a specified URL. This was made possible by using the HTML Content custom visual.
The navigation to the URL is achieved by modifying the SVG code to include an href attribute. Depending on the placement of the href attribute, you can make one part of the SVG image or the entire image navigate to a URL when clicked.
Step by Step
To make a clickable SVG image for Power BI, there are 7 steps:
Open the SVG in a text editor or HTML editor
Replace all double quotes with single quotes
Add href attribute around the content you want to be clickable
Create a measure in Power BI and paste the contents of the SVG
Add the HTML Content visual to a report page
Populate the values of the visual with the measure
In the format pane for the visual, set Allow Opening URLs to On.
For example, I have an SVG of a coffee cup.
If I open it in Notepad++ (you can also use Visual Studio code or another editor), it looks like this.
Because we are putting the contents in a DAX measure, we need to replace the double quotes with single quotes.
Then I add the href attribute. I want my entire image to navigate to my website (DataSavvy.me) when it is clicked. So I add <a href='https://datasavvy.me'> just after the opening <svg> tag, and I add a closing </a> at the end. Remember that the URL should be surrounded by single quotes rather than double quotes.
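Using a hypothetical minimal SVG in place of the coffee cup, the result looks like this (note the single quotes throughout):

```xml
<svg xmlns='http://www.w3.org/2000/svg' viewBox='0 0 100 100'>
  <a href='https://datasavvy.me'>
    <!-- everything inside the anchor is clickable; here, the whole image -->
    <circle cx='50' cy='50' r='40' fill='brown' />
  </a>
</svg>
```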
Then I create a measure called SVG. I enter double quotes, paste the content from Notepad++, and add closing quotes at the end. Because I’m using the HTML Content visual, I don’t have to add "data:image/svg+xml;utf8," at the beginning of my measure as I would if I were using this in a table visual.
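The measure is just the SVG text wrapped in double quotes. Using the same minimal hypothetical SVG rather than the actual coffee cup image:

```dax
SVG =
"<svg xmlns='http://www.w3.org/2000/svg' viewBox='0 0 100 100'>
<a href='https://datasavvy.me'>
<circle cx='50' cy='50' r='40' fill='brown' />
</a>
</svg>"
```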
Now I add the HTML Content visual and put my SVG measure in the Values field well.
With the visual selected, I go to the formatting pane, expand the Content Formatting section, and turn Allow Opening URLs to On.
When I hover over the image, the cursor changes, indicating the image is clickable.
When I click the image, I get a prompt to allow navigation to the URL I put in the SVG.
New possibilities unlocked
While static clickable SVGs are pretty cool, the potential is really in the fact that we can dynamically populate the SVG based upon data in our dataset. You can change the entire image or an attribute of the image (color, size, URL, etc.) based upon a slicer selection.
Now that you can make dynamic clickable images in Power BI, how do you plan to use them?
There are multiple ways organizations can engage with a data (DBA/analytics/data architect/ML/etc.) consultant. The type of engagement you choose affects the pace and deliverables of the project, and the response times and availability of the consultant.
What types of consulting engagements are common?
Consulting engagements can range from a few hours a week to several months of full-time work. The way they are structured and billed affects how consultants work with clients. Here are some types of engagements that you will commonly see in data and IT consulting.
Office hours/as-needed advising: This engagement type usually involves advising rather than hands-on work. This type of engagement may be used to “try out” a consultant to determine if they have the required knowledge and how the client likes working with them. Or it may be “after-care” once a project has been completed, so a client has access to advice on how to keep a new solution up and running. In this type of engagement, a consultant’s time is either booked in a recurring meeting or as needed.
Time & materials/bucket of hours: In a time and materials (T&M) engagement, a client has purchased a certain number of hours with a consultant that may be used for advising or hands-on work. (The “materials” part is for other purchases necessary such as travel or equipment.) While the client may identify some deliverables or outcomes to work towards, they are paying for an amount of time rather than a specific deliverable or outcome. This bucket of hours is often useful for project after-care where the consultant will be hands-on. It is also useful for a collection of small tasks that a client needs a consultant to complete. Clients commonly use these hours over several partial days or weeks rather than engaging the consultant full-time.
Retainer: When clients engage consultants on a retainer basis, they are paying for them to be available for a certain number of hours per week/month/quarter. This is commonly used for recurring maintenance work. It can also be used when a client knows they will have enough development tasks to engage the consultant each month, but they don’t know exactly what those tasks will be. This type of engagement is usually a 3- to 12-month commitment.
Project-based time & materials: In this engagement type, there is a specific project a client needs a consultant to complete or assist in completing. There is a scope defined at the beginning of the project with some rough requirements and an estimated timeline. If the work takes less time than anticipated, the client only pays for the hours used. If the work takes more time (usually due to unforeseen issues or changing requirements), the client and consultant will have to agree to an extension.
Project-based fixed-fee: In a fixed-fee project, the client and the consultant agree to an amount of money in return for specified deliverables. This involves much more up-front effort in requirements gathering and project estimation than a time & materials engagement, because the fee stays the same whether the consultant finishes in the estimated time or not. If the fixed-fee project costs $100,000 and it ends up using $105,000 worth of consulting time & materials, the client does not owe the consultant the extra $5,000 (unless the client has violated part of the agreement and agreed to the additional hours before they were worked). In this case, the consultant simply makes less profit.
Staff augmentation: In a staff aug agreement, the consultant and client agree that the consultant will work a set number of hours, usually close to full-time, per week in a specified position. There are no stated deliverables, just expectations of hours worked and skills to be used.
Risks and rewards
Office hours
Office hours is the lowest level of commitment for both clients and consultants as it usually involves a small number of hours. As a client, you aren’t stuck for long if you find the consultant isn’t a good fit.
But if you don’t agree to recurring meetings, you are taking a chance that the consultant will not be available on the day and time you need them. You must understand that the consultant has other clients, many of whom are paying more money for more of the consultant’s time, and that you are being “squeezed in” between tasks for those clients. I personally enjoy these types of engagements because they are easy to fit into my schedule and don’t usually require a lot of preparation before meetings.
An office-hours engagement is not appropriate for complex hands-on work. It can be good for design and architecture discussions, or for help solving a specific problem. I’ve had clients successfully use office hours for help with DAX measures in Power BI. I’ve had helpful white-boarding sessions during office hours. But when something looks like it will become a full project, or there are urgent troubleshooting needs with high complexity, I usually suggest that a different type of engagement would be more helpful.
Bucket of hours
The biggest risk with the bucket of hours is scheduling and availability. It can be helpful to agree that the hours must be used by a specific date. This helps the consultant plan for those hours in their schedule and ensure that revenue will be earned in the period expected, rather than leaving the agreement open indefinitely. But clients must manage the hours to ensure they are used before the expiration date.
This type of engagement is best when there is effective communication between the client and consultant and deadlines are somewhat flexible. Since the consultant isn’t engaged full-time, they will have other deadlines for other clients that they must work around. This sometimes requires a bit of patience from clients. Unless it is specified in the agreement, clients usually can’t expect consultants to be immediately available.
I’ve seen two ways this type of engagement is used successfully.
Clients clearly communicate tasks and deadlines each week and consultants deliver them at the end of the week, until the engagement is over.
Clients use the hours for non-urgent support, where work consists mostly of paired programming or troubleshooting a system with which the consultant is familiar. Clients give the consultant at least a day’s notice when scheduling a meeting. The consultant only has tasks that arise as action items from the programming/troubleshooting work sessions.
While there may be a theme to the work, no one has agreed to deliver specific outcomes in this type of engagement. If that is needed, a project-based engagement may be more suitable. Buckets of hours are good for short-term tasks and support.
Retainer
Retainers involve a bit more commitment, as they usually last for multiple months. Often, retainer hours are discounted compared to time & material hours because the stability and long-term relationship are valued by the consultant.
At DCAC, we often use retainers for “DBA-as-a-service” engagements, where clients need someone to perform patching and maintenance, monitor databases, tune queries, and perform backups and restores. They don’t know exactly how many hours each task will require each month, but they are sure they will need a consultant for the agreed upon number of hours.
Retainer hours are often “use them or lose them”. If clients don’t give the consultant work to do, the hours won’t roll over to the next month.
Because retainers usually involve part-time work, it’s important to set expectations about the consultant’s availability. If a client needs the consultant to be available immediately for urgent support matters, that should be written into the agreement (e.g., “The Consultant will respond to all support requests within one hour of receipt.”).
It’s more difficult to do retainer hours for development projects. If the consultant uses all the hours for the month before a project is completed, the client either needs to find more budget for the extra hours or wait until the next month to resume project tasks. If there are real project deadlines, waiting a week or more to reach the start of the next month is probably not feasible. If you need a consultant to focus work on a single project with tight deadlines, it’s better to have a project-based engagement.
If the consultant is assisting other project members, and it is certain that the retainer hours will be used each month, it is possible to have a retainer for development efforts. I have a client who has a retainer agreement right now that has me perform a variety of small tasks each month. Sometimes it’s Azure Data Factory development and support. Some months involve writing PowerShell for automation runbooks. Other months, I help them troubleshoot Power BI models and reports. We meet weekly to discuss the tasks and assistance needed and plan tasks to ensure that we use the allotted hours each month without going over. But this only works because my client trusts me and keeps the communication flowing.
Project-based time & materials
This is the most common type of engagement I see in business intelligence/analytics consulting. In this project type, the agreement specifies expected deliverables and estimated effort for high-level tasks. While detailed requirements might not be determined up front, it is important to specify assumptions along with the scope and deliverables. If something violates an assumption, it will likely affect the time and cost it will take to complete the project.
With any type of project-based work, it might be helpful to include a discovery phase at the beginning of the engagement to better understand project requirements, constraints, and risks. After the discovery phase, the project estimate and timelines can be updated to reflect any new information that was uncovered. While this won’t keep scope and requirements from changing mid-project, it helps people plan a bit more up front instead of “going in blind”.
As with the other time & materials engagements, clients only pay for the hours used. So, if the project was estimated to take 200 hours, but it is completed in 180 hours, the client pays for 180 hours.
Project-based time and materials engagements often have consultants working full-time on the project, but that is not always the case. It’s important to establish expectations. Clients and consultants should discuss and agree to deadlines and availability for meetings and working sessions.
Project-based fixed fee
Fixed fee projects are all about deliverables and outcomes. Because of this, they carry the most risk for consultants. They require the most detailed agreements as far as scope, constraints, and assumptions. It is particularly important to include this information in the agreement and have both parties acknowledge it. Then, if something changes, you can refer to the agreement when discussing scope/cost/timeline changes.
It’s important for clients to read and understand the scope and assumptions. While it may be a less technical executive that actually signs the agreement, a technical person who can competently review the scope and assumptions on behalf of the client should do so before the agreement is signed.
Because it is easy to underestimate the work needed to complete the deliverables, consultants often “pad” their estimates with more hours than they think are necessary, to cover any unexpected complications. Most people underestimate effort, so if the actual hours differ from the estimate, it usually ends up in the client’s favor. But it’s not uncommon to see a large number of hours in an estimate in order to cover the risk.
It’s important for the person estimating the project to consider time required for software installation and validating system access, project management, learning and implementing unfamiliar technologies, knowledge transfer, design reviews, and other tasks that don’t immediately come to mind when estimating. If there are less experienced people working on the project, that could increase the hours needed.
I have learned that it takes less time to complete a project if it’s mostly me and my team completing the work. If I have client team members working with me, I usually have to increase the hours required, simply because I don’t know their personalities and skills and I’m not used to working with them.
Due to the risk of underestimation, many consultants do not like to undertake large fixed fee projects. Sometimes it’s better to break up a larger fixed-fee project into smaller phases/projects to reduce the risk. This is especially true when a client and consultant have not previously worked together.
Staff augmentation
I personally do not consider staff aug to be true consulting. Staff aug is basically becoming a non-FTE (full-time employee) team member. It is a valid way to be a DBA/BI/ML practitioner, and many consultants do some staff aug work at some point in their careers. But it doesn’t necessarily require the “consultant” in the relationship to be consultative, and they may or may not have more skills than those present on the client team. Some companies treat staff augmentation as just a “butt in a seat”. But it’s also possible to be a knowledgeable and consultative consultant who happens to be working with a client through a staff augmentation agreement.
For independent consultants, the risk for this type of engagement is that it is likely full-time or close to it, which makes it difficult to maintain business with other clients. Having only one main client can be risky if something goes wrong. For consulting firms, there is an opportunity cost in allocating a consultant full-time to a client, especially if the consultant has skills that other consultants do not have. Depending on the length of the agreement, there is a risk that the consultant will feel like their skills are stagnating or be unhappy until they can work with a different client. For clients, having temporary team members can decrease consistency and institutional knowledge as people are only around for a few months to a year.
Choose an engagement type that matches the work you need done
You are more likely to have a successful consulting engagement if you go into it knowing the common risks and rewards. Many problems I have seen in consulting have been due to poor communication or trying to do work that doesn’t fit the engagement type. Whether you are a consultant or a client, it’s important to speak up if you feel like an engagement is not going well. There is no way to fix it if you can’t have a conversation about it.
Whether you are the client or consultant, you can propose changes to agreements before they are signed. If you find something is missing or concerning, speak up about it so everyone feels good about the agreement that is being signed. Consulting engagements are more successful when clients and consultants can support each other rather than having an adversarial relationship.
Unity Catalog in Databricks provides a single place to create and manage data access policies that apply across all workspaces and users in an organization. It also provides a simple data catalog for users to explore. So when a client wanted to create a place for statisticians and data scientists to explore the data in their data lake using a web interface, I suggested we use Databricks with Unity Catalog.
New account management and roles
There are some Databricks concepts that may be new for you in Azure when you use Unity Catalog. While Databricks workspaces are mostly the same, you now have account-level (organization-level) roles. This is necessary because Unity Catalog crosses workspaces.
Users and service principals created in a workspace are synced to the account as account-level users and service principals. Workspace-local groups are not synced to the account. There are now account-level groups that can be used in workspaces.
To manage your account, you can go to https://accounts.azuredatabricks.net/. If you log in with an AAD B2B user, you’ll need to open the account portal from within a workspace. To do this, go to your workspace and select your username in the top right of the page to open the menu. Then choose the Manage Account option in the menu. It will open a new browser window.
The storage container holds the metadata and any managed data for your metastore. You’ll likely want to use this Unity Catalog metastore rather than the default hive metastore that comes with your Databricks workspace.
The access connector will show up as a separate resource in the Azure Portal.
You don’t grant storage account access to end users – you grant user access to data via Unity Catalog. The access connector allows the workspace to access the data in the storage account on behalf of Unity Catalog users. This is why you must assign the Storage Blob Data Contributor role to the access connector.
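That role assignment can be done in the Azure Portal or with the Azure CLI. A sketch, assuming hypothetical resource names and that you have already retrieved the access connector's managed identity principal ID:

```shell
# Assign Storage Blob Data Contributor on the storage account
# to the access connector's managed identity
az role assignment create \
  --assignee "<access-connector-principal-id>" \
  --role "Storage Blob Data Contributor" \
  --scope "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.Storage/storageAccounts/<storage-account>"
```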
The confusing part of setup
If you are not a Databricks account administrator, you won’t see the option to create a metastore in the account console. If you aren’t an AAD Global Admin, you need an AAD Global Admin to log into the Databricks account console and assign your user to the account admin role. It’s quite possible that the AAD Global Admin(s) in your tenant don’t know and don’t care what Databricks or Unity Catalog is. And most data engineers are not global admin in their company’s tenant. If you think this requirement should be changed, feel free to vote for my idea here.
Once you have been assigned the account admin role in Databricks, you will see the button to create a metastore in the account console.
One or multiple metastores?
The Azure documentation recommends creating only one metastore per region and assigning that metastore to multiple workspaces. The current recommended best practice is to have one metastore that spans environments, business units, and teams. Currently, there is no technical limitation keeping you from creating multiple metastores. The recommendation is pointing you toward a single, centralized place to manage data and permissions.
Within your metastore, you can organize your data into catalogs and schemas. So it could be feasible to use only one metastore if you have all Databricks resources in one region.
In my first metastore, I’m using catalogs to distinguish environments (dev/test/prod) and schemas to distinguish business units. In my scenario, each business unit owns their own data. Within those schemas are external tables and views. Because it is most likely that someone would need to see all data for a specific business unit, this makes it easy to grant user access at the schema level.
I’ll save table creation and user access for another post. This post stops at getting you through all the setup steps to create your Unity Catalog metastore.
I’ve seen a few people start Azure Data Factory (ADF) projects assuming that we would have one source control repo per environment, meaning that you would attach a Git repo to Dev, and another Git repo to Test and another to Prod.
But the Azure Data Factory documentation on continuous integration and delivery states:
“Only the development factory is associated with a git repository. The test and production factories shouldn’t have a git repository associated with them and should only be updated via an Azure DevOps pipeline or via a Resource Management template.”
You’ll find the recommendation of only one repo connected to the development data factory echoed by other ADF practitioners, including myself.
We use source control largely to control code changes and for disaster recovery. I think the desire to use multiple repos is about disaster recovery more than anything. If something bad happens, you want to be able to access and re-deploy your code as quickly as possible. And since we start building in our repo-connected dev environment, some people feel “unprotected” in higher environments.
But Why Not Have All the Repos?
For me, there are two main reasons to have only one repository per solution tied only to the development data factory.
First, having multiple repos adds more complexity with very few benefits.
Having a repo per environment adds extra work in deployment. I can see no additional benefits for deployment from having a repo per environment for most ADF solutions. I won’t say never, but I have yet to encounter a situation where this would help.
When you deploy your data factory, whether you use ARM templates or individual JSON files, you are publishing to the live ADF service. This is what happens when you publish from your collaboration branch in source control to the live version in your development data factory. If you follow normal deployment patterns, you deploy from the main (if you use JSON files) or adf_publish (if you use the ARM template) branch in source control to the live service in Test (or whatever your next environment is). If your Test data factory is connected to a repo, you need to figure out how to get your code into that repo.
Would you copy your code to the other repo before you deploy to the service? What if something fails in your deployment process before deployment to the live service is complete? Would you want to revert your changes in the Git repo?
You could deploy to the live service first and skip that issue. But you still need to decide how to merge your code into the Test repo. You’ll need to handle any merge conflicts. And you’ll likely need to allow unrelated histories for the merge to work, so when you look back in your commit history, it probably won’t make sense.
At best, this Test repo becomes an additional place to store the code that is currently in Test. But if you are working through a healthy development process, you already had this code in your Dev repo. You might have even tagged it for release, so it’s easy to find. Your Git repo is likely already highly available, if it is cloud-based. In my mind, this just creates one more copy of your code that can get out of date, and one more deployment step. If you just want a copy of what is in Test or Prod for safe keeping, you can always export the resource’s ARM template. But if I were to do that, I would be inclined to keep it in blob storage or somewhere outside of a repo, since I already have the code in a repo. This would allow me to redeploy if my repo weren’t available.
Then, once you have sufficiently tested your data factory in Test, would you deploy code to Prod from the Test repo or from the Dev repo?
Even if you have the discipline and DevOps/automation capabilities to support multiple repos, you likely don’t want to do this unless you have requirements that mandate it. That brings me to my second reason.
Deviation from Common DevOps Practice
Having a repo per environment is a deviation from common software engineering practices. Most software engineering projects do not have separate repos per environment. They might have separate repos for different projects within a solution, but that is a different discussion.
If you have a separate repo for dev and test, what do you do about history? I think there is also a danger that people would want to make changes in higher environments instead of working through the normal development process because it seems more expedient at the time.
When you hire new data engineers or dev ops engineers (whoever maintains and deploys your data factories), you would have to explain this process with the multiple repos as it won’t be what they are expecting.
Unless you have some special requirements that dictate this need, I don’t see a good reason to deviate from common practice.
Common Development Process
For a data factory project, we must define a collaboration branch, usually Main. This branch is the only branch that can publish to the live service in your Dev data factory. When you need to update your data factory, you make a (hopefully short-lived) feature branch based off of your collaboration branch. My preference for a medium to large project is to have the Main branch, an Integration branch, and one or more feature branches. The Integration branch brings multiple features together for testing before the final push to Main. On smaller projects with one or two experienced developers, you may not need the integration branch. I find that I like the integration branch when I am working with people who are new to ADF, as it gives me a chance to tweak and execute new pipelines before they get to Main.
Developers work in the feature branches and then merge into the integration branch as they see fit. They resolve any errors and make any final changes in integration and then create a pull request to get their code into Main. Once the code is merged into Main and published to the live service (either manually or programmatically), the feature branches and Integration branch are deleted, preparing you to start the next round of development. Triggering the pipelines in the live service after publishing gives you a more realistic idea of execution times as ForEach activities may not run in parallel when executed in debug mode.
The code in Main should represent a version of your data factory that is ready to be deployed to Test. Code is deployed from Dev to Test according to your preference—I won’t get into all the options of JSON files vs ARM templates and DevOps pipelines vs PowerShell/custom code in this post.
You perform unit testing, integration testing, and performance testing (and any other type of testing as well, but most people aren’t really doing these three in any sufficient manner) in your Test data factory. If everything looks good, you deploy to Production. If there are issues, you go back to your development data factory, make a new feature branch from Main, and fix the issue.
If you find a bug in production, and you can’t use the current version of code in Main, you might want to create a hotfix/QFE. Your hotfix is still created in your development data factory. The difference is that instead of creating a feature branch from Main, you create the branch from the last commit deployed to production. If you are deploying via ARM templates, you can export the ARM template from that hotfix branch and manually check it in to the adf_publish branch. If you deploy from JSON files, selective deployment is a bit easier. I like to use ADF Tools for deployment, which allows me to specify which files should be deployed, so I can do a special hotfix deployment that doesn’t change any other objects that may have already been updated in Main in anticipation of the next deployment.
Having a repo per environment doesn’t technically break anything, but it adds complexity without significant benefits. It adds steps to your deployment process and deviates from industry standards. I won’t go so far as saying “never”, as I haven’t seen every project scenario. If you were considering going this route, I would encourage you to examine the reasons behind it and see if doing so would actually meet your needs and if your team can handle the added complexity.
Over the last few years, I’ve had a few people ask me why I don’t create two Twitter accounts so I can separate work and personal things.
I choose to use one account because I am more than a content feed, and I want to encourage others to be their whole selves. I want to acknowledge that we don’t work in isolation from what’s going on in the greater world.
I love learning things from people on Twitter. I find a lot of quick tips and SQL and Power BI blog posts from Twitter. I love seeing all the interesting data visualizations shared on Twitter.
But I also love learning personal things about people. I love to see photos of your dogs and views while hiking. I like knowing that my friend Justin is really into wrenches. And my coworker Joey (who I followed on Twitter long before I worked with him) is really into cycling. And my friend Matt is into racing. I followed all these people before I met them in person. Many of my Twitter friends became “IRL” friends after we met at conferences.
I definitely lurked around the SQL community (Power BI didn’t exist yet) for a while before I ever worked up the courage to say anything to anyone. And I started out with mostly data/work-related tweets. But as time went on, I realized how much I appreciated it when others shared personal info about themselves; it helped me get to know them better. And I became more comfortable sharing more about myself. So now I’m sort of giving back, with both professional and personal information.
Note: I do this personal/professional crossover only on Twitter, because I have deemed that to be an appropriate audience for both. I don’t share many personal things on LinkedIn. And I don’t share professional things on Instagram or Facebook. That’s my personal preference.
Are there risks to this? Absolutely.
People might stop following me because they don’t care for the personal content. My political opinions or obsession with my dog might turn some people off. Someone might be offended by my tattoos or my occasional use of profanity (which I never direct at someone and use more to express frustration/excitement/surprise). Or they may tire of my frequent use of gifs.
I know that potential and current clients see what I tweet. And it can and does affect their opinion of me. But when you hire me as your consultant, you get my personality and personal experiences as well as my tech knowledge. So, I don’t see it as being so very different from meeting me in real life at a conference or at another social event.
So far, it’s been an overall positive experience having IRL conversations with clients about something I tweeted that they found helpful or entertaining. I do make a point not to name specific clients or projects unless I have their permission to do so (sometimes there are legitimate exceptions). I respect client privacy and confidentiality online and in person.
Before I could get to this place, I had to be secure in myself and secure in my employment and professional network. I recognize that not everyone will like me. That is true in person and on Twitter. And that’s ok. If you want to unfollow me, I’m ok with that. If you want to engage in conversations only about work stuff, that’s great. Feel free to mute words to tune out the rest of my tweets (your best bets are “Colorado”, “Denver”, “dog”, “kayak”, and “hike”). If you want to talk only about dogs and nature and my adorable and efficient camper, that’s also fine.
If you dig deep enough, I’m sure you can find some tweet that wasn’t my best moment. I’m sure I’ve said something regrettable in the 14 years since I joined Twitter. But I’m going to extend myself a little grace and remember that I’m human. And I’ll accept feedback if you think something I’ve said was unkind or made you feel unwelcome.
There is also a risk that someone can use this personal info against me, for harassment or identity theft. That is a risk for anyone sharing anything personal online. For now, I have assessed the risks and the rewards, and I find the rewards to outweigh the risks. I may decide differently later.
Do I recommend this for you?
Here’s the thing: I don’t think there are absolutes when it comes to how people use social media. If it makes you happy and it’s not hurting anyone, it’s probably fine.
It’s important to me that we remember that the people who teach us Azure and SQL and Power BI are people with non-work interests and personal struggles and interesting life experiences. And their more personal content gives me ways to relate to them outside of technology. Sometimes I learn really useful things from them, like the right type of lubricant to fix a squeaky door. Sometimes I notice a current event in their life that I can use to start a conversation at a conference.
Basically, I choose to use Twitter in a more human way. It’s working pretty well for me so far. You can decide if you have the desire and ability to do that. When this way of interacting with people stops being rewarding for me, I’ll reassess and take a different approach.
I’ve now helped people with this issue a few times, so I thought I should blog it for anyone else that runs into the “mystery”.
Here’s the scenario: You are using Python, perhaps in Azure Databricks, to manipulate data before inserting it into a SQL Database. Your source data is a flattened data extract and you need to create a unique list of values for an entity found in the data. For example, you have a dataset containing sales for the last month and you want a list of the unique products that have been sold. Then you insert the unique product values into a SQL table with a unique constraint, but you encounter issues on the insert related to unique values.
Given a data frame with one column containing product names, you might write some Python like df.dropDuplicates(subset="ProductName").
This would give you a list of distinct product names. But Python is case sensitive, so product “abc123” and product “ABC123” are considered distinct values.
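You can see the same behavior in a small pandas sketch (the Databricks scenario above uses Spark's dropDuplicates, but pandas' drop_duplicates is case sensitive in the same way; the data here is illustrative):

```python
import pandas as pd

# Deduplication in Python is case sensitive, so differently cased
# product names both survive the dedup step.
df = pd.DataFrame({"ProductName": ["abc123", "ABC123", "abc123"]})
distinct = df.drop_duplicates(subset="ProductName")
print(list(distinct["ProductName"]))  # ['abc123', 'ABC123']
```

Both casings remain in the "distinct" list, which sets up the unique constraint violation described below.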
Case sensitivity in SQL Server
SQL Server is case insensitive by default. While you can use a case-sensitive collation, most databases stick with a case-insensitive one.
Let’s say I create a table to contain products and add a unique constraint on product name. Adding a unique constraint means that I cannot insert two values into that column that are the same.
CREATE TABLE [dbo].[Products] (
    [ProductName] nvarchar(100) NOT NULL
);

ALTER TABLE [dbo].[Products]
ADD CONSTRAINT UQ_Product_ProductName UNIQUE (ProductName);
I can insert my first row.
Insert into dbo.Products (ProductName)
Values ('abc123');
Now I try to insert another row where the letters are capitalized.
Insert into dbo.Products (ProductName)
Values ('ABC123');
This fails with the following error:
Violation of UNIQUE KEY constraint 'UQ_Product_ProductName'. Cannot insert duplicate key in object 'dbo.Products'. The duplicate key value is (ABC123).
This is because my database collation is set to SQL_Latin1_General_CP1_CI_AS (the CI means case insensitive).
If you need to get to a list of case-insensitive distinct values, you can do the following:
Create a column that contains the values of ProductName converted to lower case, deduplicate based on that lowercased column, and then drop the helper column if you no longer need it.
Of course, you need to be sure that you don’t need to keep both (differently cased) versions of the value as distinct before you follow the above steps. That’s a business/data decision that must be considered. If you are fine with keeping only one value regardless of case, this should work. Be careful if you further use this data in Python. pandas.DataFrame.merge (similar to a SQL join) is case sensitive, as are most Python functions. Make sure you are handling your data correctly there, or just do your joins before you deduplicate.
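The lowercase-column fix can be sketched like this (shown in pandas for brevity; in PySpark the same idea works with lower() from pyspark.sql.functions — the column and data names here are illustrative):

```python
import pandas as pd

# Add a lowercased helper column, deduplicate on it, then drop the helper.
df = pd.DataFrame({"ProductName": ["abc123", "ABC123", "XYZ789"]})
df["ProductNameLower"] = df["ProductName"].str.lower()
deduped = df.drop_duplicates(subset="ProductNameLower").drop(columns="ProductNameLower")
print(list(deduped["ProductName"]))  # ['abc123', 'XYZ789']
```

Note that which casing survives depends on row order, since drop_duplicates keeps the first occurrence by default.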
For more formal enterprise Power BI development, many people have a checklist to ensure data acquisition and data modeling quality and performance. Fewer people have a checklist for their data visualization. I’d like to offer some ideas for quality checks on the visual design of your Power BI report. I’ll update this list as I get feedback or new ideas.
The goal of my data visualization quality checklist is to ensure my report matches my intended message, audience, and navigation.
There are currently 4 sections to my PBI data viz quality check: a purpose/message check, a squint test, a visual components check, and an accessibility check.
Purpose/message check
Can you explain the purpose/message of your report in a single sentence?
Can you explain how each visual on the page supports that purpose/message?
I use the term purpose more often when I have a report that supports more exploratory data viz, allowing users to filter and navigate to find their own meaning in their own decision contexts. Message is much easier to define in explanatory data viz, where I intend to communicate a (set of) conclusion(s). My purpose or message statement often involves defining my intended audience.
If you cannot define the purpose/message of your report page, your report may be unclear or unfocused. If you can’t identify how a visual supports the purpose/message, you may consider removing or changing the visual to improve clarity and usefulness.
Squint test
You can perform a squint test by taking a step back and squinting while viewing your report page. Alternatively, you can use a desktop application or browser add-in that blurs the page, simulating farsightedness. Looking at the blurry report page helps you evaluate the visual hierarchy.
What elements on the page stand out? Should they? Areas of high contrast in color or size stand out. People read larger things first.
Does the page seem balanced? Is there significantly more white space or more bright color on one side of the page?
Does the page background stand out more than the foreground?
When visually scanning an image-heavy page (which follows a Z-pattern in Western cultures), does the order of items on the page make sense? Did you position explanatory text before the chart that needs the explanation? If there are slicers, buttons, or other items that require interaction before reading the rest of the page, are they placed near the top left?
Is there enough space between the items on the page to keep the page from feeling overly busy? Is it easy to tell where one visual ends and another begins?
Visual components check
I have two levels of visual components checks: reviewing individual visuals and reviewing the visuals across a report page.
Individual visuals components check:
Do charts have descriptive, purposeful titles? If possible, state a conclusion in the title. Otherwise, make it very clear what people should expect to find in your charts so they can decide if it’s worth the effort to further analyze them.
Are chart backgrounds transparent or using low-saturation colors? We don’t want a background color standing out more than the data points in the chart.
Are bright colors reserved for highlighting items that need attention?
Are visual borders too dark or intense? If every chart has a border that contrasts strongly with the background, it can take the focus away from the chart and impede the visual flow. We often don’t need to use borders at all because we can use whitespace for visual separation.
Does the chart use jargon or acronyms that are unfamiliar to your intended audience? Try to spell out words, add explanatory text, and/or include navigation links/buttons to a glossary to reduce the amount of effort it takes to understand the report.
Are colors used within and across the page easily distinguishable?
Are fonts used consistently, with only purposeful deviations?
If charts should be compared, are the axis scales set to facilitate a reasonable comparison?
Does the interactivity between elements on the page provide useful information?
Are visuals appropriately aligned? Misalignment can be distracting.
Accessibility check
I have a more comprehensive accessibility checklist on my blog that I keep updated as new accessibility features and tools are released. Below are some important things you can check to ensure your report can be read by those with different visual, motor, and cognitive conditions.
Do text and visual components have sufficient color contrast (generally, 4.5:1 for text and 3:1 for graphical components)?
Is color used as the only means of conveying information?
Is tab order set on all non-decorative visuals on each page? Decorative items should be hidden in tab order.
Has alt text been added to all non-decorative items on the page?
Is key information only accessible through an interaction? If so, you may consider rearranging your visuals so they are pre-filtered to make the important conclusion more obvious.
If you try navigating your report with a keyboard, is the experience acceptable for keyboard-only users? Does accessing important information require too many key presses? Are there interactive actions that can’t be performed using a keyboard?
Do you have any video, audio, or animation that auto-plays or cannot be controlled by the report user?
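The 4.5:1 and 3:1 thresholds mentioned above come from WCAG 2.x. For reference, here is a hedged sketch of how that contrast ratio is computed from two sRGB colors (function names are mine, not from any Power BI API):

```python
# Sketch of the WCAG 2.x contrast-ratio formula behind the
# 4.5:1 (text) and 3:1 (graphical components) thresholds.
def _linearize(c):
    # Convert one 0-255 sRGB channel to its linear-light value.
    c = c / 255
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4

def relative_luminance(rgb):
    r, g, b = (_linearize(c) for c in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b

def contrast_ratio(color1, color2):
    lighter, darker = sorted(
        (relative_luminance(color1), relative_luminance(color2)), reverse=True
    )
    return (lighter + 0.05) / (darker + 0.05)

print(round(contrast_ratio((0, 0, 0), (255, 255, 255)), 2))  # 21.0 — black on white
print(contrast_ratio((128, 128, 128), (255, 255, 255)))  # mid gray on white, just under 4.5:1
```

A quick check like this can tell you whether a chart title or data label color actually clears the threshold against its background.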