Lately, I have been using SSIS execution frameworks and Biml created by other people to populate data marts and data warehouses. It has taught me a few things and helped me clarify what I like and dislike compared to my usual framework. I’ve started my list of preferences below. There are probably situations where I would want to deviate from my preferences, but I think they make a good starting point.
Populating Data
- For self-service BI environments, use a date dimension that doesn’t go out much further than the greatest date in your data. This can be a view or stored procedure that limits and updates dates rather than a static date dimension that goes out until the end of time. (A sketch of the view approach follows this list.)
- Unknown values are included in normal dimension loads, not in separate scripts that must be run on deployment. This way, if an unknown value is ever left out or deleted, it will be added in the next data load rather than requiring a special execution of a script.
- Every table should have InsertDateTime and UpdateDateTime columns. The UpdateDateTime column should be populated with the same value as the InsertDateTime column upon creation of the row, rather than being left null.
- Whatever you use to create tables, include primary keys, foreign keys, and indexes with your table definitions. Provide explicit constraint names to simplify database comparisons. You can disable your foreign keys, but they need to be there to provide that metadata. (See the table sketch after this list.)
- Separate your final dimensional/reporting tables from audit tables and staging tables. This can be done with separate schemas or even separate databases.
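Here is a minimal sketch of the rolling date dimension idea. The names (dbo.DimDate, dbo.FactSales, OrderDate) are hypothetical; the point is that the view caps the dimension at the greatest date in your data instead of materializing dates to the end of time.

```sql
-- Hypothetical names: dbo.DimDate is the full static date table,
-- dbo.FactSales is the fact table whose greatest date caps the view.
CREATE VIEW dbo.vwDimDate
AS
SELECT d.DateKey,
       d.FullDate,
       d.CalendarYear,
       d.CalendarMonth
FROM dbo.DimDate AS d
WHERE d.FullDate <= (SELECT MAX(f.OrderDate) FROM dbo.FactSales AS f);
```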
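And here is a sketch of a dimension table that follows the points above: explicitly named constraints, InsertDateTime and UpdateDateTime populated together, an AuditLogID column, and the unknown member inserted as part of the normal load rather than a deployment-only script. All of the names (dw.DimCustomer, the -1 unknown key, and so on) are illustrative.

```sql
CREATE TABLE dw.DimCustomer
(
    CustomerKey    INT IDENTITY(1, 1) NOT NULL,
    CustomerAltKey NVARCHAR(50)  NOT NULL,
    CustomerName   NVARCHAR(200) NOT NULL,
    HashValue      VARBINARY(32) NULL,   -- change-detection hash (e.g. HASHBYTES('SHA2_256', ...))
    AuditLogID     INT           NOT NULL,
    -- Defaults are a fallback; the load itself typically sets both
    -- columns to the same value explicitly.
    InsertDateTime DATETIME2(3)  NOT NULL
        CONSTRAINT DF_DimCustomer_InsertDateTime DEFAULT (SYSDATETIME()),
    UpdateDateTime DATETIME2(3)  NOT NULL
        CONSTRAINT DF_DimCustomer_UpdateDateTime DEFAULT (SYSDATETIME()),
    CONSTRAINT PK_DimCustomer PRIMARY KEY CLUSTERED (CustomerKey)
);

-- The unknown member is added (if missing) as part of every dimension load,
-- not in a separate deployment-only script.
IF NOT EXISTS (SELECT 1 FROM dw.DimCustomer WHERE CustomerKey = -1)
BEGIN
    SET IDENTITY_INSERT dw.DimCustomer ON;

    INSERT INTO dw.DimCustomer (CustomerKey, CustomerAltKey, CustomerName, AuditLogID)
    VALUES (-1, N'Unknown', N'Unknown', -1);

    SET IDENTITY_INSERT dw.DimCustomer OFF;
END;
```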
Data Integration Process
- There should be consistent error handling in each layer (staging, dims, facts, etc.). If you write errors to another location (flat file, database table), have a process that notifies the right people that errors occurred. The process of consuming corrected data must be built, tested, and integrated into the existing process.
- Make your error handling process reflect what end users need to see when an error occurs. Does it make sense to have a partial load when there is an issue? Or should it be all or nothing?
- Have smart master packages that determine which packages to run. Don’t check whether the package should run inside of the package itself – do that in the master package.
- Master packages should execute child packages in parallel as much as possible rather than defaulting to sequential execution.
- Have an audit log with one row per package. Include the SSIS ServerExecutionID in the audit log – not the package-specific ID but the execution ID for the entire run. If there are incremental loads, the WHERE clause used to filter the load should be captured in the audit table. Include row counts as well as package start and stop times in your audit log. (A sketch of such a table follows this list.)
- Add an AuditLogID column on your dimension, fact, and staging tables so you can trace each row back to the process that populated it.
- For dims and facts, perform change detection/deduplication of records, usually through hash values and either SSIS lookups or SQL queries with WHERE NOT EXISTS.
- Avoid T-SQL MERGE statements. Write individual insert/update/delete statements. This avoids any bugs in MERGE and makes your SQL easier to understand and troubleshoot (see the sketch after this list).
- Use consistent naming of tasks, sources, destinations, packages, connection managers, etc. Connection managers pointing to databases should have names that refer to the database rather than the server.
- If you are downloading files, move them to an archive folder once they are processed. You can have rules in place if you have retention limits. But you probably need to keep files from at least the last load for audit and troubleshooting purposes. This could change if you are importing very sensitive data.
- Even if you need to copy all columns from a table, write a select statement for database sources that explicitly names fields rather than using SELECT * or just selecting the table or view.
- SSIS lookups should use an explicit query rather than referencing an entire table.
- Implement restartability at the package level for most packages (you should have single-purpose packages executed by a master package). Checkpoints are ineffective within a package. If you build your audit log table correctly, you can get the list of packages that have not run in the last X minutes/hours and feed that to your master package (see the query sketch after this list).
- Send email from your scheduling tool rather than within an SSIS package.
- Track data lineage in your tables. This can be as simple as having a table that lists all of your data sources with an ID column and including that ID value in each row of your staging, fact, and dimension tables (a minimal sketch follows this list).
- Dims and facts are not truncated. Data should be inserted and updated (and deleted, if necessary).
- Connection strings used in multiple packages should be project-level connection strings.
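To make the audit log point concrete, here is a sketch of the kind of table I have in mind. The schema and column names are illustrative, not prescriptive; the essentials are the ServerExecutionID for the whole run, the incremental WHERE clause, row counts, and start/stop times.

```sql
CREATE TABLE audit.PackageAuditLog
(
    AuditLogID             INT IDENTITY(1, 1) NOT NULL,
    ServerExecutionID      BIGINT         NOT NULL,  -- SSIS catalog execution ID for the entire run
    PackageName            NVARCHAR(260)  NOT NULL,
    IncrementalWhereClause NVARCHAR(1000) NULL,      -- filter used for incremental loads
    RowsExtracted          INT            NULL,
    RowsInserted           INT            NULL,
    RowsUpdated            INT            NULL,
    PackageStartTime       DATETIME2(3)   NOT NULL,
    PackageEndTime         DATETIME2(3)   NULL,
    PackageStatus          NVARCHAR(20)   NOT NULL
        CONSTRAINT DF_PackageAuditLog_Status DEFAULT (N'Running'),
    CONSTRAINT PK_PackageAuditLog PRIMARY KEY CLUSTERED (AuditLogID)
);
```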
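For the change detection and MERGE points, here is a sketch of the pattern using separate UPDATE and INSERT statements. It assumes a staging table (stg.Customer), a HashValue column computed during staging (e.g. with HASHBYTES), and an @AuditLogID passed in from the package; all of those names are illustrative.

```sql
-- @AuditLogID would normally be passed in from the executing package.
DECLARE @AuditLogID INT = 0;

-- Update rows whose hash has changed.
UPDATE d
SET    d.CustomerName   = s.CustomerName,
       d.HashValue      = s.HashValue,
       d.AuditLogID     = @AuditLogID,
       d.UpdateDateTime = SYSDATETIME()
FROM   dw.DimCustomer AS d
       INNER JOIN stg.Customer AS s
           ON s.CustomerAltKey = d.CustomerAltKey
WHERE  s.HashValue <> d.HashValue;

-- Insert rows that do not exist yet.
INSERT INTO dw.DimCustomer (CustomerAltKey, CustomerName, HashValue, AuditLogID)
SELECT s.CustomerAltKey, s.CustomerName, s.HashValue, @AuditLogID
FROM   stg.Customer AS s
WHERE  NOT EXISTS (SELECT 1
                   FROM dw.DimCustomer AS d
                   WHERE d.CustomerAltKey = s.CustomerAltKey);
```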
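And a sketch of the restartability query: given the audit log above and a hypothetical etl.PackageList table naming every child package, the master package can ask which packages still need to run.

```sql
-- Child packages that have not completed successfully in the last 12 hours;
-- the master package loops over this list and executes only these.
SELECT p.PackageName
FROM   etl.PackageList AS p
WHERE  NOT EXISTS (SELECT 1
                   FROM audit.PackageAuditLog AS a
                   WHERE a.PackageName = p.PackageName
                     AND a.PackageStatus = N'Succeeded'
                     AND a.PackageEndTime >= DATEADD(HOUR, -12, SYSDATETIME()));
```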
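The data lineage point can be as small as this (names illustrative):

```sql
CREATE TABLE etl.DataSource
(
    DataSourceID   INT           NOT NULL,
    DataSourceName NVARCHAR(100) NOT NULL,  -- e.g. 'CRM', 'Payroll flat files'
    CONSTRAINT PK_DataSource PRIMARY KEY CLUSTERED (DataSourceID)
);

-- Every staging, dimension, and fact row then carries the DataSourceID
-- of the source that produced it.
```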
Biml Specifics
- Understand whether you need a flexible Biml Framework or just an accelerator for a current project. If you need flexibility, don’t hardcode connection strings and other things that change when you add/change sources and destinations. If you just need to accelerate development of a simple data mart, total flexibility may be overkill and actually cause more work.
- Have a single place where you add synthetic metadata, as much as possible. BimlScript gets messy and difficult to understand when you have some extended properties that are read in, some annotations added directly, and some variables defined in your code. This is why I like synthetic metadata stored in a database (a sketch of such a store follows this list). Also, extended properties don’t exist in Azure SQL Data Warehouse, so if you need your framework to work there, you can’t go that route.
- Don’t repeat your code in multiple files. If you have some logic that gets reused, move it to a separate file and reference it from other files.
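As an example of what a single metadata store can look like, here is a sketch of a table that BimlScript could read at build time. The table and column names are illustrative; the point is one queryable place instead of a mix of extended properties, annotations, and spreadsheets.

```sql
CREATE TABLE etl.SourceTableMetadata
(
    SourceTableID     INT IDENTITY(1, 1) NOT NULL,
    SourceSystem      NVARCHAR(50)  NOT NULL,  -- e.g. 'CRM', 'ERP'
    SourceSchemaName  NVARCHAR(128) NOT NULL,
    SourceTableName   NVARCHAR(128) NOT NULL,
    LoadPattern       NVARCHAR(20)  NOT NULL,  -- e.g. 'Full', 'Incremental'
    IncrementalColumn NVARCHAR(128) NULL,      -- watermark column for incremental loads
    IsActive          BIT           NOT NULL
        CONSTRAINT DF_SourceTableMetadata_IsActive DEFAULT (1),
    CONSTRAINT PK_SourceTableMetadata PRIMARY KEY CLUSTERED (SourceTableID)
);
```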
What Do You Think?
What’s on your SSIS preferences list? Do you disagree with one of my preferences and want to share your knowledge? Let’s chat in the comments.
“…This is why I like synthetic metadata stored in a database.”
Yes! Or, at least, in some easily retrievable form (Azure tables and whatnot). This allows easy maintenance of both the metadata and the scripts/templates that make use of it because of, if nothing else, separation of concerns.
The same metadata should be usable for more than one script, meaning it may well be a super set of what any one script might need.
Yes! Exactly. Yeah, I should have just said some easily retrievable form.
I would agree with almost everything you’ve said there. We do use extended props. (This can cause issues in SSDT change detection, so we use our own tables and sprocs to get, set, and keep track of values, which lets us use the same approach in Azure SQL DWH.) We also export our metadata and current ext props to flat files we can push into version control. Having some of your config in DB tables and the rest in code files always feels a bit risky to me. Having it all aligned in version control makes me less twitchy.
Yes, good point. We should definitely back up or source control our metadata.
Nice suggestions! If you don’t mind, I would like to add two other preferences:
1. Try as much as possible to read huge volumes of data in chunks: https://stackoverflow.com/questions/54168322/reading-huge-volume-of-data-from-sqlite-to-sql-server-fails-at-pre-execute/54172707#54172707
2. When using an OLE DB Source, select only the needed columns (with the primary key): https://dba.stackexchange.com/questions/176930/ssis-ole-db-source-editor-data-access-mode-sql-command-vs-table-or-view/176933?stw=2#176933
Your second point is covered with “SSIS lookups should use an explicit query rather than referencing an entire table”, just not with the level of detail you provided. Thanks for that.
I agree that both points are very similar, but it is worth mentioning that we must do the same thing with the OLE DB Source (not only lookups).
I am not familiar with “synthetic metadata”. Can you elaborate? A quick Google search did not yield any useful results (that I could tell). Thanks, David
Hi, David. Natural metadata is metadata that exists as an artifact of pre-existing processes. This includes database schemas and self-describing web services. Hybrid metadata is usually simple key/value pairs of information that we add on. This includes Biml annotations and database extended properties. Synthetic metadata usually involves a standalone metadata store that captures logic and pattern selection. This can be stored in a database, Excel files, or web services. What I’m advocating in my blog post above is that if you know you have a bunch of synthetic metadata, go ahead and store everything in one place, rather than having some extended properties, some Biml annotations, and some Excel spreadsheets. I want it to be clear to the people who will maintain or enhance my code where they should look for that information.
Intuitively, that is what I assumed so at least my intuition is working. 🙂 Thanks so much. I totally agree with this. I am trying to do this with a system we are using on campus where the metadata and logic are scattered about. Very confusing to say the least, and very hard to maintain.
Thanks,
David
Like your preference list! Only one question. “Send email from your scheduling tool rather than within an SSIS package.” Why do you like that approach? I ask because our operational DBAs said no to letting the job agent send emails but are OK with using SMTP within an SSIS task. Can’t remember why they argued for that, but I’m interested in your position!
Tim Mitchell has a good blog post about this that explains several reasons. I agree with all of them. https://www.timmitchell.net/post/2016/07/14/get-your-email-out-of-my-etl/