I'm Nicholas Yager! I'm a Principal Analytics Engineer at HubSpot, and previously I've and today I'm going to talk with you about dbt projects and gardens.
Well... not just any type of garden. Specifically, overgrown gardens. I LOVE overgrown gardens. They are monuments to nature's ability to create life, and for living organism's ability to grow into nearly any niche available to them. The downside to overgrown gardens, is that they are impossible to for a gardener to maintain. If we want to start planing productive crops and beautiful flowers, we're at the mercy of the verdant life already there. We as gardeners implement our designs within _this_ garden as it already exists. Of course, you're not here to discuss horticulture per se. This is a data conference, after all. So, bring dbt DAGs into this.
I'm reasonably confident that you all are familiar with this type of image. A tangled web of references that is nearly inscrutable at first glance, and that's if you're willing to wait for the dbt docs render. Now, marketing images and trivial guides always have dbt projects with are neat, tidy DAGs that are easy to view on one screen. But, this is not the reality for most dbt users! Organizations in the real world have more than perfect representations for transactions and customers. Business processes can be really complex, and this necessitates that your dbt projects be complex enough to encapsulate these processes as core entities. Having that been said, there is a fine line between a project being complex, and being messy. In my opinion, complexity is a byproduct of constraints -- you must work with the data you have, and sometimes this requires extra layers of abstraction and processing. Messiness and clutter, however, happen as an issue of scale. Quite simply
As organizations grow, they trend towards decentralization. It's very easy to keep all of your analytics talent aligned and in one central org while you have 5, 10, heck 20 people, but at a certain point, it is not longer effective to have once central block of AE talent. Many organizations follow a two-pizza rule -- the idea that effective teams are usually those fed by two pizzas -- and as such will decompose teams into specific domains. This decentralization then opens the door to drift.
Once you're in territory where there are multiple distinct teams operating independently, it's common for there to be drift in how the teams operate. This can be something as trivial as leading commas vs trailing commas. It can also ne something as important as the definition of a Customer. Perhaps Sales analytics defines a customer as a CRM account with subscription, whereas a finance analytics team may define a customer as a corporate entity. This definition mismatch means that these two analytics teams now have entirely incompatible customer reporting. This adds to overhead and makes it difficult for these teams to collaborate.
And now we're in the endgame. Do we put in the effort to have Sales analytics align with Finance or vis versa, or do we make just a few more models that shim together bits of both to workaround the reporting difference? dbt Core is a wondrous tool. makes it delightfully easy to reference existing models and start pulling in data from somewhere else in the project. This comes at a price, however, in that each new model we add to workaround our differences is one more liability for our team in the future.
This leads to an overgrown garden that, while full of life, is a nightmare to reason about and to maintain. So, if we want to make our overgrown dbt projects maintainable and ergonomic, it stands to reason that we can work like gardeners to rehabilitate them. Thankfully, gardeners have been working on this problem for a _long_ time, and there are five reasonably accepted steps for rehabilitating an overgrown garden.
![bg](https://c2.staticflickr.com/4/3092/5809787537_7d926fca26_b.jpg)
1. Remove deprecated or otherwise unused models 2. Consolidate duplicate models
Gradual removal of unproductive branches to allow a plant to spend its resources growing healthy branches. Within the context of a dbt project, this is a matter of identifying anti-patterns and other non-productive structures in your DAG, and refactoring them to make them more ergonomic.
Based on my experience working as a consultant on dozens of dbt projects, there are four main anti-patterns that tend to add unnecessary messiness to a project. <talk about them!>
While these are relatively simple patterns to look out for, we're talking about large projects here, so it would behoove us to not perform this type of pruning manually. Thankfully, there are some great open source tools to make this process easier.
It's dangerous to go alone! Take these.
First and foremost, there is the dbt-project-evaluator package, which leverages Jinja and other introspection features of dbt macros to dig into the project's DAG. This means that there are no third-party dependencies, and you can bake architectural auditing into your dbt testing practices. If you want to learn more about this, I recommend watching Grace Goheen's Coalesce 2022 presentation on the dbt-project-evaluator package.
There are also third party tools that enable auditing dbt project architecture. One open source third party tool is Whetstone, which provides a handy web UI for inspecting your project's architectural patterns, and it integrates with dbt Cloud webhooks so new analyses can be performed after every dbt run. This brings us to the beefiest of the steps...
Dividing the perennials is the notion that we ought to separate your most industrious plants to prevent overcrowding and to allow for specialized treatment of plants in the garden. Within the context of a dbt project, your perennials are your core entities. Your core entity models _are_ the building blocks for your analytics practice, and as such accrete downstream data consumers. We can to divide our perennials by partitioning the subgraphs of our dbt project that feed into our core entities, and define robust interfaces for these core entities to standardize access patterns and to reduce sprawl in vital sections of our project.
This brings us to model governance features like groups, access, and versions. Groups and access allow dbt users to add permissions around which models can be accessed, and versions allow for creating documented versioned models that can be contracted and deprecated. Together, these tools allow AEs to treat their models like robust interfaces for other teams. Let's look at an example.
Suppose we have a tiny toy model with CRM data, product data, core entities for deals and user events, and a one-off report model made by a product analyst. Notice that we're breaking an architectural guideline! We're duplicating the same joins. This tells us that `report_product_outcomes` is not using our clean, beautiful entities. We can enforce use of these entity models by assigning these models to groups, and setting our staging models to private access to prevent models from other groups from accessing them.
This is quite easy to do for the YAML fans in the audience! We can define a groups via YAMl, like the revenue and customer success groups in the first code block. Then, we can assign our models to our groups of interest. Not that we are defining deals as a public model and stg_crm__customers as private.
If we were to go though and assign our other models to their corresponding groups, we would have public models for deals and user events, and the report models would be required to use our public models as a common interface! With common interfaces in place, we are able to carve out areas of decreased mess _and_ complexity. Now, suppose we need to make breaking changes to the `deals` model. We now have hard dependencies downstream that we need to be mindful of. First and foremost, talk with your data consumers, but from a technical perspective, we can leverage versions!
Model versions are a way of dynamically aliasing refs to an appropriate version of your model. Versions are defined in model property yaml, and each version can have specific column definitions and metadata. Each version has a specific version-suffixed SQL file definition, but they all share the same model properties. In our scenario, suppose we're dropping the `favorite_color` field from our deals model. We can create a pre-release of a v2 of the `deals` model that with the appropriate documentation, and set a deprecation date on the v1 of deals.
Now, when `report_product_outcomes` references deals, dbt will log a warning the `deals_v1` has a deprecation date, and will suggest that the model be updated to use the next version.
When we're ready to cut over to the new version, we only need to update the `latest_version` property, and now all refs to the latest `deals` model will automatically start using the v2 version of the deals model. Once the transition has completed, it's easy to remove the v1 model of deals and drop the table to clean up the unused model.
To make it easier for practitioners to apply groups, access, and versions in large projects, a handful of dbt community members have made the `dbt-meshify` python package. This tool allows you to these model governance tools to dbt projects using selector syntax. For example, imagine adding a `Revenue` group to all models upstream to deals and setting deals to public access using a single command. Expect more from this tool as the dbt community explores how to effectively use these new model governance features.
And now we're really going to push the frontier into multi-project deployments. You may familiar with this under the name "multi-project collaboration", or cross-project references, but multi-project deployments is the practice of decomposing large dbt projects into smaller, self-contained dbt projects that have clear robust interfaces, and that can been seen as black-boxes by downstream consumers.
Using our previous toy DAG as an example, we could conceivably split our project into two smaller projects -- one for go-to-market analytics and one for product. This separation adds operational complexity -- you will have different projects to orchestrate and depending on your setup there may be separate repos or other separate parts of the workflow. But, there are three main benefits. 1. Teams have more flexibility and self-determination 2. Clear lines of ownership and responsibility for all models. - Each project can be treated like a black-box. If we can trust the interfaces between our projects, then we can reduce the amount of organizational overhead 3. By enforcing the use of public interfaces, we will necessarily promote more robust and less messy architectural practices. We've made undesirable states for our project unrepresentable! Now, let's get into the prickles. Multi-project deployments are on the bleeding edge of dbt practice. Official support of multi-project deployments is very new, with the underlying capability within dbt Core having only been introduced in dbt 1.6.0 with the introduction of their plugin system.
ADditionally, dbt Labs has decided to restrict this functionality to dbt Cloud enterprise users using either the cloud IDE or the in-beta cloud CLI. And on top of that use of multi-project deployments is being limited to a closed beta program of customers. But ... the beauty of open source code is that we can look at the code, see the outline of what's missing, and infer how it works!
To that end, I've created the dbt-loom python package, an open source and open-licensed dbt plugin that enables native multi-project deployments. After adding this package as python dependency along side dbt core and creating a config file, this package identifies public models from upstream projects and injects them into your downstream project to enable cross-project references. Once again, mutli-project deployments are in their early days, but I suspect that this technique will become much more common as the dbt community starts embracing smaller, more manageable projects with clean interfaces as a way to escape maintaining massive monolith projects.
Now that we've gotten our garden into a more maintainable state state, it's vital that we prevent weeds and other unwanted plants from taking root. In a garden there are myriad tools and techniques for weeding your garden, but the most important factor has nothing to do with your tools.
When it comes to ongoing maintained, I am of the opinion that your _process_ is more important than your tools. It's a matter of deliberate practice. A matter of interspersing small amounts of maintenance work into your everyday tasks. With this in mind, I have three main recommendations:
Showing DAG changes makes it easy to contextualize what changes are being made. The human brain is an amazingly powerful prediction machine that takes visual input and simulates a universe. We can leverage this ability to pattern match and predict by looking at how your DAG changes structurally. Similarly, stick to a consistent SQL formatting convention. When your project is standardized in its format, it becomes very easy to instinctively recognize non-performant code -- it will literally look a little funny once your team has become accustomed to what good looks like. Lastly, use TESTS! Mistakes happen, and enforcing even minimal testing can go a long way.
We've come a long way! Our project now has fewer unused models, a more efficient structure, clearly-delineated responsibilities, and automated guardrails to keep new growth in check. We have taken an overgrown project, and turned it into a robust, maintainable platform for future development.
So, we can now take a small break. Take a breath. Enjoy our handy work. and when we're ready, we can grow an even better future.