Implementing CI/CD With dbt
Written: June 5, 2024
dbt & CI/CD
Introduction
One of dbt's core features is that it lets developers keep the state of their data warehouse under version control. This reduces human error during updates, enables graceful rollbacks, provides logs for auditing and debugging, and supports disaster recovery. These are some of the reasons the term "analytics engineer" was coined. Many people also attach the term "DataOps" to these practices, though I would argue this is just one small component of DataOps.
If you're just getting started with dbt, this topic may be fairly advanced. You may want to try reading Starting With dbt first. Also, if you're new to CI/CD or want to focus more broadly on making Great CICD Workflows, you should check out my other post.
Design Goals
My goal is to discuss how to construct a CI/CD pipeline for dbt, along with the considerations and trade-offs involved. For example, what are the failure modes of a CI/CD pipeline for dbt? Can my builds become too slow, too expensive, or just plain destructive? How can I prevent or mitigate those issues?
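To make those failure modes concrete, here is a minimal Python sketch of two common mitigations, not a definitive implementation. Building only the models that changed (dbt's state:modified+ selector together with --defer and --state) keeps builds from becoming slow or expensive, and writing to a schema named after the pull request keeps a CI run from overwriting production tables. The function names, the "ci" target, the "dbt_ci" prefix, and the "prod-artifacts" directory (assumed to hold a manifest.json from the last production run) are all hypothetical, and the sketch assumes your dbt profile picks up the schema name from somewhere such as an environment variable.

```python
import shlex


def ci_schema_name(pr_number: int, prefix: str = "dbt_ci") -> str:
    """Return an isolated schema (a dataset in BigQuery) name for one pull request.

    The naming convention is an assumption made for illustration only.
    """
    if pr_number <= 0:
        raise ValueError("pr_number must be a positive integer")
    return f"{prefix}_pr_{pr_number}"


def dbt_ci_command(state_dir: str, target: str = "ci") -> list[str]:
    """Build a dbt invocation that only touches modified models.

    state_dir is assumed to contain artifacts from a previous production run,
    which dbt compares against to decide what has changed.
    """
    return [
        "dbt", "build",
        "--select", "state:modified+",  # changed models plus everything downstream
        "--defer",                      # resolve unchanged parents from the state manifest
        "--state", state_dir,
        "--target", target,             # hypothetical CI target in profiles.yml
    ]


if __name__ == "__main__":
    print(ci_schema_name(42))
    print(shlex.join(dbt_ci_command("prod-artifacts")))
```

In a real pipeline, the automation step would pass the returned list to something like subprocess.run after exporting the schema name, and would drop the per-PR schema once the pull request is merged or closed.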
For this project, I will be using GitHub, with Actions and/or webhooks, to store my dbt code and perform automation. There are other CI/CD alternatives (Jenkins, Atlassian, Travis, Circle, Argo), but GitHub is widely used and very approachable. I will be using BigQuery as my data warehouse. Like other cloud-based options (e.g. Snowflake or Redshift), it is pay-as-you-go. Compared with the Postgres on RDS setup in my Starting With dbt guide, it has a much more affordable base cost and higher overall scalability.
I want to quickly address why I am not using infrastructure as code (IaC). While IaC is great for sharing projects that you can build yourself, it's not really going to help me illustrate these issues. There are also now many contenders for the de facto container portability solution (Docker, Podman, Kubernetes), none of which provides a realistic starting point for most analytics engineers, who already have GitHub and a data warehouse managed for them.
Credit To Others
I found a great article on this topic by Stas Sajin on Medium.
Stay Tuned!
That's all for now, but I'll be sure to have more soon. Send me messages and tell me what I'm doing right or wrong!