Starting With dbt

Started: May 17, 2024
Last Updated: June 5, 2024

Why I'm Interested In dbt

I'm writing about dbt because it's everywhere. One job description after another lists "dbt" as a requirement. I see what it does on paper: version control, CI/CD, testing, docs, dependency management. I've pushed for all of these things as a software engineer and dbt claims to do this for analysts. As a data engineer, I have seen that it can be challenging to manage data quality and lineage as your company builds table on top of table on top of table in medallion architecture, so maybe there is something here. 

But, I have sooo many questions:


My goal in this article (series?) will be to answer those questions, answer any questions you might have, and basically get up to speed while documenting my journey.

What is dbt?

dbt (short for "data build tool") helps analysts work like engineers by making common data engineering tasks easy to manage. There's lots of information out there on dbt's site and on YouTube. In the talk I link below, Carly Kaufman attributes the coining of the phrase "analytics engineering" to dbt's philosophy of enabling analysts to work like engineers. Specifically, dbt aims to provide better version control and CI/CD, testing and documentation, and dependency management. It's also increasingly common in job descriptions for data engineers. There are some huge clients on their site: BHP, Condé Nast, Domain, HubSpot, jetBlue, Nasdaq, Vestas, Sunrun, code42, and McDonald's Nordics.

How dbt Positions Themselves

This image from dbt's site helps position it within a set of tools commonly used by its customers. It helps people see which integrations exist and which tools dbt does not replace.

How I Think About dbt

I want to add more color for the data engineers and analysts. The core flow for dbt is to write some SELECT queries and YAML configs, check them into version control, and let a scheduler run dbt, which wraps and executes your queries as DDL/DML in a database.
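To make that flow concrete, here's a hypothetical sketch (the model name and source table are made up, not part of this project): a model is just a file containing a SELECT, and dbt wraps it in the DDL implied by its configured materialization.

-- models/staging/stg_orders.sql  (hypothetical example)
select
    id,
    customer_id,
    amount
from raw.orders
where amount is not null

-- With a "table" materialization, dbt executes something roughly like:
--   create table analytics.stg_orders as (<the select above>);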

Another Great Resource

If you're in a "listen to talks and sip coffee while writing code" kind of mood, there's a great video in which Carly Kaufman from dbt Labs (formerly Fishtown Analytics) does an amazing job explaining the motivations, use cases, and core concepts.

Installation

Finding the Right Guide

At first, I chose to follow dbt's getting started guide for GCP (because I also want to learn more about GCP, and it's becoming ever more popular for ML work). There are several other guides as well, for Azure Synapse, Databricks, Microsoft Fabric, Redshift, Snowflake, and Starburst. These aren't the only data platforms compatible with dbt, though; there are also trusted adapters (Spark, Athena, Glue, Dremio, Postgres) and community adapters (ClickHouse, DuckDB, Hive, MySQL, Upsolver).

When I saw that the first steps were to make an account, I was pretty sure I took a wrong turn somewhere. And yes, that wrong step was not realizing that "dbt cloud" is, of course, dbt Labs' SaaS offering while "dbt core" is an open source tool. In fact, there's a whole page on their site about it. Silly me.

Now then, dbt's site has steered me to a guide for using dbt core with GitHub Codespaces, but they have others for pip and Docker users as well as installation from source. The Codespaces approach might save me 10 minutes, but I don't always use GitHub for work. The other three options are readily available on a local laptop and might give me more flexibility for my repo layout.

Installing dbt

Proceeding with the pip guide, there are only two major steps:

Following the guide, you can just copy-paste the instructions, but if you want my take on their approach, keep reading. Please keep in mind that these code snippets are specific to my machine and goals.

Virtual Environment

The dbt guide recommends creating a virtual environment to install dbt, and I agree. They also suggest configuring a global alias to activate the environment; I think this second step is more a matter of preference. You'll need to run "env_dbt" every time you open a terminal before you can use dbt, and if you end up with more dbt installations for some reason, you'll need to create separate aliases (or just use "source" directly or in a script).

python -m venv dbt-env           # create the environment

vim ~/.bash_profile              # Edit your shell profile

Add an alias to the file:

alias env_dbt='source /Users/Chris/Code/CloudDemo/dbt-env/bin/activate'

Type ":wq" to save and quit.

source !$      # Update shell with new profile

env_dbt        # Activate the environment

Later on, I found that this makes organizing your project a little weird; the environment directory should probably live entirely outside your project. If you don't keep them separate, it's not a big deal, but make sure you update your .gitignore file to exclude the venv.

Adapter

The dbt guide shows how to install dbt-core and an adapter, dbt-postgres, together. I'll stick with Postgres too, for now. It's familiar and lightweight and lets me explore dbt right away without dealing with GCP just yet.

Since dbt-postgres depends on dbt-core, one command installs both. Assuming that you still have the dbt-env environment activated, this will put your dbt packages in dbt-env/lib/python<version>/site-packages/ instead of your system Python's lib path.

python -m pip install dbt-core dbt-postgres   # Install dbt with pg adapter
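As a quick sanity check that the install landed inside the venv rather than your system Python (output will vary by machine):

which dbt        # should point into dbt-env/bin/
dbt --version    # prints the installed dbt-core and adapter versions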

Connecting Datasources

Installing Postgres

Before we can connect, we must make sure Postgres is installed and that the server is running. The Postgres download site shares a link to Postgres.app, which might be easier to manage for non-admin readers. I will be using a previously installed and initialized server, and I'm using pgAdmin as a diagnostic aid.

Connecting With dbt

I'm using the dbt postgres connection guide if you want to follow along. Since we're using dbt-core, not dbt-cloud, we need to create and update dbt's config file, profiles.yml. However, there's more setup required for our dbt project first, and creating profiles.yml now could interfere with that. Not to mention, I have no idea where it goes yet.

Initializing Your dbt Project

Initializing a dbt project is seemingly easy: just type dbt init and follow the interactive prompts to configure your project. You can't name it "dbt" and you cannot use hyphens "-", which is a bit annoying. It will also generate a "logs" directory; if dbt init ran successfully, you can delete it. Future dbt commands need to be run from within the generated directory and will also produce log directories that you should list in your .gitignore. Also, if you configure a connection profile (e.g. postgres), the credentials will be stored in "~/.dbt/profiles.yml". The security considerations may or may not be important depending on your company's threat model.
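If you're assembling your own .gitignore, a minimal sketch might look like this (the dbt-env/ entry assumes you kept the venv inside the repo; recent starter projects already ignore some of the dbt-generated directories, so check before duplicating entries):

dbt-env/        # the virtual environment, if it lives in the repo
logs/           # log directories produced by dbt commands
target/         # compiled SQL and run artifacts
dbt_packages/   # packages installed by "dbt deps"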

Logging

Now is a decent time to mention dbt's logging. Whenever you run dbt, it will log to files. You can change the default log level or override it with each command, and you can also change the log location. If you are inside a dbt project, logs go to the project's log directory, usually <project root>/logs/; outside of a dbt project, they go to the current working directory (i.e. '.'). You can learn more here: https://docs.getdbt.com/reference/global-configs/logs
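For example, both the level and the location can be overridden per invocation with the global flags documented on that page:

dbt --log-level debug run           # more verbose logging for this run only
dbt --log-path /tmp/dbt_logs run    # write logs somewhere other than ./logs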

Testing The Connection

Ultimately, we want to be able to build, test, and run our dbt project.

It's time to "dbt run", right? No, it's not:

(dbt-env) [Chris@/Users/Chris/Code/CloudDemo]$ dbt build

15:45:35  Running with dbt=1.8.0

15:45:35  Encountered an error:

Runtime Error

  No dbt_project.yml found at expected path /Users/Chris/Code/CloudDemo/dbt_project.yml

  Verify that each entry within packages.yml (and their transitive dependencies) contains a file named dbt_project.yml


It's time to "dbt build"! Almost:

(dbt-env) [Chris@/Users/Chris/Code/CloudDemo/dbt_demo]$ dbt build

15:45:46  Running with dbt=1.8.0

15:45:47  Registered adapter: postgres=1.8.0

15:45:47  Unable to do partial parsing because saved manifest not found. Starting full parse.

15:45:48  Found 2 models, 4 data tests, 413 macros

15:45:48  

15:45:49  Concurrency: 3 threads (target='dev')

15:45:49  

15:45:49  1 of 6 START sql table model dbt.my_first_dbt_model ............................ [RUN]

15:45:49  1 of 6 OK created sql table model dbt.my_first_dbt_model ....................... [SELECT 2 in 0.25s]

15:45:49  2 of 6 START test not_null_my_first_dbt_model_id ............................... [RUN]

15:45:49  3 of 6 START test unique_my_first_dbt_model_id ................................. [RUN]

15:45:49  2 of 6 FAIL 1 not_null_my_first_dbt_model_id ................................... [FAIL 1 in 0.08s]

15:45:49  3 of 6 PASS unique_my_first_dbt_model_id ....................................... [PASS in 0.08s]

15:45:49  4 of 6 SKIP relation dbt.my_second_dbt_model ................................... [SKIP]

15:45:49  5 of 6 SKIP test not_null_my_second_dbt_model_id ............................... [SKIP]

15:45:49  6 of 6 SKIP test unique_my_second_dbt_model_id ................................. [SKIP]

15:45:49  

15:45:49  Finished running 1 table model, 4 data tests, 1 view model in 0 hours 0 minutes and 0.64 seconds (0.64s).

15:45:49  

15:45:49  Completed with 1 error and 0 warnings:

15:45:49  

15:45:49  Failure in test not_null_my_first_dbt_model_id (models/example/schema.yml)

15:45:49    Got 1 result, configured to fail if != 0

15:45:49  

15:45:49    compiled code at target/compiled/dbt_demo/models/example/schema.yml/not_null_my_first_dbt_model_id.sql

15:45:49  

15:45:49  Done. PASS=2 WARN=0 ERROR=1 SKIP=3 TOTAL=6

This is actually a good error. It failed because we haven't made any changes to the starter dbt models yet, but you can see in the logs and in pgAdmin that it created "my_first_dbt_model", which is amazing!

Even though the project isn't passing its tests, the first model was still materialized (and its dependents skipped), and you can now also call "dbt run".

Creating Documentation

It's worth briefly touching on dbt's documentation functionality before getting into models. Docs are easy to overlook, but this is a really valuable feature for data quality and governance; it can help a lot of companies hold off on paid data cataloging and lineage solutions like Stemma and Monte Carlo until they are ready to make the investment. At the very least, it provides a diagnostic tool for developers.

From your dbt folder, you just need to run a few commands:

dbt docs generate

dbt docs serve

This should open automatically in your browser or you can go to http://localhost:8080.

Version Control

This is one of dbt's claims to fame. Specifically, the claim is that dbt enables you to version control the state of your data warehouse. So, I set up a GitHub repository and started committing code. If you are not familiar with version control, you should ask the folks at github.com. Another related claim is CI/CD, which we will return to later, but first we should talk about security.

Security Considerations!

Since this is a demo project, I didn't think much of blindly committing code. However, if you're doing this for work, you should be more careful.

Remember our search for profiles.yml and how we found that "dbt init" handled everything for us? That's when we should have started asking "where are my credentials stored?" Or perhaps dbt should better inform users about what it does with your credentials. To its credit, "dbt init" does obscure your password in its interactive prompts, and it stores the file outside of your project so you don't accidentally commit it to version control or otherwise proliferate it.

However, it's important to note that the password is written to ~/.dbt/profiles.yml in plaintext by default. Protecting access to this file, and being careful about how you read, copy, or view it, is critical. You can instead inject the password via a Jinja expression in profiles.yml that resolves an environment variable. That at least pushes the problem to something more compatible with your company's secrets management solution, though it might leave non-Linux admins a bit lost. The truth is that there isn't really a perfect solution; it depends on what's at stake and what the most likely failure modes of each approach are.
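Here's a sketch of what that looks like (the profile name and values mirror my setup; adjust to yours). dbt's env_var() function reads the variable at runtime, and variables prefixed with DBT_ENV_SECRET_ get the additional benefit of being scrubbed from dbt's logs:

dbt_demo:
  target: dev
  outputs:
    dev:
      type: postgres
      host: localhost
      port: 5432
      user: dbt
      dbname: dbt
      schema: dbt
      threads: 3
      password: "{{ env_var('DBT_ENV_SECRET_PG_PASSWORD') }}"   # exported by your secrets tooling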


Seeding Data

dbt provides the ability to initialize some data from a file. This isn't intended for large data sets, but it is useful for small projects and for kickstarting large ones. I'm following this guide here.

Since we don't have any existing company data, we can go find some. While AWS has curated some great open datasets, they tend to range from several hundred gigabytes up to petabytes in size. Instead, I found one with interesting Spotify data on Kaggle.com.

dbt init will have created a seeds directory for you, where you can place the downloaded and unzipped CSV files. Based on the documentation, dbt may be expecting file names to follow a lowercase, underscored convention, so I have renamed them accordingly.
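If your CSVs need a nudge beyond renaming (quoted headers, explicit column types), seeds can also be configured per file in dbt_project.yml. A sketch for this project might look like the following; the specific columns here are illustrative, not a config I've tested:

seeds:
  dbt_demo:
    +quote_columns: true          # preserve awkward headers like "artist(s)_name"
    tracks:
      +column_types:
        popularity: integer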

Handling Issues In the Data

This section isn't strictly about seeding data with dbt but it's a realistic part of data engineering. If you're using another data set, you can safely skip this section.

My first time running "dbt seed" from the project root was unsuccessful. I received a stacktrace like this:

  File "/Users/chris/Code/installations/dbt/dbt-venv/lib/python3.12/site-packages/dbt_common/clients/system.py", line 174, in load_file_contents

    to_return = handle.read().decode("utf-8")

                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 7250-7251: invalid continuation byte

This doesn't tell me much on its own, but I can assume that dbt failed to parse one of the seed files due to an encoding issue. The stack trace doesn't say which file, and the error happens inside a built-in Python function. Increasing the log level does not improve the message:

chris@Mac-mini dbt_demo % dbt --debug seed

Moving files out of the seed directory and reintroducing them one at a time, I was able to determine the issue was in the popular songs dataset. We can use Python to further investigate:

chris@Mac-mini dbt_demo % python3

Python 3.9.6 (default, Feb  3 2024, 15:58:27) 

[Clang 15.0.0 (clang-1500.3.9.4)] on darwin

Type "help", "copyright", "credits" or "license" for more information.

>>> with open("seeds/popular_songs.csv", "r") as file:

...     file.read()

... 

Traceback (most recent call last):

  File "<stdin>", line 2, in <module>

  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/codecs.py", line 322, in decode

    (result, consumed) = self._buffer_decode(data, self.errors, final)

UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 7250-7251: invalid continuation byte

>>> quit()

So, there's definitely an issue with this file's encoding as far as Python is concerned. VSCode is more forgiving and will open and display the file, but there are garbled characters visible:

I'm not convinced that the file is simply corrupted, because multiple users on Kaggle have used it without issue. Most likely, it's just in a different encoding than the UTF-8 that dbt expects. Since automatically detecting encodings is not an exact science, we are left to take a few educated guesses about what the encoding might be. Since Windows is a common source of file format quirks (encodings, EOL characters, meta characters, etc.), what happens if we ask Python to use "cp1252" instead?

chris@Mac-mini dbt_demo % python3

Python 3.9.6 (default, Feb  3 2024, 15:58:27) 

[Clang 15.0.0 (clang-1500.3.9.4)] on darwin

Type "help", "copyright", "credits" or "license" for more information.

>>> with open("seeds/popular_songs.csv", "r", encoding="cp1252") as file:

...     file.read()

This dumped the full file contents to the terminal instead of an error! dbt's documentation doesn't mention a way to supply an alternate encoding, so we'll have to convert the file ourselves.

chris@Mac-mini dbt_demo % python3

Python 3.9.6 (default, Feb  3 2024, 15:58:27) 

[Clang 15.0.0 (clang-1500.3.9.4)] on darwin

Type "help", "copyright", "credits" or "license" for more information.

>>> lines = []

>>> with open("seeds/popular_songs.csv", "r", encoding="cp1252") as file:

...     lines = file.readlines()

... 

>>> with open("popular_songs_utf.csv", "w") as file:

...     file.writelines(lines)

... 

>>> quit()

Then move the non-UTF file out of the seeds directory and re-run dbt seed:

chris@Mac-mini dbt_demo % ls

README.md logs popular_songs_utf.csv snapshots

analyses macros python target

dbt_project.yml models seeds tests

chris@Mac-mini dbt_demo % mv seeds/popular_songs.csv ../

chris@Mac-mini dbt_demo % dbt seed

17:32:45  Running with dbt=1.8.0

17:32:46  Registered adapter: postgres=1.8.0

17:32:46  Found 2 models, 4 data tests, 4 seeds, 413 macros

17:32:46  

17:32:46  

17:32:46  Finished running  in 0 hours 0 minutes and 0.04 seconds (0.04s).

17:32:46  Encountered an error:

Database Error

  connection to server at "localhost" (::1), port 5432 failed: Connection refused

   Is the server running on that host and accepting TCP/IP connections?

  connection to server at "localhost" (127.0.0.1), port 5432 failed: Connection refused

   Is the server running on that host and accepting TCP/IP connections?

  

Progress!

Getting Seed Data To The Database

Now that that's sorted, we can get back to the error from dbt seed. That error is due to Postgres no longer running. Homebrew users can resume the service:

brew services start postgresql@16
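You can confirm the server is accepting connections again before retrying (pg_isready ships with Postgres; adjust host and port to your setup):

pg_isready -h localhost -p 5432    # should report "accepting connections"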

I also ran into some permissions issues which I resolved through pgAdmin4. There's plenty of info about that on the web so I'll avoid posting anything super secret here for now.

After dealing with permission denied errors, I can see some data start to load:

chris@Mac-mini dbt_demo % dbt seed

17:43:00  Running with dbt=1.8.0

17:43:00  Registered adapter: postgres=1.8.0

17:43:00  Found 2 models, 4 seeds, 4 data tests, 413 macros

17:43:00  

17:43:00  Concurrency: 10 threads (target='dev')

17:43:00  

17:43:00  1 of 4 START seed file dbt.artists ............................................. [RUN]

17:43:00  2 of 4 START seed file dbt.popular_songs_utf ................................... [RUN]

17:43:00  3 of 4 START seed file dbt.top_100 ............................................. [RUN]

17:43:00  4 of 4 START seed file dbt.tracks .............................................. [RUN]

17:43:04  3 of 4 OK loaded seed file dbt.top_100 ......................................... [INSERT 100 in 4.29s]

17:43:08  2 of 4 ERROR loading seed file dbt.popular_songs_utf ........................... [ERROR in 7.73s]

I can also see it in pgAdmin4:

The command hangs, so the issue is unclear. The dataset is not very large, so it should not take this long. I want more specific information, so this time I'll run:

dbt --debug seed --select tracks

This seems to put dbt in a loop, repeatedly attempting inserts to the database:

17:53:36  SQL status: INSERT 0 10000 in 0.0 seconds

17:53:46  Using postgres connection "seed.dbt_demo.tracks"

17:53:46  On seed.dbt_demo.tracks: 

          insert into "dbt"."dbt"."tracks" ("id", "name", "popularity", "duration_ms", "explicit", "artists", "id_artists", "release_date", "danceability", "energy", "key", "loudness", "mode", "speechiness", "acousticness", "instrumentalness", "liveness", "valence", "tempo", "time_signature") values

          (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s),(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s),(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s),(%s,%s,%s,%s,%...

17:53:46  SQL status: INSERT 0 10000 in 0.0 seconds

It looks like dbt is inserting the records in batches of 10,000 rows, one INSERT statement at a time. I can't tell whether it's using the 10 threads I specified, but this is very slow. Since I don't see data in pgAdmin or any "commit" statements, I will assume the state will be rolled back if I don't let dbt finish. It will probably take less time for me to do it myself in Python. I used DuckDB because it's very SQL-friendly, it's fast, and it has lots of convenient integrations like CSV and Postgres. If you want to know more about when to use DuckDB, you can read about it in my other post.

import duckdb

# Install and load DuckDB's Postgres extension so we can ATTACH the target database.
duckdb.sql("INSTALL postgres;")
duckdb.sql("LOAD postgres;")  # recent DuckDB versions may autoload this, but being explicit doesn't hurt

# Attach the same Postgres database that dbt is configured against.
duckdb.sql("ATTACH 'dbname=dbt user=dbt host=<redacted> port=<redacted>' AS dbt (TYPE POSTGRES);")

# Recreate each table directly from its CSV; DuckDB infers the schema from the file.
duckdb.sql("DROP TABLE IF EXISTS dbt.dbt.artists")
duckdb.sql("CREATE TABLE dbt.dbt.artists AS SELECT * FROM 'seeds/artists.csv';")

duckdb.sql("DROP TABLE IF EXISTS dbt.dbt.popular_songs")
duckdb.sql("CREATE TABLE dbt.dbt.popular_songs AS SELECT * FROM 'seeds/popular_songs.csv';")

duckdb.sql("DROP TABLE IF EXISTS dbt.dbt.top_100")
duckdb.sql("CREATE TABLE dbt.dbt.top_100 AS SELECT * FROM 'seeds/top_100.csv';")

duckdb.sql("DROP TABLE IF EXISTS dbt.dbt.tracks")
duckdb.sql("CREATE TABLE dbt.dbt.tracks AS SELECT * FROM 'seeds/tracks.csv';")

Using pgAdmin, we can now confirm that there's data in postgres! 

I will also be moving these files to a new folder simply named data/ to prevent other project members from hitting errors from my invalid seed files. I'll also add the data/ directory to .gitignore since it's large and publicly available (for now). You might want something more reliable in a production setting.

Takeaways on Seed Data with dbt

I think seeding data with dbt has some practical uses. Considering that dbt does not exclude the seeds directory from version control (and they've otherwise been careful about what does and doesn't go into the repo), and that dbt seed is limited to UTF-8 encoded CSV files, I think it's safe to say this feature is best suited to very small tables that fit easily in a git repo and that you or your team owns and maintains.

Data Modeling

The core use case of dbt is modeling data so that you can apply transformations iteratively, with ease. This process helps you implement a medallion architecture that progressively refines data sets for quality, performance, and relevance, so that you can generate accurate and snappy dashboards or build data products.
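The mechanism behind that iterative refinement is ref(): each model selects from the models (or seeds) before it, and dbt derives the dependency graph and build order from those references. A hypothetical two-layer sketch (these are not files in this project):

-- models/staging/stg_tracks.sql
select id, "name", cast(popularity as integer) as popularity
from raw.tracks

-- models/marts/popular_tracks.sql
-- ref() tells dbt this model depends on stg_tracks, so dbt builds them in order.
select "name", popularity
from {{ ref('stg_tracks') }}
where popularity > 80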

Existing Schema

Before getting to the dbt part, we need to look at our current data set.

SELECT table_name, column_name, data_type

FROM information_schema.columns

WHERE table_schema = 'dbt'

ORDER BY table_name, column_name

The schema listing for popular_songs continues beyond what's shown here, and the query results also include the schemas for the top_100, tracks, and my_first_dbt_model tables.

Asking Questions About The Data

At this point, without a goal, there's really nothing to do. So, let's start asking questions about it!

We can start to answer these questions with some queries:

WITH artists_count AS (

SELECT DISTINCT count(a.name) as artist_count, 'artists' AS table_name

FROM dbt.artists AS a

),


popular_songs_count AS (

SELECT DISTINCT count(p."artist(s)_name") as artist_count, 'popular_songs' AS table_name

FROM dbt.popular_songs AS p

),


overlap_count AS (

SELECT DISTINCT count(a.name) as artist_count, 'overlap' AS table_name

FROM dbt.artists AS a

JOIN dbt.popular_songs p

ON a.name = p."artist(s)_name"

)


SELECT *

FROM artists_count

UNION

SELECT *

FROM popular_songs_count

UNION

SELECT *

FROM overlap_count

Output from the query:

Examples of overlapping artists:

Straight away, we see that direct overlap of artist names is limited to just 567. We may need to look for inconsistencies in their names across datasets. We can try to conform both datasets a bit:

overlap_count AS (

SELECT DISTINCT count(a.name) as artist_count, 'overlap' AS table_name

FROM dbt.artists AS a

JOIN dbt.popular_songs p

ON LOWER(a.name) = LOWER(p."artist(s)_name")

)

This gets us closer to the total number of artists (953) in the smaller data set (popular songs). There may be more inconsistencies, but for right now we can mark this as the first of many potential data cleaning steps to perform with dbt.

You may also notice the use of double quotes for the "artist(s)_name" column. This is a mild nuisance that we could eliminate, but we should keep in mind the semantics of the name and recognize its self-describing qualities. The name implies that there might be multiple artists, and since this is a varchar column, they might be stored as comma-separated values, which we can confirm just by looking at the data:

If we look at the other tables, we can see that the "tracks" table supports multiple artists by using an array, which is still represented as a varchar due to our simplistic load script. We can also see that top_100 avoids this problem, either by the coincidence of all 100 tracks having a single artist or by only referring to the primary artist.

Cleaning The Data With SQL

By now, we can see another potential contributor to our mismatched counts of distinct artist names. So how do we clean this up? To get to the individual artist names, we're going to need to convert this text into an actual array of some kind. Depending on the text you're working with and your SQL dialect, you might be able to parse it directly. In our case, an easy solution is to strip the square brackets and quotes, then split the remaining comma-separated string into an array.

select *, string_to_array(regexp_replace(artists, '\[|\]|''|""', '', 'g'), ',') as artists_json

from dbt.tracks

This gives our analysts the ability to use Postgres array functions directly without first trying to parse and convert the column. 
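For example, once the column is a real array, a per-artist track count becomes straightforward with unnest. This is just an illustration of the kind of query it unlocks, not part of the model:

with parsed as (
    select string_to_array(regexp_replace(artists, '\[|\]|''|""', '', 'g'), ',') as artist_list
    from dbt.tracks
)
select trim(a.artist_name) as artist, count(*) as track_count
from parsed
cross join lateral unnest(parsed.artist_list) as a(artist_name)
group by 1
order by track_count desc
limit 10;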

Creating A Model In dbt

We have a lot more work to do, but we can use this to show off creating a new dbt model by adding this SQL query to a file in our project under models/example/tracks_clean.sql (the model name comes from the file name, so it needs to match the schema.yml entry below).

We also need an entry for this model in our models/example/schema.yml file:


version: 2


models:

 - name: my_first_dbt_model

   description: "A starter dbt model"

   columns:

     - name: id

       description: "The primary key for this table"

       data_tests:

         - unique

         - not_null


 - name: my_second_dbt_model

   description: "A starter dbt model"

   columns:

     - name: id

       description: "The primary key for this table"

       data_tests:

         - unique

         - not_null


 - name: tracks_clean

   description: "A more useful version of tracks table"

When we run "dbt build" or "dbt run" now, we'll see the table in Postgres with the data from the query!

Clean Up

Before moving on, I need to clean up some of my work. I've taken the following steps:

Remaining Columns

Now that we have our table and at least one cleaned up column, we can proceed with other columns and questions about the data.

We can update our model to capture these improvements:

{{ config(materialized='table') }}


select id,

   "name",

   cast(popularity as integer),

   duration_ms,

   case when "explicit" = 0 then FALSE

     when "explicit" = 1 then TRUE

     else null

     end as "explicit",

   string_to_array(regexp_replace(artists, '\[|\]|''|""', '', 'g'), ',') as artists_json,

   string_to_array(regexp_replace(id_artists, '\[|\]|''|""', '', 'g'), ',') as id_artists_json,

   to_date(release_date,'YYYY-MM-DD') as release_date,

   danceability,

   energy,

   cast(key as integer),

   loudness,

   mode,

   speechiness,

   acousticness,

   instrumentalness,

   liveness,

   valence,

   tempo as beats_per_minute,

   time_signature

from raw.tracks

Data Quality & Testing

As a musician, I have a lot of new questions about the usefulness of the data:

I will leave those questions for now. Many of them are deeply complex to answer and might be best left for data scientists and analysts to explore. There are also more tables to clean!

Modeling Other Tables

The top 100 table requires similar data type changes for convenience, like the tracks and artists tables:

Although the popular songs table shows some inconsistencies with other tables, those inconsistencies help us answer some questions.

Dealing With Key And Mode

Our original values for key and mode are numeric. While numeric values might be useful for machine learning, they aren't normalized between 0 and 1, so they're not that helpful even there. More importantly, it's not obvious how the values map. Does 0 equal A because most people think alphabetically, or does 0 equal C because that's a common and simple key in music theory? Major and minor keys don't even have an obvious ordering, alphabetical or musical.

If there's a perfect correlation between the categorical and numeric values across tables, we might be able to work with an assumed mapping and confirm it with the data provider later. If there's no correlation at all, we should not attempt to map them.

Exploring the data yields sub-par results. Joining across song title, artist, and release date should provide a sufficient guarantee that we're comparing the same version of the same song. There's a fairly strong correspondence for mode (major maps to 1 and minor to 0), but the result set is just over 100 songs, so the 2 non-conforming rows make up 1-2% of the results, which isn't great. The outcome is even less predictable for key, so perhaps for now it's best to omit these fields from our data set to maintain its trustworthiness.
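As a sketch of the kind of check described above: group the joined rows by both versions of mode and look at the counts. The popular_songs column names here (track_name, "artist(s)_name", "mode") are assumptions based on my copy of the Kaggle file, and I've dropped the release date condition for brevity:

select t."mode" as tracks_mode,
       p."mode" as popular_songs_mode,
       count(*) as songs
from dbt.tracks t
join dbt.popular_songs p
  on lower(t."name") = lower(p.track_name)
 and lower(t.artists) like '%' || lower(p."artist(s)_name") || '%'
group by 1, 2
order by songs desc;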

Testing With Dbt

This is a good time to introduce dbt's testing feature and see how it can help us.

Running Tests

With dbt, we can run tests just by typing:

chris@Mac-mini dbt_demo % dbt test

This will run all of the test cases in our project's test suite and display the results:

chris@Mac-mini dbt_demo % dbt test

18:41:19  Running with dbt=1.8.0

18:41:20  Registered adapter: postgres=1.8.0

18:41:20  Found 6 models, 4 data tests, 413 macros

18:41:20  

18:41:20  Concurrency: 10 threads (target='dev')

18:41:20  

18:41:20  1 of 4 START test not_null_my_first_dbt_model_id ............................... [RUN]

18:41:20  2 of 4 START test not_null_my_second_dbt_model_id .............................. [RUN]

18:41:20  3 of 4 START test unique_my_first_dbt_model_id ................................. [RUN]

18:41:20  4 of 4 START test unique_my_second_dbt_model_id ................................ [RUN]

18:41:20  1 of 4 FAIL 1 not_null_my_first_dbt_model_id ................................... [FAIL 1 in 0.04s]

18:41:20  2 of 4 PASS not_null_my_second_dbt_model_id .................................... [PASS in 0.04s]

18:41:20  4 of 4 PASS unique_my_second_dbt_model_id ...................................... [PASS in 0.04s]

18:41:20  3 of 4 PASS unique_my_first_dbt_model_id ....................................... [PASS in 0.04s]

18:41:20  

18:41:20  Finished running 4 data tests in 0 hours 0 minutes and 0.14 seconds (0.14s).

18:41:20  

18:41:20  Completed with 1 error and 0 warnings:

18:41:20  

18:41:20  Failure in test not_null_my_first_dbt_model_id (models/example/schema.yml)

18:41:20    Got 1 result, configured to fail if != 0

18:41:20  

18:41:20    compiled code at target/compiled/dbt_demo/models/example/schema.yml/not_null_my_first_dbt_model_id.sql

18:41:20  

18:41:20  Done. PASS=3 WARN=0 ERROR=1 SKIP=0 TOTAL=4

From here, we can see that there's a failing test related to the default table my_first_dbt_model. While there are no tests in the "tests/" directory yet, we do have some checks defined in "models/example/schema.yml":

version: 2


models:

 - name: my_first_dbt_model

   description: "A starter dbt model"

   columns:

     - name: id

       description: "The primary key for this table"

       data_tests:

         - unique

         - not_null


 - name: my_second_dbt_model

   description: "A starter dbt model"

   columns:

     - name: id

       description: "The primary key for this table"

       data_tests:

         - unique

         - not_null


 - name: tracks

   description: "A more useful version of raw.tracks table"


 - name: artists

   description: "A more useful version of raw.artists table"


 - name: top_100

   description: "A more useful version of raw.top_100 table"

The table "my_first_dbt_model" is defined in "models/my_first_dbt_model.sql" by a query that generates both valid and invalid records and includes a commented WHERE statement to remove the null id records and make the test pass. Here's an abbreviated version without any comments or config:

with source_data as (


   select 1 as id

   union all

   select null as id


)


select *

from source_data

where id is not null

We can confirm that our test works now:

chris@Mac-mini dbt_demo % dbt run

18:48:43  Running with dbt=1.8.0

18:48:43  Registered adapter: postgres=1.8.0

18:48:43  Found 6 models, 4 data tests, 413 macros

18:48:43  

18:48:43  Concurrency: 10 threads (target='dev')

18:48:43  

18:48:43  1 of 6 START sql table model dbt.artists ....................................... [RUN]

18:48:43  2 of 6 START sql table model dbt.my_first_dbt_model ............................ [RUN]

18:48:43  3 of 6 START sql table model dbt.popular_songs ................................. [RUN]

18:48:43  4 of 6 START sql table model dbt.top_100 ....................................... [RUN]

18:48:43  5 of 6 START sql table model dbt.tracks ........................................ [RUN]

18:48:43  2 of 6 OK created sql table model dbt.my_first_dbt_model ....................... [SELECT 1 in 0.10s]

18:48:43  3 of 6 OK created sql table model dbt.popular_songs ............................ [SELECT 953 in 0.10s]

18:48:43  4 of 6 OK created sql table model dbt.top_100 .................................. [SELECT 100 in 0.09s]

18:48:43  6 of 6 START sql view model dbt.my_second_dbt_model ............................ [RUN]

18:48:43  6 of 6 OK created sql view model dbt.my_second_dbt_model ....................... [CREATE VIEW in 0.03s]

18:48:49  1 of 6 OK created sql table model dbt.artists .................................. [SELECT 1104349 in 6.18s]

18:48:51  5 of 6 OK created sql table model dbt.tracks ................................... [SELECT 586672 in 7.78s]

18:48:51  

18:48:51  Finished running 5 table models, 1 view model in 0 hours 0 minutes and 7.87 seconds (7.87s).

18:48:51  

18:48:51  Completed successfully

18:48:51  

18:48:51  Done. PASS=6 WARN=0 ERROR=0 SKIP=0 TOTAL=6

chris@Mac-mini dbt_demo % dbt test

18:48:53  Running with dbt=1.8.0

18:48:53  Registered adapter: postgres=1.8.0

18:48:53  Found 6 models, 4 data tests, 413 macros

18:48:53  

18:48:53  Concurrency: 10 threads (target='dev')

18:48:53  

18:48:53  1 of 4 START test not_null_my_first_dbt_model_id ............................... [RUN]

18:48:53  2 of 4 START test not_null_my_second_dbt_model_id .............................. [RUN]

18:48:53  3 of 4 START test unique_my_first_dbt_model_id ................................. [RUN]

18:48:53  4 of 4 START test unique_my_second_dbt_model_id ................................ [RUN]

18:48:53  3 of 4 PASS unique_my_first_dbt_model_id ....................................... [PASS in 0.04s]

18:48:53  2 of 4 PASS not_null_my_second_dbt_model_id .................................... [PASS in 0.05s]

18:48:53  1 of 4 PASS not_null_my_first_dbt_model_id ..................................... [PASS in 0.05s]

18:48:53  4 of 4 PASS unique_my_second_dbt_model_id ...................................... [PASS in 0.05s]

18:48:53  

18:48:53  Finished running 4 data tests in 0 hours 0 minutes and 0.12 seconds (0.12s).

18:48:53  

18:48:53  Completed successfully

18:48:53  

18:48:53  Done. PASS=4 WARN=0 ERROR=0 SKIP=0 TOTAL=4

I will leave our table this way for now, since it's just a starter table and we will eventually delete it.

Choosing Test Cases

There are lots of ways to choose or create new test cases. Test cases validate expectations of the data that aren't already enforced by the table in some other way. This is useful when data types don't cut it and when features like constraints either aren't available in your database or would introduce undesirable performance or portability trade-offs.

Usually, requirements analysis, data exploration, and defect resolution provide obvious test case scenarios. For now, we might need to pick a few at random.

Creating A Test Case

Earlier we noticed that the more subjective columns like acousticness and valence may all be percentages. We can enforce that expectation with tests.

I'll add some code to schema.yml to capture this expectation based on another test I found online:

 - name: popular_songs

   description: "A more useful version of raw.popular_songs table"

   columns:

     - name: valence_percent

       tests:

         - dbt_utils.accepted_range:

             min_value: 0

             max_value: 100

             inclusive: true

This doesn't work though:

chris@Mac-mini dbt_demo % dbt test

19:01:25  Running with dbt=1.8.0

19:01:25  Registered adapter: postgres=1.8.0

19:01:25  [WARNING]: Deprecated functionality

The `tests` config has been renamed to `data_tests`. Please see

https://docs.getdbt.com/docs/build/data-tests#new-data_tests-syntax for more

information.

19:01:25  Found 6 models, 5 data tests, 413 macros

19:01:25  

19:01:25  Concurrency: 10 threads (target='dev')

19:01:25  

19:01:25  1 of 5 START test dbt_utils_accepted_range_popular_songs_valence_percent__True__100__0  [RUN]

19:01:25  2 of 5 START test not_null_my_first_dbt_model_id ............................... [RUN]

19:01:25  3 of 5 START test not_null_my_second_dbt_model_id .............................. [RUN]

19:01:25  4 of 5 START test unique_my_first_dbt_model_id ................................. [RUN]

19:01:25  5 of 5 START test unique_my_second_dbt_model_id ................................ [RUN]

19:01:26  1 of 5 ERROR dbt_utils_accepted_range_popular_songs_valence_percent__True__100__0  [ERROR in 0.03s]

19:01:26  2 of 5 PASS not_null_my_first_dbt_model_id ..................................... [PASS in 0.04s]

19:01:26  3 of 5 PASS not_null_my_second_dbt_model_id .................................... [PASS in 0.04s]

19:01:26  4 of 5 PASS unique_my_first_dbt_model_id ....................................... [PASS in 0.04s]

19:01:26  5 of 5 PASS unique_my_second_dbt_model_id ...................................... [PASS in 0.05s]

19:01:26  

19:01:26  Finished running 5 data tests in 0 hours 0 minutes and 0.13 seconds (0.13s).

19:01:26  

19:01:26  Completed with 1 error and 0 warnings:

19:01:26  

19:01:26    Compilation Error in test dbt_utils_accepted_range_popular_songs_valence_percent__True__100__0 (models/example/schema.yml)

  'dbt_utils' is undefined. This can happen when calling a macro that does not exist. Check for typos and/or install package dependencies with "dbt deps".

19:01:26  

19:01:26  Done. PASS=4 WARN=0 ERROR=1 SKIP=0 TOTAL=5

This turns out to be due to the dbt package "dbt_utils" not being installed.

Installing dbt Packages

dbt has its own package manager and some interesting extension packages for dbt-core. To use it, you need to create a new config file in the project root:

chris@Mac-mini dbt_demo % touch packages.yml 

Then add your dependencies to it:

packages:

 - package: dbt-labs/dbt_utils

   version: 1.1.1

Now we tell dbt to update the project with the new config:

chris@Mac-mini dbt_demo % dbt deps

19:04:40  Running with dbt=1.8.0

19:04:40  Updating lock file in file path: /Users/chris/Code/dbt_demo/package-lock.yml

19:04:40  Installing dbt-labs/dbt_utils

19:04:40  Installed from version 1.1.1

19:04:40  Up to date!

And now our tests can run:

chris@Mac-mini dbt_demo % dbt test

19:04:45  Running with dbt=1.8.0

19:04:45  Registered adapter: postgres=1.8.0

19:04:45  Unable to do partial parsing because a project dependency has been added

19:04:46  [WARNING]: Deprecated functionality

The `tests` config has been renamed to `data_tests`. Please see

https://docs.getdbt.com/docs/build/data-tests#new-data_tests-syntax for more

information.

19:04:46  Found 6 models, 5 data tests, 527 macros

19:04:46  

19:04:46  Concurrency: 10 threads (target='dev')

19:04:46  

19:04:46  1 of 5 START test dbt_utils_accepted_range_popular_songs_valence_percent__True__100__0  [RUN]

19:04:46  2 of 5 START test not_null_my_first_dbt_model_id ............................... [RUN]

19:04:46  3 of 5 START test not_null_my_second_dbt_model_id .............................. [RUN]

19:04:46  4 of 5 START test unique_my_first_dbt_model_id ................................. [RUN]

19:04:46  5 of 5 START test unique_my_second_dbt_model_id ................................ [RUN]

19:04:46  2 of 5 PASS not_null_my_first_dbt_model_id ..................................... [PASS in 0.04s]

19:04:46  1 of 5 PASS dbt_utils_accepted_range_popular_songs_valence_percent__True__100__0  [PASS in 0.05s]

19:04:46  5 of 5 PASS unique_my_second_dbt_model_id ...................................... [PASS in 0.04s]

19:04:46  3 of 5 PASS not_null_my_second_dbt_model_id .................................... [PASS in 0.04s]

19:04:46  4 of 5 PASS unique_my_first_dbt_model_id ....................................... [PASS in 0.04s]

19:04:46  

19:04:46  Finished running 5 data tests in 0 hours 0 minutes and 0.13 seconds (0.13s).

19:04:46  

19:04:46  Completed successfully

19:04:46  

19:04:46  Done. PASS=5 WARN=0 ERROR=0 SKIP=0 TOTAL=5

Test Wrap Up

We wrote new data tests, confirmed an expectation of our data that will now be continually validated, and installed a new package along the way. There's a lot left to test, and new packages to explore like dbt-expectations, which was inspired by the popular data testing tool Great Expectations.
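Adding it follows the same packages.yml pattern we used for dbt_utils. The coordinates below are what's listed on hub.getdbt.com as I write this; check there for the current version before copying:

packages:
  - package: dbt-labs/dbt_utils
    version: 1.1.1
  - package: calogica/dbt_expectations
    version: [">=0.10.0", "<0.11.0"]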

Stay Tuned!

That's all for now but I'll be sure to have more soon. Send me messages and tell me what I'm doing right or wrong!

Here's some topics I want to get into:

Did you find this useful?