Starting With dbt

Started: May 17, 2024
Last Updated: June 5, 2024

Why I'm Interested In dbt

I'm writing about dbt because it's everywhere. One job description after another lists "dbt" as a requirement. I see what it does on paper: version control, CI/CD, testing, docs, dependency management. I've pushed for all of these things as a software engineer and dbt claims to do this for analysts. As a data engineer, I have seen that it can be challenging to manage data quality and lineage as your company builds table on top of table on top of table in medallion architecture, so maybe there is something here. 

But, I have sooo many questions:


My goal in this article (series?) will be to answer those questions, answer any questions you might have, and basically get up to speed while documenting my journey.

What is dbt?

dbt (short for "data build tool") helps analysts work like engineers by making common data engineering tasks easy to manage. There's lots of information out there on dbt's site and on YouTube. In the talk I link below, Carly Kaufman attributes the coining of the phrase "analytics engineering" to dbt's philosophy of enabling analysts to work like engineers. Specifically, dbt aims to provide better version control and CI/CD, testing and documentation, and dependency management. It's also increasingly common in job descriptions for data engineers. There are some huge clients on their site: BHP, Condé Nast, Domain, HubSpot, jetBlue, Nasdaq, Vestas, Sunrun, code42, and McDonald's Nordics.

How dbt Positions Themselves

This image from dbt's site helps position it within a set of tools commonly used by its customers. It helps people see which integrations exist and which tools dbt does not replace.

How I Think About dbt

I want to add more color for the data engineers and analysts. The core flow for dbt is to write some SELECT queries and YAML configs, check them into version control, and let a scheduler run dbt, which wraps and executes your queries as DDL/DML in a database.
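To make that flow concrete, here's a hypothetical sketch (the model name and source table are made up, not part of this project): a model is just a file containing a SELECT, and dbt wraps it in the DDL implied by its configured materialization.

-- models/staging/stg_orders.sql  (hypothetical example)
select
    id,
    customer_id,
    amount
from raw.orders
where amount is not null

-- With a "table" materialization, dbt executes something roughly like:
--   create table analytics.stg_orders as (<the select above>);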

Another Great Resource

If you're in a "listen to talks and sip coffee while writing code" kind of mood, there's a great video in which Carly Kaufman from dbt Labs (formerly Fishtown Analytics) does an amazing job explaining the motivations, use cases, and core concepts.

Installation

Finding the Right Guide

At first, I chose to follow dbt's getting started guide for GCP (because I also want to learn more about GCP, and it's becoming ever more popular for ML work). There are several other guides as well, for Azure Synapse, Databricks, Microsoft Fabric, Redshift, Snowflake, and Starburst. These aren't the only data platforms compatible with dbt, though; there are also trusted adapters (Spark, Athena, Glue, Dremio, Postgres) and community adapters (ClickHouse, DuckDB, Hive, MySQL, Upsolver).

When I saw that the first steps were to make an account, I was pretty sure I took a wrong turn somewhere. And yes, that wrong step was not realizing that "dbt cloud" is, of course, dbt Labs' SaaS offering while "dbt core" is an open source tool. In fact, there's a whole page on their site about it. Silly me.

Now then, dbt's site has steered me to a guide for using dbt core with GitHub Codespaces, but they have others for pip and Docker users as well as installation from source. The Codespaces approach might save me 10 minutes, but I don't always use GitHub for work. The other three options are readily available on a local laptop and might give me more flexibility for my repo layout.

Installing dbt

Proceeding with the pip guide, there are only two major steps:

Following the guide, you can just copy-paste the instructions, but if you want my take on their approach, keep reading. Please keep in mind that these code snippets are specific to my machine and goals.

Virtual Environment

The dbt guide recommends creating a virtual environment to install dbt, and I agree. They also suggest configuring a global alias to activate the environment; I think this second step is more a matter of preference. You'll need to run "env_dbt" every time you open a terminal before you can use dbt, and if you end up with more dbt installations for some reason, you'll need to create separate aliases (or just use "source" directly or in a script).

python -m venv dbt-env           # create the environment

vim ~/.bash_profile              # Edit your shell profile

Add an alias to the file:

alias env_dbt='source /Users/Chris/Code/CloudDemo/dbt-env/bin/activate'

Type ":wq" to save and quit.

source !$      # Update shell with new profile

env_dbt        # Activate the environment

Later on, I found that this makes organizing your project a little weird; the environment directory should probably live entirely outside your project. If you don't keep them separate, it's not a big deal, but make sure you update your .gitignore file to exclude the venv.

Adapter

The dbt guide shows how to install dbt-core and an adapter, dbt-postgres, together. I'll stick with Postgres too, for now. It's familiar and lightweight and lets me explore dbt right away without dealing with GCP just yet.

Since dbt-postgres depends on dbt-core, one command installs both. Assuming that you still have the dbt-env environment activated, this will put your dbt packages in dbt-env/lib/python<version>/site-packages/ instead of your system Python's lib path.

python -m pip install dbt-core dbt-postgres   # Install dbt with pg adapter
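As a quick sanity check that the install landed inside the venv rather than your system Python (output will vary by machine):

which dbt        # should point into dbt-env/bin/
dbt --version    # prints the installed dbt-core and adapter versions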

Connecting Datasources

Installing Postgres

Before we can connect, we must make sure Postgres is installed and that the server is running. The Postgres download site shares a link to Postgres.app, which might be easier to manage for non-admin readers. I will be using a previously installed and initialized server, and I'm using pgAdmin as a diagnostic aid.

Connecting With dbt

I'm using the dbt postgres connection guide if you want to follow along. Since we're using dbt-core, not dbt-cloud, we need to create and update dbt's config file, profiles.yml. However, there's more setup required for our dbt project first, and creating profiles.yml now could interfere with that. Not to mention, I have no idea where it goes yet.

Initializing Your dbt Project

Initializing a dbt project is seemingly easy: just type dbt init and follow the interactive prompts to configure your project. You can't name it "dbt" and you cannot use hyphens "-", which is a bit annoying. It will also generate a "logs" directory; if dbt init ran successfully, you can delete it. Future dbt commands need to be run from within the generated directory and will also produce log directories that you should list in your .gitignore. Also, if you configure a connection profile (e.g. postgres), the credentials will be stored in "~/.dbt/profiles.yml". The security considerations may or may not be important depending on your company's threat model.
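If you're assembling your own .gitignore, a minimal sketch might look like this (the dbt-env/ entry assumes you kept the venv inside the repo; recent starter projects already ignore some of the dbt-generated directories, so check before duplicating entries):

dbt-env/        # the virtual environment, if it lives in the repo
logs/           # log directories produced by dbt commands
target/         # compiled SQL and run artifacts
dbt_packages/   # packages installed by "dbt deps"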

Logging

Now is a decent time to mention dbt's logging. Whenever you run dbt, it will log to files. You can change the default log level or override it with each command, and you can also change the log location. If you are inside a dbt project, logs go to the project's log directory, usually <project root>/logs/; outside of a dbt project, they go to the current working directory (i.e. '.'). You can learn more here: https://docs.getdbt.com/reference/global-configs/logs
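For example, both the level and the location can be overridden per invocation with the global flags documented on that page:

dbt --log-level debug run           # more verbose logging for this run only
dbt --log-path /tmp/dbt_logs run    # write logs somewhere other than ./logs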

Testing The Connection

Ultimately, we want to be able to build, test, and run our dbt project.

It's time to "dbt run", right? No, it's not:

(dbt-env) [Chris@/Users/Chris/Code/CloudDemo]$ dbt build

15:45:35  Running with dbt=1.8.0

15:45:35  Encountered an error:

Runtime Error

  No dbt_project.yml found at expected path /Users/Chris/Code/CloudDemo/dbt_project.yml

  Verify that each entry within packages.yml (and their transitive dependencies) contains a file named dbt_project.yml


It's time to "dbt build"! Almost:

(dbt-env) [Chris@/Users/Chris/Code/CloudDemo/dbt_demo]$ dbt build

15:45:46  Running with dbt=1.8.0

15:45:47  Registered adapter: postgres=1.8.0

15:45:47  Unable to do partial parsing because saved manifest not found. Starting full parse.

15:45:48  Found 2 models, 4 data tests, 413 macros

15:45:48  

15:45:49  Concurrency: 3 threads (target='dev')

15:45:49  

15:45:49  1 of 6 START sql table model dbt.my_first_dbt_model ............................ [RUN]

15:45:49  1 of 6 OK created sql table model dbt.my_first_dbt_model ....................... [SELECT 2 in 0.25s]

15:45:49  2 of 6 START test not_null_my_first_dbt_model_id ............................... [RUN]

15:45:49  3 of 6 START test unique_my_first_dbt_model_id ................................. [RUN]

15:45:49  2 of 6 FAIL 1 not_null_my_first_dbt_model_id ................................... [FAIL 1 in 0.08s]

15:45:49  3 of 6 PASS unique_my_first_dbt_model_id ....................................... [PASS in 0.08s]

15:45:49  4 of 6 SKIP relation dbt.my_second_dbt_model ................................... [SKIP]

15:45:49  5 of 6 SKIP test not_null_my_second_dbt_model_id ............................... [SKIP]

15:45:49  6 of 6 SKIP test unique_my_second_dbt_model_id ................................. [SKIP]

15:45:49  

15:45:49  Finished running 1 table model, 4 data tests, 1 view model in 0 hours 0 minutes and 0.64 seconds (0.64s).

15:45:49  

15:45:49  Completed with 1 error and 0 warnings:

15:45:49  

15:45:49  Failure in test not_null_my_first_dbt_model_id (models/example/schema.yml)

15:45:49    Got 1 result, configured to fail if != 0

15:45:49  

15:45:49    compiled code at target/compiled/dbt_demo/models/example/schema.yml/not_null_my_first_dbt_model_id.sql

15:45:49  

15:45:49  Done. PASS=2 WARN=0 ERROR=1 SKIP=3 TOTAL=6

This is actually a good error. It failed because we haven't made any changes to the starter dbt models yet, but you can see in the logs and in pgAdmin that it created "my_first_dbt_model", which is amazing!

Even though the project isn't passing its tests, the first model was still materialized (and its dependents skipped), and you can now also call "dbt run".

Creating Documentation

It's worth briefly touching on dbt's documentation functionality before getting into models. Docs are easy to overlook, but this is a really valuable feature for data quality and governance; it can help a lot of companies hold off on paid data cataloging and lineage solutions like Stemma and Monte Carlo until they are ready to make the investment. At the very least, it provides a diagnostic tool for developers.

From your dbt folder, you just need to run a few commands:

dbt docs generate

dbt docs serve

This should open automatically in your browser or you can go to http://localhost:8080.

Version Control

This is one of dbt's claims to fame. Specifically, the claim is that dbt enables you to version control the state of your data warehouse. So, I set up a GitHub repository and started committing code. If you are not familiar with version control, you should ask the folks at github.com. Another related claim is CI/CD, which we will return to later, but first we should talk about security.

Security Considerations!

Since this is a demo project, I didn't think much of blindly committing code. However, if you're doing this for work, you should be more careful.

Remember our search for profiles.yml and how we found that "dbt init" handled everything for us? That's when we should have started asking "where are my credentials stored?" Or perhaps dbt should better inform users about what it does with your credentials. To its credit, "dbt init" does obscure your password in its interactive prompts, and it stores the file outside of your project so you don't accidentally commit it to version control or otherwise proliferate it.

However, it's important to note that the password is written to ~/.dbt/profiles.yml in plaintext by default. Protecting access to this file, and being careful about how you read, copy, or view it, is critical. You can instead inject the password via a Jinja expression in profiles.yml that resolves an environment variable. That at least pushes the problem to something more compatible with your company's secrets management solution, though it might leave non-Linux admins a bit lost. The truth is that there isn't really a perfect solution; it depends on what's at stake and what the most likely failure modes of each approach are.
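Here's a sketch of what that looks like (the profile name and values mirror my setup; adjust to yours). dbt's env_var() function reads the variable at runtime, and variables prefixed with DBT_ENV_SECRET_ get the additional benefit of being scrubbed from dbt's logs:

dbt_demo:
  target: dev
  outputs:
    dev:
      type: postgres
      host: localhost
      port: 5432
      user: dbt
      dbname: dbt
      schema: dbt
      threads: 3
      password: "{{ env_var('DBT_ENV_SECRET_PG_PASSWORD') }}"   # exported by your secrets tooling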


Seeding Data

dbt provides the ability to initialize some data from a file. This isn't intended for large data sets, but it is useful for small projects and for kickstarting large ones. I'm following this guide here.

Since we don't have any existing company data, we can go find some. While AWS has curated some great open datasets, they tend to range from several hundred gigabytes up to petabytes in size. Instead, I found one with interesting Spotify data on Kaggle.com.

dbt init will have created a seeds directory for you, where you can place the downloaded and unzipped CSV files. Based on the documentation, dbt may be expecting file names to follow a lowercase, underscored convention, so I have renamed them accordingly.
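If your CSVs need a nudge beyond renaming (quoted headers, explicit column types), seeds can also be configured per file in dbt_project.yml. A sketch for this project might look like the following; the specific columns here are illustrative, not a config I've tested:

seeds:
  dbt_demo:
    +quote_columns: true          # preserve awkward headers like "artist(s)_name"
    tracks:
      +column_types:
        popularity: integer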

Handling Issues In the Data

This section isn't strictly about seeding data with dbt but it's a realistic part of data engineering. If you're using another data set, you can safely skip this section.

My first time running "dbt seed" from the project root was unsuccessful. I received a stacktrace like this:

  File "/Users/chris/Code/installations/dbt/dbt-venv/lib/python3.12/site-packages/dbt_common/clients/system.py", line 174, in load_file_contents

    to_return = handle.read().decode("utf-8")

                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 7250-7251: invalid continuation byte

This doesn't tell me much on its own, but I can assume that dbt failed to parse one of the seed files due to an encoding issue. The stack trace doesn't say which file, and the error happens inside a built-in Python function. Increasing the log level does not improve the message:

chris@Mac-mini dbt_demo % dbt --debug seed

Moving files out of the seed directory and reintroducing them one at a time, I was able to determine the issue was in the popular songs dataset. We can use Python to further investigate:

chris@Mac-mini dbt_demo % python3

Python 3.9.6 (default, Feb  3 2024, 15:58:27) 

[Clang 15.0.0 (clang-1500.3.9.4)] on darwin

Type "help", "copyright", "credits" or "license" for more information.

>>> with open("seeds/popular_songs.csv", "r") as file:

...     file.read()

... 

Traceback (most recent call last):

  File "<stdin>", line 2, in <module>

  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/codecs.py", line 322, in decode

    (result, consumed) = self._buffer_decode(data, self.errors, final)

UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 7250-7251: invalid continuation byte

>>> quit()

So, there's definitely an issue with this file's encoding as far as Python is concerned. VSCode is more forgiving and will open and display the file, but there are garbled characters visible:

I'm not convinced that the file is simply corrupted, because multiple users on Kaggle have used it without issue. Most likely, it's just in a different encoding than the UTF-8 that dbt expects. Since automatically detecting encodings is not an exact science, we are left to take a few educated guesses about what the encoding might be. Since Windows is a common source of file format quirks (encodings, EOL characters, meta characters, etc.), what happens if we ask Python to use "cp1252" instead?

chris@Mac-mini dbt_demo % python3

Python 3.9.6 (default, Feb  3 2024, 15:58:27) 

[Clang 15.0.0 (clang-1500.3.9.4)] on darwin

Type "help", "copyright", "credits" or "license" for more information.

>>> with open("seeds/popular_songs.csv", "r", encoding="cp1252") as file:

...     file.read()

This dumped the full file contents to the terminal instead of an error! dbt's documentation doesn't mention a way to supply an alternate encoding, so we'll have to convert the file ourselves.

chris@Mac-mini dbt_demo % python3

Python 3.9.6 (default, Feb  3 2024, 15:58:27) 

[Clang 15.0.0 (clang-1500.3.9.4)] on darwin

Type "help", "copyright", "credits" or "license" for more information.

>>> lines = []

>>> with open("seeds/popular_songs.csv", "r", encoding="cp1252") as file:

...     lines = file.readlines()

... 

>>> with open("popular_songs_utf.csv", "w") as file:

...     file.writelines(lines)

... 

>>> quit()

Then move the non-UTF file out of the seeds directory and re-run dbt seed:

chris@Mac-mini dbt_demo % ls

README.md logs popular_songs_utf.csv snapshots

analyses macros python target

dbt_project.yml models seeds tests

chris@Mac-mini dbt_demo % mv seeds/popular_songs.csv ../

chris@Mac-mini dbt_demo % dbt seed

17:32:45  Running with dbt=1.8.0

17:32:46  Registered adapter: postgres=1.8.0

17:32:46  Found 2 models, 4 data tests, 4 seeds, 413 macros

17:32:46  

17:32:46  

17:32:46  Finished running  in 0 hours 0 minutes and 0.04 seconds (0.04s).

17:32:46  Encountered an error:

Database Error

  connection to server at "localhost" (::1), port 5432 failed: Connection refused

   Is the server running on that host and accepting TCP/IP connections?

  connection to server at "localhost" (127.0.0.1), port 5432 failed: Connection refused

   Is the server running on that host and accepting TCP/IP connections?

  

Progress!

Getting Seed Data To The Database

Now that that's sorted, we can get back to the error from dbt seed. That error is due to Postgres no longer running. Homebrew users can resume the service:

brew services start postgresql@16
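You can confirm the server is accepting connections again before retrying (pg_isready ships with Postgres; adjust host and port to your setup):

pg_isready -h localhost -p 5432    # should report "accepting connections"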

I also ran into some permissions issues which I resolved through pgAdmin4. There's plenty of info about that on the web so I'll avoid posting anything super secret here for now.

After dealing with permission denied errors, I can see some data start to load:

chris@Mac-mini dbt_demo % dbt seed

17:43:00  Running with dbt=1.8.0

17:43:00  Registered adapter: postgres=1.8.0

17:43:00  Found 2 models, 4 seeds, 4 data tests, 413 macros

17:43:00  

17:43:00  Concurrency: 10 threads (target='dev')

17:43:00  

17:43:00  1 of 4 START seed file dbt.artists ............................................. [RUN]

17:43:00  2 of 4 START seed file dbt.popular_songs_utf ................................... [RUN]

17:43:00  3 of 4 START seed file dbt.top_100 ............................................. [RUN]

17:43:00  4 of 4 START seed file dbt.tracks .............................................. [RUN]

17:43:04  3 of 4 OK loaded seed file dbt.top_100 ......................................... [INSERT 100 in 4.29s]

17:43:08  2 of 4 ERROR loading seed file dbt.popular_songs_utf ........................... [ERROR in 7.73s]

I can also see it in pgAdmin4:

The command hangs, so the issue is unclear. The dataset is not very large, so it should not take this long. I want more specific information, so this time I'll run:

dbt --debug seed --select tracks

This seems to put dbt in a loop, repeatedly attempting inserts to the database:

17:53:36  SQL status: INSERT 0 10000 in 0.0 seconds

17:53:46  Using postgres connection "seed.dbt_demo.tracks"

17:53:46  On seed.dbt_demo.tracks: 

          insert into "dbt"."dbt"."tracks" ("id", "name", "popularity", "duration_ms", "explicit", "artists", "id_artists", "release_date", "danceability", "energy", "key", "loudness", "mode", "speechiness", "acousticness", "instrumentalness", "liveness", "valence", "tempo", "time_signature") values

          (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s),(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s),(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s),(%s,%s,%s,%s,%...

17:53:46  SQL status: INSERT 0 10000 in 0.0 seconds

It looks like dbt is inserting the records in batches of 10,000 rows, one INSERT statement at a time. I can't tell whether it's using the 10 threads I specified, but this is very slow. Since I don't see data in pgAdmin or any "commit" statements, I will assume the state will be rolled back if I don't let dbt finish. It will probably take less time for me to do it myself in Python. I used DuckDB because it's very SQL-friendly, it's fast, and it has lots of convenient integrations like CSV and Postgres. If you want to know more about when to use DuckDB, you can read about it in my other post.

import duckdb

# Install and load DuckDB's Postgres extension so we can ATTACH the target database.
duckdb.sql("INSTALL postgres;")
duckdb.sql("LOAD postgres;")  # recent DuckDB versions may autoload this, but being explicit doesn't hurt

# Attach the same Postgres database that dbt is configured against.
duckdb.sql("ATTACH 'dbname=dbt user=dbt host=<redacted> port=<redacted>' AS dbt (TYPE POSTGRES);")

# Recreate each table directly from its CSV; DuckDB infers the schema from the file.
duckdb.sql("DROP TABLE IF EXISTS dbt.dbt.artists")
duckdb.sql("CREATE TABLE dbt.dbt.artists AS SELECT * FROM 'seeds/artists.csv';")

duckdb.sql("DROP TABLE IF EXISTS dbt.dbt.popular_songs")
duckdb.sql("CREATE TABLE dbt.dbt.popular_songs AS SELECT * FROM 'seeds/popular_songs.csv';")

duckdb.sql("DROP TABLE IF EXISTS dbt.dbt.top_100")
duckdb.sql("CREATE TABLE dbt.dbt.top_100 AS SELECT * FROM 'seeds/top_100.csv';")

duckdb.sql("DROP TABLE IF EXISTS dbt.dbt.tracks")
duckdb.sql("CREATE TABLE dbt.dbt.tracks AS SELECT * FROM 'seeds/tracks.csv';")

Using pgAdmin, we can now confirm that there's data in postgres! 

I will also be moving these files to a new folder simply named data/ to prevent other project members from hitting errors from my invalid seed files. I'll also add the data/ directory to .gitignore since it's large and publicly available (for now). You might want something more reliable in a production setting.

Takeaways on Seed Data with dbt

I think seeding data with dbt has some practical uses. Considering that dbt does not exclude the seeds directory from version control (and they've otherwise been careful about what does and doesn't go into the repo), and that dbt seed is limited to UTF-8 encoded CSV files, I think it's safe to say this feature is best suited to very small tables that fit easily in a git repo and that you or your team owns and maintains.

Data Modeling

The core use case of dbt is modeling data so that you can apply transformations iteratively, with ease. This process helps you implement a medallion architecture that progressively refines data sets for quality, performance, and relevance, so that you can generate accurate and snappy dashboards or build data products.
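The mechanism behind that iterative refinement is ref(): each model selects from the models (or seeds) before it, and dbt derives the dependency graph and build order from those references. A hypothetical two-layer sketch (these are not files in this project):

-- models/staging/stg_tracks.sql
select id, "name", cast(popularity as integer) as popularity
from raw.tracks

-- models/marts/popular_tracks.sql
-- ref() tells dbt this model depends on stg_tracks, so dbt builds them in order.
select "name", popularity
from {{ ref('stg_tracks') }}
where popularity > 80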

Existing Schema

Before getting to the dbt part, we need to look at our current data set.

SELECT table_name, column_name, data_type

FROM information_schema.columns

WHERE table_schema = 'dbt'

ORDER BY table_name, column_name

The schema listing for popular_songs continues beyond what's shown here, and the query results also include the schemas for the top_100, tracks, and my_first_dbt_model tables.

Asking Questions About The Data

At this point, without a goal, there's really nothing to do. So, let's start asking questions about it!

We can start to answer these questions with some queries:

WITH artists_count AS (

SELECT DISTINCT count(a.name) as artist_count, 'artists' AS table_name

FROM dbt.artists AS a

),


popular_songs_count AS (

SELECT DISTINCT count(p."artist(s)_name") as artist_count, 'popular_songs' AS table_name

FROM dbt.popular_songs AS p

),


overlap_count AS (

SELECT DISTINCT count(a.name) as artist_count, 'overlap' AS table_name

FROM dbt.artists AS a

JOIN dbt.popular_songs p

ON a.name = p."artist(s)_name"

)


SELECT *

FROM artists_count

UNION

SELECT *

FROM popular_songs_count

UNION

SELECT *

FROM overlap_count

Output from the query:

Examples of overlapping artists:

Straight away, we see that direct overlap of artist names is limited to just 567. We may need to look for inconsistencies in their names across datasets. We can try to conform both datasets a bit:

overlap_count AS (

SELECT DISTINCT count(a.name) as artist_count, 'overlap' AS table_name

FROM dbt.artists AS a

JOIN dbt.popular_songs p

ON LOWER(a.name) = LOWER(p."artist(s)_name")

)

This gets us closer to the total number of artists (953) in the smaller data set (popular songs). There may be more inconsistencies, but for right now we can mark this as the first of many potential data cleaning steps to perform with dbt.

You may also notice the use of double quotes for the "artist(s)_name" column. This is a mild nuisance that we could eliminate, but we should keep in mind the semantics of the name and recognize its self-describing qualities. The name implies that there might be multiple artists, and since this is a varchar column, they might be stored as comma-separated values, which we can confirm just by looking at the data:

If we look at the other tables, we can see that the "tracks" table supports multiple artists by using an array, which is still represented as a varchar due to our simplistic load script. We can also see that top_100 avoids this problem, either by the coincidence of all 100 tracks having a single artist or by only referring to the primary artist.

Cleaning The Data With SQL

By now, we can see another potential contributor to our mismatched counts of distinct artist names. So how do we clean this up? To get to the individual artist names, we're going to need to convert this text into an actual array of some kind. Depending on the text you're working with and your SQL dialect, you might be able to parse it directly. In our case, an easy solution is to strip the square brackets and quotes, then split the remaining comma-separated string into an array.

select *, string_to_array(regexp_replace(artists, '\[|\]|''|""', '', 'g'), ',') as artists_json

from dbt.tracks

This gives our analysts the ability to use Postgres array functions directly without first trying to parse and convert the column. 
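For example, once the column is a real array, a per-artist track count becomes straightforward with unnest. This is just an illustration of the kind of query it unlocks, not part of the model:

with parsed as (
    select string_to_array(regexp_replace(artists, '\[|\]|''|""', '', 'g'), ',') as artist_list
    from dbt.tracks
)
select trim(a.artist_name) as artist, count(*) as track_count
from parsed
cross join lateral unnest(parsed.artist_list) as a(artist_name)
group by 1
order by track_count desc
limit 10;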

Creating A Model In dbt

We have a lot more work to do, but we can use this to show off creating a new dbt model by adding this SQL query to a file in our project under models/example/tracks_clean.sql (the model name comes from the file name, so it needs to match the schema.yml entry below).

We also need an entry for this model in our models/example/schema.yml file:


version: 2


models:

 - name: my_first_dbt_model

   description: "A starter dbt model"

   columns:

     - name: id

       description: "The primary key for this table"

       data_tests:

         - unique

         - not_null


 - name: my_second_dbt_model

   description: "A starter dbt model"

   columns:

     - name: id

       description: "The primary key for this table"

       data_tests:

         - unique

         - not_null


 - name: tracks_clean

   description: "A more useful version of tracks table"

When we run "dbt build" or "dbt run" now, we'll see the table in Postgres with the data from the query!

Clean Up

Before moving on, I need to clean up some of my work. I've taken the following steps:

Remaining Columns

Now that we have our table and at least one cleaned up column, we can proceed with other columns and questions about the data.

We can update our model to capture these improvements:

{{ config(materialized='table') }}


select id,

   "name",

   cast(popularity as integer),

   duration_ms,

   case when "explicit" = 0 then FALSE

     when "explicit" = 1 then TRUE

     else null

     end as "explicit",

   string_to_array(regexp_replace(artists, '\[|\]|''|""', '', 'g'), ',') as artists_json,

   string_to_array(regexp_replace(id_artists, '\[|\]|''|""', '', 'g'), ',') as id_artists_json,

   to_date(release_date,'YYYY-MM-DD') as release_date,

   danceability,

   energy,

   cast(key as integer),

   loudness,

   mode,

   speechiness,

   acousticness,

   instrumentalness,

   liveness,

   valence,

   tempo as beats_per_minute,

   time_signature

from raw.tracks

Data Quality & Testing

As a musician, I have a lot of new questions about the usefulness of the data:

I will leave those questions for now. Many of them are deeply complex to answer and might be best left for data scientists and analysts to explore. There are also more tables to clean!

Modeling Other Tables

The top 100 table requires similar data type changes for convenience, like the tracks and artists tables:

Although the popular songs table shows some inconsistencies with other tables, those inconsistencies help us answer some questions.

Dealing With Key And Mode

Our original values for key and mode are numeric. While numeric values might be useful for machine learning, they aren't normalized between 0 and 1, so they're not that helpful even there. More importantly, it's not obvious how the values map. Does 0 equal A because most people think alphabetically, or does 0 equal C because that's a common and simple key in music theory? Major and minor keys don't even have an obvious ordering, alphabetical or musical.

If there's a perfect correlation between the categorical and numeric values across tables, we might be able to work with an assumed mapping and confirm it with the data provider later. If there's no correlation at all, we should not attempt to map them.

Exploring the data yields sub-par results. Joining across song title, artist, and release date should provide a sufficient guarantee that we're comparing the same version of the same song. There's a fairly strong correspondence for mode (major maps to 1 and minor to 0), but the result set is just over 100 songs, so the 2 non-conforming rows make up 1-2% of the results, which isn't great. The outcome is even less predictable for key, so perhaps for now it's best to omit these fields from our data set to maintain its trustworthiness.
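As a sketch of the kind of check described above: group the joined rows by both versions of mode and look at the counts. The popular_songs column names here (track_name, "artist(s)_name", "mode") are assumptions based on my copy of the Kaggle file, and I've dropped the release date condition for brevity:

select t."mode" as tracks_mode,
       p."mode" as popular_songs_mode,
       count(*) as songs
from dbt.tracks t
join dbt.popular_songs p
  on lower(t."name") = lower(p.track_name)
 and lower(t.artists) like '%' || lower(p."artist(s)_name") || '%'
group by 1, 2
order by songs desc;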

Testing With Dbt

This is a good time to introduce dbt's testing feature and see how it can help us.

Running Tests

With dbt, we can run tests just by typing:

chris@Mac-mini dbt_demo % dbt test

This will run all of the test cases in our project's test suite and display the results:

chris@Mac-mini dbt_demo % dbt test

18:41:19  Running with dbt=1.8.0

18:41:20  Registered adapter: postgres=1.8.0

18:41:20  Found 6 models, 4 data tests, 413 macros

18:41:20  

18:41:20  Concurrency: 10 threads (target='dev')

18:41:20  

18:41:20  1 of 4 START test not_null_my_first_dbt_model_id ............................... [RUN]

18:41:20  2 of 4 START test not_null_my_second_dbt_model_id .............................. [RUN]

18:41:20  3 of 4 START test unique_my_first_dbt_model_id ................................. [RUN]

18:41:20  4 of 4 START test unique_my_second_dbt_model_id ................................ [RUN]

18:41:20  1 of 4 FAIL 1 not_null_my_first_dbt_model_id ................................... [FAIL 1 in 0.04s]

18:41:20  2 of 4 PASS not_null_my_second_dbt_model_id .................................... [PASS in 0.04s]

18:41:20  4 of 4 PASS unique_my_second_dbt_model_id ...................................... [PASS in 0.04s]

18:41:20  3 of 4 PASS unique_my_first_dbt_model_id ....................................... [PASS in 0.04s]

18:41:20  

18:41:20  Finished running 4 data tests in 0 hours 0 minutes and 0.14 seconds (0.14s).

18:41:20  

18:41:20  Completed with 1 error and 0 warnings:

18:41:20  

18:41:20  Failure in test not_null_my_first_dbt_model_id (models/example/schema.yml)

18:41:20    Got 1 result, configured to fail if != 0

18:41:20  

18:41:20    compiled code at target/compiled/dbt_demo/models/example/schema.yml/not_null_my_first_dbt_model_id.sql

18:41:20  

18:41:20  Done. PASS=3 WARN=0 ERROR=1 SKIP=0 TOTAL=4

From here, we can see that there's a failing test related to the default table my_first_dbt_model. While there are no tests in the "tests/" directory yet, we do have some checks defined in "models/example/schema.yml":

version: 2


models:

 - name: my_first_dbt_model

   description: "A starter dbt model"

   columns:

     - name: id

       description: "The primary key for this table"

       data_tests:

         - unique

         - not_null


 - name: my_second_dbt_model

   description: "A starter dbt model"

   columns:

     - name: id

       description: "The primary key for this table"

       data_tests:

         - unique

         - not_null


 - name: tracks

   description: "A more useful version of raw.tracks table"


 - name: artists

   description: "A more useful version of raw.artists table"


 - name: top_100

   description: "A more useful version of raw.top_100 table"

The table "my_first_dbt_model" is defined in "models/my_first_dbt_model.sql" by a query that generates both valid and invalid records and includes a commented WHERE statement to remove the null id records and make the test pass. Here's an abbreviated version without any comments or config:

with source_data as (


   select 1 as id

   union all

   select null as id


)


select *

from source_data

where id is not null

We can confirm that our test works now:

chris@Mac-mini dbt_demo % dbt run

18:48:43  Running with dbt=1.8.0

18:48:43  Registered adapter: postgres=1.8.0

18:48:43  Found 6 models, 4 data tests, 413 macros

18:48:43  

18:48:43  Concurrency: 10 threads (target='dev')

18:48:43  

18:48:43  1 of 6 START sql table model dbt.artists ....................................... [RUN]

18:48:43  2 of 6 START sql table model dbt.my_first_dbt_model ............................ [RUN]

18:48:43  3 of 6 START sql table model dbt.popular_songs ................................. [RUN]

18:48:43  4 of 6 START sql table model dbt.top_100 ....................................... [RUN]

18:48:43  5 of 6 START sql table model dbt.tracks ........................................ [RUN]

18:48:43  2 of 6 OK created sql table model dbt.my_first_dbt_model ....................... [SELECT 1 in 0.10s]

18:48:43  3 of 6 OK created sql table model dbt.popular_songs ............................ [SELECT 953 in 0.10s]

18:48:43  4 of 6 OK created sql table model dbt.top_100 .................................. [SELECT 100 in 0.09s]

18:48:43  6 of 6 START sql view model dbt.my_second_dbt_model ............................ [RUN]

18:48:43  6 of 6 OK created sql view model dbt.my_second_dbt_model ....................... [CREATE VIEW in 0.03s]

18:48:49  1 of 6 OK created sql table model dbt.artists .................................. [SELECT 1104349 in 6.18s]

18:48:51  5 of 6 OK created sql table model dbt.tracks ................................... [SELECT 586672 in 7.78s]

18:48:51  

18:48:51  Finished running 5 table models, 1 view model in 0 hours 0 minutes and 7.87 seconds (7.87s).

18:48:51  

18:48:51  Completed successfully

18:48:51  

18:48:51  Done. PASS=6 WARN=0 ERROR=0 SKIP=0 TOTAL=6

chris@Mac-mini dbt_demo % dbt test

18:48:53  Running with dbt=1.8.0

18:48:53  Registered adapter: postgres=1.8.0

18:48:53  Found 6 models, 4 data tests, 413 macros

18:48:53  

18:48:53  Concurrency: 10 threads (target='dev')

18:48:53  

18:48:53  1 of 4 START test not_null_my_first_dbt_model_id ............................... [RUN]

18:48:53  2 of 4 START test not_null_my_second_dbt_model_id .............................. [RUN]

18:48:53  3 of 4 START test unique_my_first_dbt_model_id ................................. [RUN]

18:48:53  4 of 4 START test unique_my_second_dbt_model_id ................................ [RUN]

18:48:53  3 of 4 PASS unique_my_first_dbt_model_id ....................................... [PASS in 0.04s]

18:48:53  2 of 4 PASS not_null_my_second_dbt_model_id .................................... [PASS in 0.05s]

18:48:53  1 of 4 PASS not_null_my_first_dbt_model_id ..................................... [PASS in 0.05s]

18:48:53  4 of 4 PASS unique_my_second_dbt_model_id ...................................... [PASS in 0.05s]

18:48:53  

18:48:53  Finished running 4 data tests in 0 hours 0 minutes and 0.12 seconds (0.12s).

18:48:53  

18:48:53  Completed successfully

18:48:53  

18:48:53  Done. PASS=4 WARN=0 ERROR=0 SKIP=0 TOTAL=4

I will leave our table this way for now, since it's just a starter table and we will eventually delete it.

Choosing Test Cases

There are lots of ways to choose or create new test cases. Test cases validate expectations of the data that aren't already enforced by the table in some other way. This is useful when data types don't cut it and when features like constraints either aren't available in your database or would introduce undesirable performance or portability trade-offs.

Usually, requirements analysis, data exploration, and defect resolution provide obvious test case scenarios. For now, we might need to pick a few at random.

Creating A Test Case

Earlier we noticed that the more subjective columns like acousticness and valence may all be percentages. We can enforce that expectation with tests.

I'll add some code to schema.yml to capture this expectation based on another test I found online:

 - name: popular_songs

   description: "A more useful version of raw.popular_songs table"

   columns:

     - name: valence_percent

       tests:

         - dbt_utils.accepted_range:

             min_value: 0

             max_value: 100

             inclusive: true

This doesn't work though:

chris@Mac-mini dbt_demo % dbt test

19:01:25  Running with dbt=1.8.0

19:01:25  Registered adapter: postgres=1.8.0

19:01:25  [WARNING]: Deprecated functionality

The `tests` config has been renamed to `data_tests`. Please see

https://docs.getdbt.com/docs/build/data-tests#new-data_tests-syntax for more

information.

19:01:25  Found 6 models, 5 data tests, 413 macros

19:01:25  

19:01:25  Concurrency: 10 threads (target='dev')

19:01:25  

19:01:25  1 of 5 START test dbt_utils_accepted_range_popular_songs_valence_percent__True__100__0  [RUN]

19:01:25  2 of 5 START test not_null_my_first_dbt_model_id ............................... [RUN]

19:01:25  3 of 5 START test not_null_my_second_dbt_model_id .............................. [RUN]

19:01:25  4 of 5 START test unique_my_first_dbt_model_id ................................. [RUN]

19:01:25  5 of 5 START test unique_my_second_dbt_model_id ................................ [RUN]

19:01:26  1 of 5 ERROR dbt_utils_accepted_range_popular_songs_valence_percent__True__100__0  [ERROR in 0.03s]

19:01:26  2 of 5 PASS not_null_my_first_dbt_model_id ..................................... [PASS in 0.04s]

19:01:26  3 of 5 PASS not_null_my_second_dbt_model_id .................................... [PASS in 0.04s]

19:01:26  4 of 5 PASS unique_my_first_dbt_model_id ....................................... [PASS in 0.04s]

19:01:26  5 of 5 PASS unique_my_second_dbt_model_id ...................................... [PASS in 0.05s]

19:01:26  

19:01:26  Finished running 5 data tests in 0 hours 0 minutes and 0.13 seconds (0.13s).

19:01:26  

19:01:26  Completed with 1 error and 0 warnings:

19:01:26  

19:01:26    Compilation Error in test dbt_utils_accepted_range_popular_songs_valence_percent__True__100__0 (models/example/schema.yml)

  'dbt_utils' is undefined. This can happen when calling a macro that does not exist. Check for typos and/or install package dependencies with "dbt deps".

19:01:26  

19:01:26  Done. PASS=4 WARN=0 ERROR=1 SKIP=0 TOTAL=5

This turns out to be due to the dbt package "dbt_utils" not being installed.

Installing dbt Packages

dbt has its own package manager and some interesting extension packages for dbt-core. To use it, you need to create a new config file in the project root:

chris@Mac-mini dbt_demo % touch packages.yml 

Then add your dependencies to it:

packages:

 - package: dbt-labs/dbt_utils

   version: 1.1.1

Now we tell dbt to update the project with the new config:

chris@Mac-mini dbt_demo % dbt deps

19:04:40  Running with dbt=1.8.0

19:04:40  Updating lock file in file path: /Users/chris/Code/dbt_demo/package-lock.yml

19:04:40  Installing dbt-labs/dbt_utils

19:04:40  Installed from version 1.1.1

19:04:40  Up to date!

And now our tests can run:

chris@Mac-mini dbt_demo % dbt test

19:04:45  Running with dbt=1.8.0

19:04:45  Registered adapter: postgres=1.8.0

19:04:45  Unable to do partial parsing because a project dependency has been added

19:04:46  [WARNING]: Deprecated functionality

The `tests` config has been renamed to `data_tests`. Please see

https://docs.getdbt.com/docs/build/data-tests#new-data_tests-syntax for more

information.

19:04:46  Found 6 models, 5 data tests, 527 macros

19:04:46  

19:04:46  Concurrency: 10 threads (target='dev')

19:04:46  

19:04:46  1 of 5 START test dbt_utils_accepted_range_popular_songs_valence_percent__True__100__0  [RUN]

19:04:46  2 of 5 START test not_null_my_first_dbt_model_id ............................... [RUN]

19:04:46  3 of 5 START test not_null_my_second_dbt_model_id .............................. [RUN]

19:04:46  4 of 5 START test unique_my_first_dbt_model_id ................................. [RUN]

19:04:46  5 of 5 START test unique_my_second_dbt_model_id ................................ [RUN]

19:04:46  2 of 5 PASS not_null_my_first_dbt_model_id ..................................... [PASS in 0.04s]

19:04:46  1 of 5 PASS dbt_utils_accepted_range_popular_songs_valence_percent__True__100__0  [PASS in 0.05s]

19:04:46  5 of 5 PASS unique_my_second_dbt_model_id ...................................... [PASS in 0.04s]

19:04:46  3 of 5 PASS not_null_my_second_dbt_model_id .................................... [PASS in 0.04s]

19:04:46  4 of 5 PASS unique_my_first_dbt_model_id ....................................... [PASS in 0.04s]

19:04:46  

19:04:46  Finished running 5 data tests in 0 hours 0 minutes and 0.13 seconds (0.13s).

19:04:46  

19:04:46  Completed successfully

19:04:46  

19:04:46  Done. PASS=5 WARN=0 ERROR=0 SKIP=0 TOTAL=5

Test Wrap Up

We wrote new data tests, confirmed an expectation of our data that will now be continually validated, and installed a new package along the way. There's a lot left to test, and new packages to explore like dbt-expectations, which was inspired by the popular data testing tool Great Expectations.
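Adding it follows the same packages.yml pattern we used for dbt_utils. The coordinates below are what's listed on hub.getdbt.com as I write this; check there for the current version before copying:

packages:
  - package: dbt-labs/dbt_utils
    version: 1.1.1
  - package: calogica/dbt_expectations
    version: [">=0.10.0", "<0.11.0"]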

Stay Tuned!

That's all for now but I'll be sure to have more soon. Send me messages and tell me what I'm doing right or wrong!

Here's some topics I want to get into:

Did you find this useful?