Starting With dbt
Started: May 17, 2024
Last Updated: June 5, 2024
Why I'm Interested In dbt
I'm writing about dbt because it's everywhere. One job description after another lists "dbt" as a requirement. I see what it does on paper: version control, CI/CD, testing, docs, dependency management. I've pushed for all of these things as a software engineer and dbt claims to do this for analysts. As a data engineer, I have seen that it can be challenging to manage data quality and lineage as your company builds table on top of table on top of table in medallion architecture, so maybe there is something here.
But, I have sooo many questions:
Do I actually need dbt to do these things? And if not, does it make it substantively better?
How does this CLI tool that I run locally process data at scale? Or, if it delegates to another engine, how does it know that it works?
How does dbt handle fundamental gaps & challenges with software testing like concisely capturing expectations without doubling the work of the developer or its data and distributed systems challenges of realistic test data or performance & integration testing?
Does dbt enable a more balanced distribution of responsibility around data quality that frees up data engineers to do more impactful work?
Why do so many data engineering job descriptions look for dbt if it's intended for analysts and non-engineers?
My goal in this article (series?) will be to answer those questions, answer any questions you might have, and basically get up to speed while documenting my journey.
What is dbt?
Data Build Tool helps analysts work like engineers to enable make common data engineering tasks easy to manage. There's lots of information out there on dbt's site and on Youtube. In that video, Carly Kaufman attributes the coining of the phrase "analytics engineering" to dbt's philosophy of enabling analysts to work like engineers. Specifically, dbt aims to provide better version control and CI/CD, testing and documentation, and dependency management. It's also increasingly common for job descriptions for data engineers. There's some huge clients on their site: BHP, Condé Nast, Domain, HubSpot, jetBlue, Nasdaq, Vestas, Sunrun, code42, and McDonald's Nordics.
How dbt Positions Themselves
This photo from dbt's site helps position it within a set of tools commonly used by their customers. It helps people know what integrations exist and what tools it does not replace.
How I Think About dbt
I want to add more color for the data engineers and analysts. The core flow for dbt is to write some SELECT queries and YAML configs, check it into VCS, and let a scheduler run dbt to wrap and execute your queries as DDL/DML in a database.
Another Great Resource
If you're in a "listen to talks and sip coffee while writing code", there's a great video featuring Carly Kaufman from dbt Labs (formerly Fishtown Analytics) does an amazing job explaining the motivations, use cases, and core concepts.
Installation
Finding the Right Guide
At first, I chose to follow dbt's getting started guide for GCP (because I also want to learn more about GCP and it's becoming ever more popular for ML work). There are several other guides as well for Azure Synapse, Databricks, Microsoft Fabric, Redshift, Snowflake, and Starburst. These aren't the only data platforms that are compatible with dbt though, there's trusted adapters (Spark, Athena, Glue, Dremio, Postgres) and community adapters (ClickHouse, DuckDB, Hive, MySQL, Upsolver) too.
When I saw that the first steps were to make an account, I was pretty sure I took a wrong turn somewhere. And yes, that wrong step was not realizing that "dbt cloud" is of course, dbt Lab's SaaS offering while "dbt core" is an open source tool. In fact, there's a whole page on their site about it. Silly me.
Now then, dbt's site has steered me to a guide for using dbt core with GitHub Codespaces but they have another for pip and Docker users as well as source installations. The Codespaces approach might save me 10 minutes but I don't always use Github for work. All three other options are readily available on a local laptop though and might give me more flexibility for my repo layout.
Installing dbt
Proceeding with the pip guide, there's only two major steps:
Setup a virtual environment (recommended)
Install a dbt adapter for your project
Following the guide, you can just copy paste instructions but if you want my take on their approach, keep reading. Please keep in mind that these code snippets are specific to my machine and goals.
Virtual Environment
The dbt guide recommends creating a virtual environment to install dbt and I agree. They do also suggest configuring a global alias to activate the environment. I think this second step is more a matter of preference. You'll need to run "env_dbt" every time you open a terminal before you can use dbt so if you do end up with more dbt installations for some reason, you will need to create separate aliases (or just use "source" directly or in a script).
python -m venv dbt-env # create the environment
vim ~.bash_profile
Add an alias to the file:
alias env_dbt='source /Users/Chris/Code/CloudDemo/dbt-env/bin/activate'
Type ":wq" to save and quit.
source !$ # Update shell with new profile
env_dbt # Activate the environment
Later on I find that organizing your project is a little weird. It seems like this environment directory should probably be separate from your project entirely. If you don't, it's not a big deal but you might want to make sure you update your .gitignore file to exclude the venv.
Adapter
The dbt guide shows how to install dbt-core and an adapter dbt-postgres together. I'll stick with Postgres too, for now. It's familiar and lightweight and lets me explore dbt right away without dealing with GCP just yet.
Since dbt-postgres depends on dbt-core, one command installs both. Assuming that you still have the dbt-env environment activated, this will put your dbt packages in dbt-env/lib/python<version>/site-packages/ instead of your system Python's lib path.
python -m pip install dbt-core dbt-postgres # Install dbt with pg adapter
Connecting Datasources
Installing Postgres
Before we can connect, we must make sure Postgres is installed and that the server is running. The postgres download site shares a link to Postgres.app which might be easier to manage for non-admin readers. I will be using a previously installed and initialized server. I'm also using pgAdmin as a diagnostic aide.
Connecting With dbt
I'm using the dbt postgres connection guide if you want to follow along. Since we're using dbt-core, not dbt-cloud, we need to create and update dbt's config file profiles.yml. However, there's more setup for our dbt project required and creating profiles.yml now will potentially interfere with that. Not to mention, I have no idea where it goes.
Initializing Your dbt Project
Initializing a dbt project is seemingly easy. Just type dbt init and follow the interactive shell to configure your project. You can't name it "dbt" and you cannot use hyphens "-" which is a bit annoying. You will also generate a "logs" directory. If dbt init ran successfully, you can delete this log directory. Future dbt commands need to be run within the generated directory and will also produce log directories that you should list in your .gitignore. Also, if you configure a connection profile (e.g. postgres) the credentials will be stored in "~/.dbt/profiles.yml". The security considerations may or may not be important depending on your company's threat model.
Logging
Now is a decent time to mention dbt's logging. Whenever you run dbt, it will log to files. You can change the default log level or override it with each command. You can also change the log location. Typically, if you are in a dbt project, logs go to the project's log directory which is usually <project root>/logs/ but outside of a dbt project, it's the current working directory (ie '.'). You can learn more here: https://docs.getdbt.com/reference/global-configs/logs
Testing The Connection
Ultimately, we want to be able to build, test, and run our dbt project.
It's time to "dbt run", right? No, it's not:
(dbt-env) [Chris@/Users/Chris/Code/CloudDemo]$ dbt build
15:45:35 Running with dbt=1.8.0
15:45:35 Encountered an error:
Runtime Error
No dbt_project.yml found at expected path /Users/Chris/Code/CloudDemo/dbt_project.yml
Verify that each entry within packages.yml (and their transitive dependencies) contains a file named dbt_project.yml
It's time to "dbt build"! Almost:
(dbt-env) [Chris@/Users/Chris/Code/CloudDemo/dbt_demo]$ dbt build
15:45:46 Running with dbt=1.8.0
15:45:47 Registered adapter: postgres=1.8.0
15:45:47 Unable to do partial parsing because saved manifest not found. Starting full parse.
15:45:48 Found 2 models, 4 data tests, 413 macros
15:45:48
15:45:49 Concurrency: 3 threads (target='dev')
15:45:49
15:45:49 1 of 6 START sql table model dbt.my_first_dbt_model ............................ [RUN]
15:45:49 1 of 6 OK created sql table model dbt.my_first_dbt_model ....................... [SELECT 2 in 0.25s]
15:45:49 2 of 6 START test not_null_my_first_dbt_model_id ............................... [RUN]
15:45:49 3 of 6 START test unique_my_first_dbt_model_id ................................. [RUN]
15:45:49 2 of 6 FAIL 1 not_null_my_first_dbt_model_id ................................... [FAIL 1 in 0.08s]
15:45:49 3 of 6 PASS unique_my_first_dbt_model_id ....................................... [PASS in 0.08s]
15:45:49 4 of 6 SKIP relation dbt.my_second_dbt_model ................................... [SKIP]
15:45:49 5 of 6 SKIP test not_null_my_second_dbt_model_id ............................... [SKIP]
15:45:49 6 of 6 SKIP test unique_my_second_dbt_model_id ................................. [SKIP]
15:45:49
15:45:49 Finished running 1 table model, 4 data tests, 1 view model in 0 hours 0 minutes and 0.64 seconds (0.64s).
15:45:49
15:45:49 Completed with 1 error and 0 warnings:
15:45:49
15:45:49 Failure in test not_null_my_first_dbt_model_id (models/example/schema.yml)
15:45:49 Got 1 result, configured to fail if != 0
15:45:49
15:45:49 compiled code at target/compiled/dbt_demo/models/example/schema.yml/not_null_my_first_dbt_model_id.sql
15:45:49
15:45:49 Done. PASS=2 WARN=0 ERROR=1 SKIP=3 TOTAL=6
This is actually a good error. It failed because we haven't made any changes to our dbt models but you can see in the logs and in pgAdmin that it's created "my_first_dbt_model" which is amazing!
While your project is not passing tests, the build still succeeds and you can now also call "dbt run".
Creating Documentation
It's worth briefly touching on dbt's documentation functionality before getting into models because it's easy to overlook docs but this is a really valuable feature for data quality and governance that can help a lot of companies hold out over paid data cataloging and lineage solutions like Stemma and Monte Carlo until they are ready to make the investment. At the very least, this provides a diagnostic tool for developers.
From your dbt folder, you just need to run a few commands:
dbt docs generate
dbt docs serve
This should open automatically in your browser or you can go to http://localhost:8080.
Version Control
This is one of dbt's claims to fame. Specifically, the claim is that dbt enables you to version control the state of your data warehouse. So, I set up a Github repository and started committing code. If you are not familiar with version control, you should ask the folks at github.com. Another related claim is CI/CD to which we will return later but first we should talk about security.
Security Considerations!
Since this is a demo project, I didn't think much of blindly committing code. However, if you're doing this for work, you should be more careful.
Remember our search for profiles.yml and how we found that "dbt init" handled everything for us? That's when we should have started asking "where are my credentials stored?" Or perhaps, dbt should better inform users about what they do with your credentials. While "dbt init" does obscure your password in its interactive shell script and does store it outside of your project so you don't accidentally commit it to version control or otherwise proliferate it.
However, it's important to note that it is written to ~/.dbt/profiles.yml in plaintext by default. Protecting access to this file and being careful about how you read, copy, or view it is critical. You can inject the password via a Jinja template expression within profiles.yml that is resolved through environment variables. That can at least push the problem to something more compatible with your company's secrets management solution. This might leave non-Linux admins a bit lost though. The truth is that there isn't really a perfect solution and it depends on what's at stake and what are the most likely problems with each solution.
Seeding Data
Dbt provides the ability to initialize some data from a file. This isn't intended for large data sets but it is useful for small projects and for kickstarting large projects. I'm following this guide here.
Since we don't have any existing data from a company, we can go find some. While AWS has curated some great open datasets, they tend to be several hundred gigabytes up to petabytes in size. Instead, I found one with interesting Spotify data on Kaggle.com.
Dbt init will have created a seeds directory for you where you can place the downloaded and unzipped CSV files. Based on the documentation, dbt may be expecating the file names to fit a lowercase and underscored convention so I have renamed them accordingly.
Handling Issues In the Data
This section isn't strictly about seeding data with dbt but it's a realistic part of data engineering. If you're using another data set, you can safely skip this section.
My first time running "dbt seed" from the project root was unsuccessful. I received a stacktrace like this:
File "/Users/chris/Code/installations/dbt/dbt-venv/lib/python3.12/site-packages/dbt_common/clients/system.py", line 174, in load_file_contents
to_return = handle.read().decode("utf-8")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 7250-7251: invalid continuation byte
This doesn't tell me what's going on but I can assume that dbt fails to parse one of the seed data files due to an encoding issue. The stack trace doesn't say what file and the error happens while using a built-in Python function. Increasing the log level does not improve the message:
chris@Mac-mini dbt_demo % dbt --debug seed
Moving files out of the seed directory and reintroducing them one at a time, I was able to determine the issue was in the popular songs dataset. We can use Python to further investigate:
chris@Mac-mini dbt_demo % python3
Python 3.9.6 (default, Feb 3 2024, 15:58:27)
[Clang 15.0.0 (clang-1500.3.9.4)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> with open("seeds/popular_songs.csv", "r") as file:
... file.read()
...
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 7250-7251: invalid continuation byte
>>> quit()
So, there's definitely an issue with the encoding in this file for Python. VSCode is more forgiving and will open and display the file but there's missing characters present:
I'm not convinced that the file is simply corrupted because multiple users on Kaggle have used it without issue. Most likely, it's just a different encoding that dbt supports. Since automatically detecting encodings is not an exact science, we are left to take a few educated guesses about what the encoding might be. Since Windows is a common reason to introduce issues with file format (encodings, EOL characters, meta characters, etc.), what happens if we ask Python to use "cp1252" instead?
chris@Mac-mini dbt_demo % python3
Python 3.9.6 (default, Feb 3 2024, 15:58:27)
[Clang 15.0.0 (clang-1500.3.9.4)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> with open("seeds/popular_songs.csv", "r", encoding="cp1252") as file:
... file.read()
This dumped the full file contents to the terminal screen instead of an error! Dbt's documentation doesn't share how we can supply an alternate encoding so we'll have to try converting it ourselves.
chris@Mac-mini dbt_demo % python3
Python 3.9.6 (default, Feb 3 2024, 15:58:27)
[Clang 15.0.0 (clang-1500.3.9.4)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> lines = []
>>> with open("seeds/popular_songs.csv", "r", encoding="cp1252") as file:
... lines = file.readlines()
...
>>> with open("popular_songs_utf.csv", "w") as file:
... file.writelines(lines)
...
>>> quit()
Then move the non-UTF file out of the seeds directory and re-run dbt seed:
chris@Mac-mini dbt_demo % ls
README.md logs popular_songs_utf.csv snapshots
analyses macros python target
dbt_project.yml models seeds tests
chris@Mac-mini dbt_demo % mv seeds/popular_songs.csv ../
chris@Mac-mini dbt_demo % dbt seed
17:32:45 Running with dbt=1.8.0
17:32:46 Registered adapter: postgres=1.8.0
17:32:46 Found 2 models, 4 data tests, 4 seeds, 413 macros
17:32:46
17:32:46
17:32:46 Finished running in 0 hours 0 minutes and 0.04 seconds (0.04s).
17:32:46 Encountered an error:
Database Error
connection to server at "localhost" (::1), port 5432 failed: Connection refused
Is the server running on that host and accepting TCP/IP connections?
connection to server at "localhost" (127.0.0.1), port 5432 failed: Connection refused
Is the server running on that host and accepting TCP/IP connections?
Progress!
Getting Seed Data To The Database
Now that that is over, we can get to this error from dbt seed. That error is due to Postgres no longer running. Homebrew users can resume the service:
brew services start postgresql@16
I also ran into some permissions issues which I resolved through pgAdmin4. There's plenty of info about that on the web so I'll avoid posting anything super secret here for now.
After dealing with permission denied errors, I can see some data start to load:
chris@Mac-mini dbt_demo % dbt seed
17:43:00 Running with dbt=1.8.0
17:43:00 Registered adapter: postgres=1.8.0
17:43:00 Found 2 models, 4 seeds, 4 data tests, 413 macros
17:43:00
17:43:00 Concurrency: 10 threads (target='dev')
17:43:00
17:43:00 1 of 4 START seed file dbt.artists ............................................. [RUN]
17:43:00 2 of 4 START seed file dbt.popular_songs_utf ................................... [RUN]
17:43:00 3 of 4 START seed file dbt.top_100 ............................................. [RUN]
17:43:00 4 of 4 START seed file dbt.tracks .............................................. [RUN]
17:43:04 3 of 4 OK loaded seed file dbt.top_100 ......................................... [INSERT 100 in 4.29s]
17:43:08 2 of 4 ERROR loading seed file dbt.popular_songs_utf ........................... [ERROR in 7.73s]
I can also see it in pgAdmin4:
The commands hangs, so the issue is unclear. The dataset is not very large, so it should not take a long time. I want more specific information so this time I'll run:
dbt --debug seed --select tracks
This seems to put dbt in a loop, repeatedly attempting inserts to the database:
17:53:36 SQL status: INSERT 0 10000 in 0.0 seconds
17:53:46 Using postgres connection "seed.dbt_demo.tracks"
17:53:46 On seed.dbt_demo.tracks:
insert into "dbt"."dbt"."tracks" ("id", "name", "popularity", "duration_ms", "explicit", "artists", "id_artists", "release_date", "danceability", "energy", "key", "loudness", "mode", "speechiness", "acousticness", "instrumentalness", "liveness", "valence", "tempo", "time_signature") values
(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s),(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s),(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s),(%s,%s,%s,%s,%...
17:53:46 SQL status: INSERT 0 10000 in 0.0 seconds
It looks like dbt is attempting to insert records one by one. I can't tell if it's using the 10 threads I specified or not but this is very slow. Since I don't see data in pgAdmin or any "commit" statements, I will assume that the state will be rolled-back if I don't let dbt finish. It will probably take less time for me to do it myself in Python. I used duckdb because it's very SQL-friendly, it's fast, and it has lots of convenient integrations like CSV and Postgres. If you want to know more about when to use DuckDB, you can read about it on my other post.
import duckdb
duckdb.sql("INSTALL postgres;")
duckdb.sql("ATTACH 'dbname=dbt user=dbt host=<redacted> port=<redacted>' AS dbt (TYPE POSTGRES);")
duckdb.sql("DROP TABLE IF EXISTS dbt.dbt.artists")
duckdb.sql("CREATE TABLE dbt.dbt.artists AS SELECT * FROM 'seeds/artists.csv';")
duckdb.sql("DROP TABLE IF EXISTS dbt.dbt.popular_songs")
duckdb.sql("CREATE TABLE dbt.dbt.popular_songs AS SELECT * FROM 'seeds/popular_songs.csv';")
duckdb.sql("DROP TABLE IF EXISTS dbt.dbt.top_100")
duckdb.sql("CREATE TABLE dbt.dbt.top_100 AS SELECT * FROM 'seeds/top_100.csv';")
duckdb.sql("DROP TABLE IF EXISTS dbt.dbt.tracks")
duckdb.sql("CREATE TABLE dbt.dbt.tracks AS SELECT * FROM 'seeds/tracks.csv';")
Using pgAdmin, we can now confirm that there's data in postgres!
I will also be moving these files to a new folder simply named data/ to prevent other project members from experiencing errors from my invalid seed files. I'll also be including the data/ directory in .gitignore since it's large and publicly available (for now). You might want something more reliable in a production setting.
Takeaways on Seed Data with dbt
I think seeding data with dbt has some practical usages. Considering that they do not exlude the seed directory and they've been careful about what does and doesn't go into version control and that dbt seed is limited to UTF-8 encoded CSV files, I think it's safe to say this is useful for very small tables that fit easily in a git repo that you or your team owns and maintains itself.
Data Modeling
The core use case of dbt is modeling data so that you can apply transformations iteratively with ease. This process helps you implement a medallion architecture to progressively refine data sets for quality, performance, and relevance so that you can generate accurate and snappy dashboards or build products.
Existing Schema
Before getting to the dbt part, we need to look at our current data set.
SELECT table_name, column_name, data_type
FROM information_schema.columns
WHERE table_schema = 'dbt'
ORDER BY table_name, column_name
The schema for popular_songs continues and the query results also include schema for the top_100, tracks, and my_first_dbt_model tables.
Asking Questions About The Data
At this point, without a goal, there's really nothing to do. So, let's start asking questions about it!
Are all artists equally represented across the tables?
Will we have data for an artist in one but not another?
Are there names and id's consistent across tables?
We can start to answer these questions with some queries:
WITH artists_count AS (
SELECT DISTINCT count(a.name) as artist_count, 'artists' AS table_name
FROM dbt.artists AS a
),
popular_songs_count AS (
SELECT DISTINCT count(p."artist(s)_name") as artist_count, 'popular_songs' AS table_name
FROM dbt.popular_songs AS p
),
overlap_count AS (
SELECT DISTINCT count(a.name) as artist_count, 'overlap' AS table_name
FROM dbt.artists AS a
JOIN dbt.popular_songs p
ON a.name = p."artist(s)_name"
)
SELECT *
FROM artists_count
UNION
SELECT *
FROM popular_songs_count
UNION
SELECT *
FROM overlap_count
Output from query on the left:
Examples of overlapping artists:
Straight away, we see that direct overlap of artist names is limited to just 567. We may need to look for inconsistencies in their names across datasets. We can try to conform both datasets a bit:
overlap_count AS (
SELECT DISTINCT count(a.name) as artist_count, 'overlap' AS table_name
FROM dbt.artists AS a
JOIN dbt.popular_songs p
ON LOWER(a.name) = LOWER(p."artist(s)_name")
)
This gets us closer to the total artists (953) in the smaller data set (popular songs). There may be more inconsistencies but for right now, we can mark this as the first of many potential data cleaning steps to perform with dbt.
You may also notice the use of double quotes for the "artist(s)_name" column. This is a mild nuisance that we could eliminate but we should keep in mind the semantics of the name and recognize its self-describing qualities. The column implies that there might be multiple artists and since this is a varchar column, they might be stored as comma-separated values which we can prove just by looking at it:
If we start looking at the other tables, we can see that the "tracks" table supports multiple artists by using an array, which is still represented as a varchar due to our simplistic load script. We can also see that top_100 avoids this problem by either coincidence of all 100 tracks only having a single artist or by only referring to the primary artist.
Cleaning The Data With SQL
By now, we can see another potential contributor to our mismatched counts of distinct artist names. So how do we clean this up? To get to the individual artist names, we're going to need to convert this text into an actual array of some kind. Depending on the text that you're working with and the dialect of SQL, you might be able to parse it directly. In our case, an easy solution is to remove the square brackets and split the remaining comma-separated string into an array.
select *, string_to_array(regexp_replace(artists, '\[|\]|''|""', '', 'g'), ',') as artists_json
from dbt.tracks
This gives our analysts the ability to use Postgres array functions directly without first trying to parse and convert the column.
Creating A Model In dbt
We have a lot more work to do but we can use this to show of creation of a new dbt model by adding this SQL query to a file in our project under models/example/tracks_2.sql
We also need a reference to this file in our models/example/schema.yml file:
version: 2
models:
- name: my_first_dbt_model
description: "A starter dbt model"
columns:
- name: id
description: "The primary key for this table"
data_tests:
- unique
- not_null
- name: my_second_dbt_model
description: "A starter dbt model"
columns:
- name: id
description: "The primary key for this table"
data_tests:
- unique
- not_null
- name: tracks_clean
description: "A more useful version of tracks table"
When we run "dbt build" or "dbt run" now, we'll see the table in Postgres with the data from the query!
Clean Up
Before moving on, I need to clean up some of my work. I've taken the following steps:
Removed all of my tables and created a new schema called "raw"
Updated Python scripts to populate Kaggle tables to the "raw" schema
Updated the tracks model SQL file and schema.yml so that the table is named "tracks" instead of "tracks_2"
Remaining Columns
Now that we have our table and at least one cleaned up column, we can proceed with other columns and questions about the data.
Does popularity need to be a bigint? The answer is no. A quick query shows the range of values is between 0-100 even though there are thousands of tracks. An integer will suffice.
Should "explicit" be a boolean value? The answer is yes. The input data set only provides 0 and 1 as possible values.
Should we also split the id_artists column? Yep!
Does "key" ok to be a bigint? No, it does not. While numeric, there only seem to be 12 values.
We can update our model to capture these improvements:
{{ config(materialized='table') }}
select id,
"name",
cast(popularity as integer),
duration_ms,
case when "explicit" = 0 then FALSE
when "explicit" = 1 then TRUE
else null
end as "explicit",
string_to_array(regexp_replace(artists, '\[|\]|''|""', '', 'g'), ',') as artists_json,
string_to_array(regexp_replace(id_artists, '\[|\]|''|""', '', 'g'), ',') as id_artists_json,
to_date(release_date,'YYYY-MM-DD') as release_date,
danceability,
energy,
cast(key as integer),
loudness,
mode,
speechiness,
acousticness,
instrumentalness,
liveness,
valence,
tempo as beats_per_minute,
time_signature
from raw.tracks
Data Quality & Testing
As a musician, I have a lot of new questions about the usefulness of the data:
What exactly is "popularity"?
How are subjective values like danceability, speechiness, acousticness, instrumentalness, liveness, and valence measured?
How reliable are the values of key, tempo, and time_signature given that people can be incorrect and these can be hard to measure with computers (especially when they change mid-song)?
What does the "mode" column represent? Is this a musical mode like "Dorian" or is it something else like a system setting?
Is it safe to assume that time_signature represents the "upper numeral" of a traditional Western music time signature? Can we assume the "lower numeral" is always 4 (quarter note)? Do time signatures like 6/8 simplify to 3 and does that affect the analysis? Do time signatures like 12/8 simplify to 6/4 and then to 3 for mathematical simplicity despite it being a compound meter?
What about non-Western music which is made everywhere else in the world? How do time signatures, keys, and those subjective measures apply to to listeners worldwide?
I will leave those questions for now. Many of them are deeply complex to answer and might be best left for data scientists and analysts to explore. There are also more tables to clean!
Modeling Other Tables
The top 100 table requires similar data type changes for convenience like the tracks and artists tables:
Conform data types
Conform column name separators to underscores "_"
Correct typo of "valance" to "valence" to match other tables
Replace column names with more conventional and self-describing values (i.e. "decibels" instead of loudness.dB and "seconds" instead of "length".
Although the popular songs table shows some inconsistencies with other tables, those inconsistencies help us answer some questions.
Artist(s)_name values do not have square brackets but our regex_replace will work just the same
Released date is more precise in this table (includes month and day, in addition to year) but there's not reason to have it split by default so we'll make a date column for it.
bpm is used instead of tempo or beats.per.minute but we'll standardize them all to "beats_per_minute"
Subjective "-ness" columns now have a percent which is likely how the other tables are represented as well. However, the "%" can introduce issues with parsers so we'll replace it with "_percent".
Key and Mode are now text columns where key is something like "C#" and Mode is something like "Major" or "Minor". This answers the mode questions from earlier.
Dealing With Key And Mode
Our original values for key and mode were numeric. While numeric values might be useful for machine learning, they still aren't normalized between 0-1 so they're not that helpful. More importantly, it's not obvious how the values map. Does 0 equal A because most people think alphabetically or does 0 equal C because that's a common and simple key in music theory? Major and minor keys don't even make sense to order using the alphabet or music theory.
If there's a perfect correlation between the categorical and numeric values across tables, we might be able to work with an assumed mapping and confirm it with the data provider later. If there's no correlation at all, we should not attempt to map them.
Exploring the data yields sub-par results. Joining across song title, artist, and release_date should provide sufficient guarantee of comparing the same version of the same song. There's a fairly strong connection between the values of mode (major to 1 and minor to 0) but the result set is just over 100 songs so the 2 non-conforming instances make up 1-2% of the results which isn't great. The outcome is even less predictable for keys, so perhaps for now it's best to omit these fields from our data set to maintain the trustworthiness of our data.
Testing With Dbt
This is a good time to introduce dbt's testing feature and see how it can help us. .
Running Tests
With dbt, we can run tests just be typing:
chris@Mac-mini dbt_demo % dbt test
This will run all of the test cases in our project's test suite and display the results:
chris@Mac-mini dbt_demo % dbt test
18:41:19 Running with dbt=1.8.0
18:41:20 Registered adapter: postgres=1.8.0
18:41:20 Found 6 models, 4 data tests, 413 macros
18:41:20
18:41:20 Concurrency: 10 threads (target='dev')
18:41:20
18:41:20 1 of 4 START test not_null_my_first_dbt_model_id ............................... [RUN]
18:41:20 2 of 4 START test not_null_my_second_dbt_model_id .............................. [RUN]
18:41:20 3 of 4 START test unique_my_first_dbt_model_id ................................. [RUN]
18:41:20 4 of 4 START test unique_my_second_dbt_model_id ................................ [RUN]
18:41:20 1 of 4 FAIL 1 not_null_my_first_dbt_model_id ................................... [FAIL 1 in 0.04s]
18:41:20 2 of 4 PASS not_null_my_second_dbt_model_id .................................... [PASS in 0.04s]
18:41:20 4 of 4 PASS unique_my_second_dbt_model_id ...................................... [PASS in 0.04s]
18:41:20 3 of 4 PASS unique_my_first_dbt_model_id ....................................... [PASS in 0.04s]
18:41:20
18:41:20 Finished running 4 data tests in 0 hours 0 minutes and 0.14 seconds (0.14s).
18:41:20
18:41:20 Completed with 1 error and 0 warnings:
18:41:20
18:41:20 Failure in test not_null_my_first_dbt_model_id (models/example/schema.yml)
18:41:20 Got 1 result, configured to fail if != 0
18:41:20
18:41:20 compiled code at target/compiled/dbt_demo/models/example/schema.yml/not_null_my_first_dbt_model_id.sql
18:41:20
18:41:20 Done. PASS=3 WARN=0 ERROR=1 SKIP=0 TOTAL=4
From here, we can see that there's a failing test related to the default table my_first_dbt_model. While there are no tests in the "tests/" directory yet, we do have some checks defined in "models/schema.yml":
version: 2
models:
- name: my_first_dbt_model
description: "A starter dbt model"
columns:
- name: id
description: "The primary key for this table"
data_tests:
- unique
- not_null
- name: my_second_dbt_model
description: "A starter dbt model"
columns:
- name: id
description: "The primary key for this table"
data_tests:
- unique
- not_null
- name: tracks
description: "A more useful version of raw.tracks table"
- name: artists
description: "A more useful version of raw.artists table"
- name: top_100
description: "A more useful version of raw.top_100 table"
The table "my_first_dbt_model" is defined in "models/my_first_dbt_model.sql" by a query that generates both valid and invalid records and includes a commented WHERE statement to remove the null id records and make the test pass. Here's an abbreviated version without any comments or config:
with source_data as (
select 1 as id
union all
select null as id
)
select *
from source_data
where id is not null
We can confirm that our test works now:
chris@Mac-mini dbt_demo % dbt run
18:48:43 Running with dbt=1.8.0
18:48:43 Registered adapter: postgres=1.8.0
18:48:43 Found 6 models, 4 data tests, 413 macros
18:48:43
18:48:43 Concurrency: 10 threads (target='dev')
18:48:43
18:48:43 1 of 6 START sql table model dbt.artists ....................................... [RUN]
18:48:43 2 of 6 START sql table model dbt.my_first_dbt_model ............................ [RUN]
18:48:43 3 of 6 START sql table model dbt.popular_songs ................................. [RUN]
18:48:43 4 of 6 START sql table model dbt.top_100 ....................................... [RUN]
18:48:43 5 of 6 START sql table model dbt.tracks ........................................ [RUN]
18:48:43 2 of 6 OK created sql table model dbt.my_first_dbt_model ....................... [SELECT 1 in 0.10s]
18:48:43 3 of 6 OK created sql table model dbt.popular_songs ............................ [SELECT 953 in 0.10s]
18:48:43 4 of 6 OK created sql table model dbt.top_100 .................................. [SELECT 100 in 0.09s]
18:48:43 6 of 6 START sql view model dbt.my_second_dbt_model ............................ [RUN]
18:48:43 6 of 6 OK created sql view model dbt.my_second_dbt_model ....................... [CREATE VIEW in 0.03s]
18:48:49 1 of 6 OK created sql table model dbt.artists .................................. [SELECT 1104349 in 6.18s]
18:48:51 5 of 6 OK created sql table model dbt.tracks ................................... [SELECT 586672 in 7.78s]
18:48:51
18:48:51 Finished running 5 table models, 1 view model in 0 hours 0 minutes and 7.87 seconds (7.87s).
18:48:51
18:48:51 Completed successfully
18:48:51
18:48:51 Done. PASS=6 WARN=0 ERROR=0 SKIP=0 TOTAL=6
chris@Mac-mini dbt_demo % dbt test
18:48:53 Running with dbt=1.8.0
18:48:53 Registered adapter: postgres=1.8.0
18:48:53 Found 6 models, 4 data tests, 413 macros
18:48:53
18:48:53 Concurrency: 10 threads (target='dev')
18:48:53
18:48:53 1 of 4 START test not_null_my_first_dbt_model_id ............................... [RUN]
18:48:53 2 of 4 START test not_null_my_second_dbt_model_id .............................. [RUN]
18:48:53 3 of 4 START test unique_my_first_dbt_model_id ................................. [RUN]
18:48:53 4 of 4 START test unique_my_second_dbt_model_id ................................ [RUN]
18:48:53 3 of 4 PASS unique_my_first_dbt_model_id ....................................... [PASS in 0.04s]
18:48:53 2 of 4 PASS not_null_my_second_dbt_model_id .................................... [PASS in 0.05s]
18:48:53 1 of 4 PASS not_null_my_first_dbt_model_id ..................................... [PASS in 0.05s]
18:48:53 4 of 4 PASS unique_my_second_dbt_model_id ...................................... [PASS in 0.05s]
18:48:53
18:48:53 Finished running 4 data tests in 0 hours 0 minutes and 0.12 seconds (0.12s).
18:48:53
18:48:53 Completed successfully
18:48:53
18:48:53 Done. PASS=4 WARN=0 ERROR=0 SKIP=0 TOTAL=4
I will leave our table this way now, since it's just a starter table and we will eventually delete it.
Choosing Test Cases
There are lots of ways to choose or create new test cases. Test cases validate expectations of the data that aren't implicitly governed in the table in some other way. This is useful when data types don't cut it and features like constraints aren't available in your database or would introduce undesirable performance or portability trade-offs.
Usually, requirements analysis, data exploration, and defect resolution provide obvious test case scenarios. For now, we might need to pick a few at random.
Creating A Test Case
Earlier we noticed that the more subjective columns like acousticness and valence may all be percentages. We can enforce that expectation with tests.
I'll add some code to schema.yml to capture this expectation based on another test I found online:
- name: popular_songs
description: "A more useful version of raw.popular_songs table"
columns:
- name: valence_percent
tests:
- dbt_utils.accepted_range:
min_value: 0
max_value: 100
inclusive: true
This doesn't work though:
chris@Mac-mini dbt_demo % dbt test
19:01:25 Running with dbt=1.8.0
19:01:25 Registered adapter: postgres=1.8.0
19:01:25 [WARNING]: Deprecated functionality
The `tests` config has been renamed to `data_tests`. Please see
https://docs.getdbt.com/docs/build/data-tests#new-data_tests-syntax for more
information.
19:01:25 Found 6 models, 5 data tests, 413 macros
19:01:25
19:01:25 Concurrency: 10 threads (target='dev')
19:01:25
19:01:25 1 of 5 START test dbt_utils_accepted_range_popular_songs_valence_percent__True__100__0 [RUN]
19:01:25 2 of 5 START test not_null_my_first_dbt_model_id ............................... [RUN]
19:01:25 3 of 5 START test not_null_my_second_dbt_model_id .............................. [RUN]
19:01:25 4 of 5 START test unique_my_first_dbt_model_id ................................. [RUN]
19:01:25 5 of 5 START test unique_my_second_dbt_model_id ................................ [RUN]
19:01:26 1 of 5 ERROR dbt_utils_accepted_range_popular_songs_valence_percent__True__100__0 [ERROR in 0.03s]
19:01:26 2 of 5 PASS not_null_my_first_dbt_model_id ..................................... [PASS in 0.04s]
19:01:26 3 of 5 PASS not_null_my_second_dbt_model_id .................................... [PASS in 0.04s]
19:01:26 4 of 5 PASS unique_my_first_dbt_model_id ....................................... [PASS in 0.04s]
19:01:26 5 of 5 PASS unique_my_second_dbt_model_id ...................................... [PASS in 0.05s]
19:01:26
19:01:26 Finished running 5 data tests in 0 hours 0 minutes and 0.13 seconds (0.13s).
19:01:26
19:01:26 Completed with 1 error and 0 warnings:
19:01:26
19:01:26 Compilation Error in test dbt_utils_accepted_range_popular_songs_valence_percent__True__100__0 (models/example/schema.yml)
'dbt_utils' is undefined. This can happen when calling a macro that does not exist. Check for typos and/or install package dependencies with "dbt deps".
19:01:26
19:01:26 Done. PASS=4 WARN=0 ERROR=1 SKIP=0 TOTAL=5
This turns out to be due to the dbt package "dbt_utils" missing.
Installing dbt Packages
Dbt has its own package manager and some interesting extensions to the dbt-core tools. To use it, you need to create a new config file in the project root:
chris@Mac-mini dbt_demo % touch packages.yml
Then add your dependencies to it:
packages:
- package: dbt-labs/dbt_utils
version: 1.1.1
Now we tell dbt to update the project with the new config:
chris@Mac-mini dbt_demo % dbt deps
19:04:40 Running with dbt=1.8.0
19:04:40 Updating lock file in file path: /Users/chris/Code/dbt_demo/package-lock.yml
19:04:40 Installing dbt-labs/dbt_utils
19:04:40 Installed from version 1.1.1
19:04:40 Up to date!
And now our tests can run:
chris@Mac-mini dbt_demo % dbt test
19:04:45 Running with dbt=1.8.0
19:04:45 Registered adapter: postgres=1.8.0
19:04:45 Unable to do partial parsing because a project dependency has been added
19:04:46 [WARNING]: Deprecated functionality
The `tests` config has been renamed to `data_tests`. Please see
https://docs.getdbt.com/docs/build/data-tests#new-data_tests-syntax for more
information.
19:04:46 Found 6 models, 5 data tests, 527 macros
19:04:46
19:04:46 Concurrency: 10 threads (target='dev')
19:04:46
19:04:46 1 of 5 START test dbt_utils_accepted_range_popular_songs_valence_percent__True__100__0 [RUN]
19:04:46 2 of 5 START test not_null_my_first_dbt_model_id ............................... [RUN]
19:04:46 3 of 5 START test not_null_my_second_dbt_model_id .............................. [RUN]
19:04:46 4 of 5 START test unique_my_first_dbt_model_id ................................. [RUN]
19:04:46 5 of 5 START test unique_my_second_dbt_model_id ................................ [RUN]
19:04:46 2 of 5 PASS not_null_my_first_dbt_model_id ..................................... [PASS in 0.04s]
19:04:46 1 of 5 PASS dbt_utils_accepted_range_popular_songs_valence_percent__True__100__0 [PASS in 0.05s]
19:04:46 5 of 5 PASS unique_my_second_dbt_model_id ...................................... [PASS in 0.04s]
19:04:46 3 of 5 PASS not_null_my_second_dbt_model_id .................................... [PASS in 0.04s]
19:04:46 4 of 5 PASS unique_my_first_dbt_model_id ....................................... [PASS in 0.04s]
19:04:46
19:04:46 Finished running 5 data tests in 0 hours 0 minutes and 0.13 seconds (0.13s).
19:04:46
19:04:46 Completed successfully
19:04:46
19:04:46 Done. PASS=5 WARN=0 ERROR=0 SKIP=0 TOTAL=5
Test Wrap Up
We wrote new data tests, confirmed an expectation of our data that will be continually validated, and installed some new packages along the way. There's a lot left to test and new packages to explore like dbt-expectations inspired by the popular data testing tool Great Expectations.
Stay Tuned!
That's all for now but I'll be sure to have more soon. Send me messages and tell me what I'm doing right or wrong!
Here's some topics I want to get into:
CI/CD
Performance & Cost Optimization
Tables, Views, and Materialized Views
Semantic Layer & Metrics