When to use DuckDB
May 16, 2024
One Minute Takeaway
Use DuckDB for local development when you want fast and portable SQL abstractions. Scroll down for starter code.
More Reasons
DuckDB was designed as a compiled C++ package specifically for in-memory analytics use cases. This gives it certain advantages over pure Python, other databases, and Pandas.
Anyone who has leaned toward libraries like Pandas and databases like SQLite probably knows that writing pure Python for data processing can become a time-consuming exercise in data structures and algorithms. Hand-written code can be efficient, but it is also easy to make slow and buggy. SQL is more portable, and its declarative nature encourages concise expression of the data you want, leaving the procedural details to the engine.
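To illustrate the declarative style, here is a minimal sketch using Python's built-in sqlite3 module rather than DuckDB; the table and column names are made up for illustration:

```python
import sqlite3

# In-memory database with a small, fabricated events table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, timestamp TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(1, "2024-01-01"), (1, "2024-02-01"), (2, "2024-01-15")],
)

# Declare *what* you want -- the latest timestamp per id -- and let the
# engine decide how to compute it.
rows = conn.execute(
    "SELECT id, max(timestamp) FROM events GROUP BY id ORDER BY id"
).fetchall()
print(rows)  # [(1, '2024-02-01'), (2, '2024-01-15')]
```

The query states the result you want; there is no loop, no comparison logic, and no temporary dictionary to maintain by hand.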
When it comes to databases, not all lightweight databases are intended for analytical workloads (i.e. processing whole tables rather than single rows): SQLite, for example, is row-oriented and tuned for transactional access. Server databases like Postgres require installation and configuration to set up, which gets in the way when you don't need durability, concurrent access, or authorization controls.
Compared to Pandas, DuckDB offers speed and a smaller package size at the expense of Pandas's maturity and broad feature set. That might seem like a disadvantage at first, but once you have tried DuckDB and seen how snappy it is, you may start to feel that Pandas is bloated.
Test Results
I set up a test comparing DuckDB, pandasql, and Python built-ins on a small CSV input file (15 records, under 1 KB). The code reads the data, applies basic filtering and sorting, and writes a new file. DuckDB ran 12x faster than Pandas in this scenario, although it took twice as long as the Python built-ins. On the other hand, the DuckDB version needed 40% less code than the built-ins and took a fraction of the time to write. I'm not benchmarking other databases here since others have already done that.
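The numbers above came from my own setup; as a rough sketch of the kind of timing harness involved (the pipeline function and repeat count here are placeholders, not the actual benchmark):

```python
import timeit

def run_pipeline():
    # Placeholder for the read-filter-sort-write code under test.
    sorted(range(1000))

# Average wall-clock time over repeated runs to smooth out noise.
RUNS = 100
seconds = timeit.timeit(run_pipeline, number=RUNS)
print(f"average: {seconds / RUNS * 1e3:.3f} ms per run")
```

Repeating the pipeline and averaging matters on inputs this small, since a single run finishes in well under a millisecond.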
Code Samples
DuckDB Code
import duckdb

# Load the CSV into a table, inferring column types automatically.
load_csv_query = "CREATE TABLE events AS SELECT * FROM read_csv('events.csv');"
# Keep only the latest event per id, sorted by timestamp (mirrors the
# Pandas and built-ins examples below).
transform_query = """
CREATE TABLE latest_events AS
SELECT events.*
FROM events
JOIN (
    SELECT id, max(timestamp) AS timestamp
    FROM events
    GROUP BY id
) latest USING (id, timestamp)
ORDER BY timestamp;
"""
write_csv_query = "COPY latest_events TO 'duckdb_output.csv' (HEADER, DELIMITER ',');"
duckdb.sql(load_csv_query)
duckdb.sql(transform_query)
duckdb.sql(write_csv_query)
Pandas Code
import pandas
import pandasql
latest_events_query = """
WITH latest AS (
SELECT id, max(timestamp) AS timestamp
FROM events
GROUP BY id
)
SELECT events.*
FROM latest
LEFT JOIN events
ON latest.id = events.id AND latest.timestamp = events.timestamp
ORDER BY events.timestamp
"""
events = pandas.read_csv("events.csv")
latest = pandasql.sqldf(latest_events_query, locals())
latest.to_csv("pandas_output.csv", header=True, index=False)
Python Built-Ins Code
with open("events.csv", "r") as file:
    input_lines = file.readlines()
header = input_lines.pop(0)
sorted_lines = sorted(input_lines)

# Keep only the latest event per id.
latest_events = {}
for line in sorted_lines:
    id, _, _, new_timestamp, _ = line.split(",")
    if id not in latest_events:
        latest_events[id] = line
    else:
        _, _, _, old_timestamp, _ = latest_events[id].split(",")
        if new_timestamp > old_timestamp:
            latest_events[id] = line

# Sort by timestamp by temporarily moving it to the front of each row.
filtered_lines = []
for line in latest_events.values():
    id, title, ownerid, timestamp, payload = line.split(",")
    filtered_lines.append(",".join([timestamp, id, title, ownerid, payload]))
output_lines = sorted(filtered_lines)

# Move the columns back into their original order.
sorted_output_lines = []
for line in output_lines:
    timestamp, id, title, ownerid, payload = line.split(",")
    sorted_output_lines.append(",".join([id, title, ownerid, timestamp, payload]))
sorted_output_lines.insert(0, header)

with open("builtins_output.csv", "w") as file:
    file.writelines(sorted_output_lines)
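One caveat with the built-ins version: splitting on commas breaks as soon as any field (such as payload) contains a quoted comma. Python's built-in csv module handles quoting correctly; a small sketch, with the column layout assumed from the code above:

```python
import csv
import io

# A row whose payload field contains a comma, which a naive split(",") would break.
raw = 'id,title,ownerid,timestamp,payload\n1,hello,7,2024-01-01,"a,b"\n'

reader = csv.reader(io.StringIO(raw))
header = next(reader)
rows = list(reader)
print(rows[0])  # ['1', 'hello', '7', '2024-01-01', 'a,b'] -- payload stays one field
```

The sample data in my test happens to be comma-free, so the naive split works there, but csv.reader is the safer default.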
Additional Resources
Fivetran: How Fast Is DuckDB Really? - Compares performance of DuckDB to DatabaseX on larger hardware.