When to use DuckDB

May 16, 2024

One Minute Takeaway

Use DuckDB for local development when you want fast and portable SQL abstractions. Scroll down for starter code.

More Reasons

DuckDB is an in-process analytical database engine written in C++ and designed specifically for analytics (OLAP) use cases. This gives it certain advantages over pure Python, other databases, and Pandas.

Anyone who has reached for libraries like Pandas or databases like SQLite probably knows that writing data processing in pure Python can turn into a time-consuming exercise in data structures and algorithms. It can be efficient, but it can also be slow and buggy if you aren't careful. SQL is more portable, and its declarative nature encourages concise expression of the data you want, leaving the complex procedural logic to the engine.

When it comes to databases, not all lightweight databases are intended for analytical workloads (i.e. processing whole tables rather than single rows). Databases like Postgres or SQLite require installation and configuration to set up, which can get in the way when you don't need durability, concurrent access, or authorization controls.

Compared to Pandas, DuckDB offers speed and a smaller package size at the expense of Pandas' maturity and broad feature set. That might seem like a disadvantage at first, but once you have tried DuckDB and seen how snappy it is, you may start to feel that Pandas is bloated.

Test Results

I set up a test comparing DuckDB, pandasql, and Python built-ins on a small CSV input file (15 records, under 1 KB). The code reads the data, applies basic filtering and sorting, and writes a new file. In this scenario DuckDB performed 12x better than Pandas (via pandasql), although it took twice as long as the Python built-ins. That said, the DuckDB version required about 40% less code than the built-ins and took a fraction of the time to write. I'm not testing against other databases here since others have done that already.
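For anyone who wants to reproduce something similar, here is a minimal sketch of how such a comparison might be timed. The helper name and repeat count are my own choices, not details of the original test:

```python
import time

def best_duration(fn, repeats=5):
    """Run fn several times and return the best wall-clock time in seconds.

    Taking the minimum over several runs reduces noise from other processes.
    """
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best

# Example: time a trivial stand-in workload.
elapsed = best_duration(lambda: sorted(range(10_000)))
```

Each candidate implementation (DuckDB, pandasql, built-ins) would be wrapped in a function and passed to the timer in turn.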

Code Samples

DuckDB Code

import duckdb

load_csv_query = "CREATE TABLE my_table AS SELECT * FROM read_csv('data.csv');"

transform_query = """
CREATE TABLE new_table AS
SELECT *
FROM my_table
"""

write_csv_query = "COPY new_table TO 'output.csv' (HEADER, DELIMITER ',');"

duckdb.sql(load_csv_query)
duckdb.sql(transform_query)
duckdb.sql(write_csv_query)

Pandas Code

import pandas
import pandasql

latest_events_query = """
WITH latest AS (
    SELECT id, max(timestamp) AS timestamp
    FROM events
    GROUP BY id
)
SELECT events.*
FROM latest
LEFT JOIN events
ON latest.id = events.id AND latest.timestamp = events.timestamp
ORDER BY events.timestamp
"""

events = pandas.read_csv("events.csv")
latest = pandasql.sqldf(latest_events_query, locals())
latest.to_csv("pandas_output.csv", header=True, index=False)

Python Built-Ins Code

with open("events.csv", "r") as file:
    input_lines = file.readlines()

header = input_lines.pop(0)

# Keep only the latest line per id, comparing timestamps as strings.
latest_events = {}
for line in input_lines:
    id, _, _, new_timestamp, _ = line.split(",")
    if id not in latest_events:
        latest_events[id] = line
    else:
        _, _, _, old_timestamp, _ = latest_events[id].split(",")
        if new_timestamp > old_timestamp:
            latest_events[id] = line

# Sort by timestamp: move the timestamp to the front of each line,
# sort lexicographically, then restore the original column order.
filtered_lines = []
for line in latest_events.values():
    id, title, ownerid, timestamp, payload = line.split(",")
    filtered_lines.append(",".join([timestamp, id, title, ownerid, payload]))

sorted_output_lines = []
for line in sorted(filtered_lines):
    timestamp, id, title, ownerid, payload = line.split(",")
    sorted_output_lines.append(",".join([id, title, ownerid, timestamp, payload]))

sorted_output_lines.insert(0, header)

with open("builtins_output.csv", "w") as file:
    file.writelines(sorted_output_lines)
