Supercharge your R workflows with DuckDB
Speed up your data wrangling in R without rewriting your codebase. DuckDB lets you keep using dplyr-style syntax while handling large datasets effortlessly.
Intro
In last week’s post, I made the case that R's ecosystem offers superior tools for data analysis compared to Python's. The response was overwhelming: people who know R agreed, Python fans pointed out that it’s in no way inadequate (I agree), and the wisest said to use whatever you prefer.
This week, I want to spotlight how the beloved dplyr grammar extends beyond in-memory data. Thanks to community-driven packages like dbplyr for SQL databases and dtplyr for data.table, you can keep writing the same tidy verbs while tapping into each backend’s speed and scalability, no syntax switch required.
When data.frame hits a wall
I used to hit this wall frequently on my 16GB RAM laptop. Between Chrome's endless hunger for memory, Slack, Spotify, and Windows, I'm typically down to about 2-3GB of available RAM before I even open RStudio.
My breaking point came last year while analyzing some ads data for my research. Everything was smooth when working with samples, but the moment I tried processing the full dataset (~120K rows with 30+ columns), R ground to a halt. Every tiny change meant waiting minutes for my dplyr pipelines to finish, which was quite frustrating to say the least.
I started digging for solutions. A data.frame’s memory usage grows linearly with the number of rows, but the real problem shows up once you start doing actual analysis: group_bys, joins, and transformations can make memory usage spike dramatically, because each step produces a new copy of your data.
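Here’s a small, hedged sketch of that behaviour (the column names are made up): each step in a pipeline returns a new object rather than modifying the original in place, so every intermediate result you keep assigned holds on to its own memory.
# Illustrative only: base R and dplyr return new objects instead of
# modifying in place, so intermediates you keep around add up.
library(dplyr)
raw      <- data.frame(id = 1:1e6, value = runif(1e6))
cleaned  <- filter(raw, value > 0.1)             # a new object, not raw modified in place
enriched <- mutate(cleaned, log_v = log(value))  # yet another object
# raw, cleaned, and enriched all stay in memory until you rm() them and gc() runs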
Python users face similar challenges with Pandas, but they've been quicker to adopt alternatives like PySpark. Meanwhile, many R users I talk to aren't aware that they have equally powerful options that maintain dplyr's elegant syntax.
Same Syntax, Different Backends
If your data still fits in memory, you might be able to get more performance out of data.table depending on what you’re doing. A package written in C, it's essentially a drop-in replacement for data.frame with added benefits:
Transformations are done in place (by reference), avoiding unnecessary copies.
Much faster file reads and writes using fread and fwrite (see the short sketch after this list).
Common operations are parallelized to use multiple CPU threads.
I did some performance comparisons on synthetic data, and it’s clear that data.table is faster across the board, especially if you’re doing a lot of aggregations. FYI, the complex pipeline in the graph below consists of a left join, mutate, aggregation, and sorting.
While data.table has its own syntax (incredibly powerful, but with a steeper learning curve), you can combine it with dplyr via dtplyr, giving you familiar syntax with better performance. data.table is a step in the right direction rather than a definitive answer, but if you’re running into memory or processing speed issues, it’s worth a try.
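Here’s a minimal sketch of what that combination looks like with dtplyr (file and column names are hypothetical); the verbs are plain dplyr, but data.table does the work underneath:
library(data.table)
library(dtplyr)
library(dplyr)

dt <- lazy_dt(fread("mydata.csv"))   # wrap a data.table for lazy dplyr translation
dt |>
  group_by(group) |>
  summarise(avg_value = mean(value)) |>
  arrange(desc(avg_value)) |>
  as_tibble()                        # nothing is computed until you ask for the result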
Why DuckDB
I can’t say enough good things about DuckDB, so I’ll keep it to a few points. It’s an in-process, SQL-based analytics engine with several benefits:
Analytics focused, state-of-the-art query execution engine
In-process, no need to connect to a server
No external dependencies
Works with CSV, JSON, Parquet, Avro out of the box
Can talk directly to Amazon S3, Postgres, SQLite etc.
Can use the disk if there’s not enough memory to execute the query
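That last point is configurable. Jumping slightly ahead to the R connection API covered below, here is a hedged sketch of capping DuckDB’s memory and giving it a scratch directory so big queries spill to disk instead of failing (the limit and path are arbitrary examples):
library(duckdb)
con <- dbConnect(duckdb())
dbExecute(con, "SET memory_limit = '4GB'")          # cap how much RAM DuckDB may use
dbExecute(con, "SET temp_directory = 'duckdb_tmp'") # where to spill when the cap is hit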
I set up a quick benchmark to compare all three backends, consisting of ingesting 10 million rows, performing an aggregation, and sorting. The results matched my expectations:
We see that DuckDB is a whopping ~34× faster than a data.frame and roughly 4.5× faster than data.table, even though it’s doing extra work such as writing files back to disk transactionally and maintaining statistics, schema, and more. If you’re interested, the folks at DuckDB maintain a more comprehensive benchmark.
I recently watched a talk in which a team described migrating their monthly data processing jobs from Snowflake to DuckDB running on EC2 instances, achieving about a 92% cost reduction. This demonstrates that DuckDB doesn’t just save you time; it can save you money too. Now that you’re hopefully convinced to give it a try, let’s dive into using it with R.
Incorporating DuckDB in your workflows
The best way to incorporate DuckDB into your data wrangling and analysis is to let it take care of the most resource-intensive tasks (joins, aggregations, complex transformations, pivots, JSON parsing and cleaning) so that once the data is ready, you can focus on visualization or move straight into modeling with your favorite R package.
Initialization
Getting started with DuckDB is easy; all you need are the duckdb, dplyr, and dbplyr packages:
install.packages("duckdb")
install.packages("dplyr")
install.packages("dbplyr")
And as with most databases, you’ll first need to initialize a connection object:
library(duckdb)
library(dplyr)
# Connect to an in-memory DuckDB database
con <- dbConnect(duckdb())
# OR
con <- dbConnect(duckdb(), dbdir = ":memory:")
# Connect to an existing database file or create a new one
con <- dbConnect(duckdb(), dbdir = "mydata.db", read_only = FALSE)
Keep in mind that when using the in-memory option, no data will persist on disk (i.e., all data is lost when you exit the R process). The read_only flag is optional and can be set to TRUE if you’d like to prevent any modifications to the database.
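For example, here’s a hedged sketch of opening an existing database file read-only and cleanly shutting it down afterwards:
con_ro <- dbConnect(duckdb(), dbdir = "mydata.db", read_only = TRUE)
# ... run your queries ...
dbDisconnect(con_ro, shutdown = TRUE)   # close the connection and shut the database down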
Loading data
This can be done in several ways, and since df is just a reference to a table inside DuckDB rather than the data itself, your R workspace’s memory consumption remains low.
# Use the tbl method
df <- tbl(con, "read_csv('test_data.csv')")
# OR
dbExecute(con, "create table df as select * from read_csv('test_data.csv')")
df <- tbl(con, "df")
As I mentioned earlier, DuckDB natively supports most data formats, so you can effortlessly load your data with functions like read_csv(), read_parquet(), or read_json(). If you have data in an existing data.frame, you can move it into your database like so:
dbWriteTable(con, "testdata", testdata, overwrite = TRUE)
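The same pattern works for the other formats too; for example (file names are hypothetical):
sales  <- tbl(con, "read_parquet('sales.parquet')")
events <- tbl(con, "read_json('events.json')")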
Working with dplyr
Since DuckDB is an SQL engine, you could write SQL queries and work with your data that way:
# Executes immediately because of dbGetQuery
result <- dbGetQuery(con,
"SELECT t.group, avg(value) as avg_value
FROM read_csv('test_data.csv') AS t
GROUP BY t.group
ORDER BY avg_value DESC")
But working with SQL strings in your R code is no fun. Thankfully, once your data is loaded into DuckDB and you have a table reference like our df variable, you can wrangle it exactly as you would with dplyr:
# Using dplyr
result <- df |>
group_by(group) |>
summarize(avg_value = mean(value)) |>
arrange(desc(avg_value))
It’s important to understand what’s happening under the hood: the result variable is also just a lazy reference, and the transformations don’t happen right away. Instead, DuckDB creates an optimized execution plan that will only be evaluated once we need our data. We can even see the optimized query plan for our transformation:
explain(result)
--------
<SQL>
SELECT "group", AVG("value") AS avg_value
FROM (FROM read_csv('test_data.csv')) q01
GROUP BY "group"
ORDER BY avg_value DESC
<PLAN>
physical_plan
┌───────────────────────────┐
│ ORDER_BY │
│ ──────────────────── │
│ avg(q01."value") DESC │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ HASH_GROUP_BY │
│ ──────────────────── │
│ Groups: #0 │
│ Aggregates: avg(#1) │
│ │
│ ~10313831 Rows │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ PROJECTION │
│ ──────────────────── │
│ group │
│ value │
│ │
│ ~20627662 Rows │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ READ_CSV │
│ ──────────────────── │
│ Function: READ_CSV │
│ │
│ Projections: │
│ group │
│ value │
│ │
│ ~20627662 Rows │
└───────────────────────────┘
To trigger execution, call the collect() method, which runs the query and returns the result as a regular R data.frame.
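For instance, sticking with the pipeline above:
result_df <- collect(result)   # runs the query and pulls the result into R
head(result_df)
If you’d rather keep the result inside DuckDB (say, as input to further heavy transformations), dplyr’s compute() materializes it as a table in the database instead of bringing it into R.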
The sql() method
Let’s talk about one of the things that got me hooked on DuckDB: it ships with a huge number of built-in functions that are well optimized for its query engine. There are functions for array and text manipulation, pattern matching, working with dates and times, JSON parsing, and more.
Remember the ads data I mentioned earlier? It had numerous columns full of JSON strings, and I needed to unpack them into rows before I could analyze anything. Back when I was working with data.frame in R, I wrote custom functions for that parsing, which really slowed things down. Switching to DuckDB’s built-in JSON and text functions made the whole process much simpler and much faster. Let’s see how you can do the same without writing full-blown SQL queries.
The dplyr package provides a nifty sql() method which allows you to interweave functions from your SQL backend with your dplyr query. Let’s see an example:
df |>
select(id, demographics) |>
mutate(
demo = sql("unnest(from_json(demographics))")
) |>
mutate(
age = sql("json_extract(demo, '$.age')"),
gender = sql("json_extract(demo, '$.gender')")
)
This code snippet transforms the JSON data in the demographics column by first unnesting it and then extracting specific fields (age and gender) into their own columns. Unnesting a JSON object is a heavy task, and being able to offload it to DuckDB while maintaining a tidyverse workflow is amazing.
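The same trick works for DuckDB’s text and date helpers. Here’s another hedged sketch (the created_at and page_url columns are made up for illustration):
df |>
  mutate(
    day    = sql("strftime(created_at, '%Y-%m-%d')"),
    domain = sql("regexp_extract(page_url, 'https?://([^/]+)', 1)")
  )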
Things to watch out for
One thing to keep in mind: since all processing takes place inside the query engine, you can’t use custom R functions within your dplyr pipelines when working with DuckDB. I understand that this can be a dealbreaker depending on what you’re doing, but it has honestly been a non-issue for me. Here’s how you can navigate it:
1. By leveraging sql() alongside DuckDB’s functions, you’ll quickly see that it has built-in support for almost anything you might need to do.
2. Or use collect() before running your custom functions, for example:
# This won't work
tbl(con, "events") |>
mutate(score = custom_function(value))
# Instead, do this
tbl(con, "events") |>
collect() |>
mutate(score = custom_function(value))
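If you then need the scored data back inside DuckDB for further heavy lifting, you can write it to a table with dbWriteTable, as we did earlier (a sketch, with custom_function standing in for your own code):
scored <- tbl(con, "events") |>
  collect() |>
  mutate(score = custom_function(value))   # custom_function is a placeholder
dbWriteTable(con, "events_scored", scored, overwrite = TRUE)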
Conclusion
DuckDB offers impressive performance capabilities and gels beautifully with R’s elegant syntax. If you encounter skepticism about R's capacity for large data processing, this approach demonstrates otherwise. With these tools, you can efficiently handle millions of rows on standard hardware while maintaining R's readable, expressive syntax.
What performance challenges have you encountered with R? Have you explored these alternative backends? I'd value hearing your experiences.
Finding value in these insights? Please consider subscribing if you haven't already and spread the word to colleagues who might benefit.