Kicking tyres

UPDATED 2024-04-21: Fixed typos and added package version numbers for the benchmarks.

Background

Over the last couple of months, a few things pushed me to finally have a play with DuckDB.

A sceptic

To the annoyance of my better half, I’m a naturally sceptical person. People can tell me how quick a piece of software is but, until I’ve tried it myself, there will always be some doubt in the back of my mind. To that end, whilst I could simply look at the reinvigorated H2O.ai benchmarks, I needed something a little simpler that I could run to get a feel for things. Fortunately duckplyr provides a demo repository containing a handful of examples I could noodle at. I grabbed a ☕ and my trusty measuring tape (data.table).

Benchmarks

The demo examples make use of a parquet data set containing NY Taxi trip data from 2019. I had not really looked at this data before but do remember it being a popular thing to write about a few years ago. There are four queries that can be reimplemented with data.table:

  1. Average tip by weekday and hour.
  2. Median tip by number of passengers.
  3. Popularity of Manhattan pick-up/drop-off combinations.
  4. Percentage of trips that report no tip, grouped by pick-up/drop-off borough and ranked by the percentage of zero-tip trips.

I ran the duckplyr (v0.3.2, utilising duckdb v0.10.1) and data.table (v1.15.4) implementations of each query across the full year’s worth of data. I also ran the same queries across a smaller, three-month subset of the data.
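
For reference, the timings were gathered roughly as follows. This is a minimal sketch: time_query() and the query1_duckplyr()/query1_datatable() wrappers are hypothetical names standing in for the implementations sketched later in this post.

# hypothetical harness: median elapsed time over 5 repetitions
time_query <- function(fun, reps = 5L) {
    median(vapply(seq_len(reps), function(i) system.time(fun())[["elapsed"]], numeric(1)))
}
time_query(function() query1_duckplyr(taxi_dat))
time_query(function() query1_datatable(taxi_dat))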

Median timings (seconds) over 5 repetitions

  months   benchmark   duckplyr   data.table   ratio
      12           1       0.27         6.46   24.17
      12           2       1.05         5.50    5.23
      12           3       0.40        27.80   69.36
      12           4       0.59         6.63   11.32
       3           1       0.08         1.59   21.10
       3           2       0.26         1.38    5.26
       3           3       0.11         6.91   60.25
       3           4       0.19         1.71    8.83

Looking at the table above, it is fair to say my mind was 🤯. Depending on the query and data size, duckplyr was performing between 5 and 70 times as fast as data.table. More impressive to me, though, was the absolute performance on the largest, 12 month, data set. This was a data.frame occupying ~15 GB of memory in my R session, and duckplyr was able to perform all 4 queries in a little over 2 seconds! Now it was time for 🍷 to help take this in.

Replication details

Sketches of the different implementations are given below but, for those interested, there is a repository set up with a more replicable workflow for running the benchmarks.

I’m going to keep replicating the benchmark across new data.table releases. Development there is picking up speed again and I’m sure we will see further performance improvements as time goes on¹.

Due to the lazy nature of duckdb, a like-for-like translation of the examples into data.table syntax could be quite inefficient memory-wise. For this reason I’ve tried, in each example, to code what I think is a reasonable approach. That said, especially where joins are involved, it would not surprise me if they could be improved further. Please reach out if so.
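
To illustrate the memory point with a hypothetical sketch: a literal step-by-step translation materialises a full intermediate copy at each stage, whereas data.table can add columns in place with :=/let().

# naive translation: each [] step materialises an intermediate copy
tmp <- taxi_dat[total_amount > 0]
tmp <- tmp[, .(tip_pct = 100 * tip_amount / total_amount)]

# by-reference alternative: the column is added in place, no copy of the table
# (assuming setDT(taxi_dat) has already been run)
taxi_dat[, let(tip_pct = 100 * tip_amount / total_amount)]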

One thing I was conscious of was how the data got loaded into memory within R. Whilst I could have used duckdb to load the data directly from the parquet files, I was unsure if this approach would utilise any ALTREP-iness and indirectly benefit duckplyr. For this reason I took a somewhat convoluted approach: first loading the parquet data with duckdb, then writing it back to disk as a CSV and, finally, reloading it back into memory with data.table. Perhaps this was unnecessary but 🤷.
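
Sketched out (with hypothetical file paths), the loading dance looked something like this:

library(DBI)
library(duckdb)
library(data.table)

# load the parquet files into R via duckdb
con <- dbConnect(duckdb())
taxi_dat <- dbGetQuery(con, "SELECT * FROM read_parquet('taxi-2019/*.parquet')")
dbDisconnect(con, shutdown = TRUE)

# round-trip through csv so data.table works with plain, freshly-read vectors
fwrite(taxi_dat, "taxi-2019.csv")
taxi_dat <- fread("taxi-2019.csv", data.table = FALSE)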

Query 1
# duckplyr
taxi_dat |>
    as_duckplyr_df() |>
    filter(total_amount > 0) |>
    mutate(
        tip_pct = 100 * tip_amount / total_amount,
        dn = wday(pickup_datetime),
        hr = hour(pickup_datetime)
    ) |>
    summarise(avg_tip_pct = mean(tip_pct), n = n(), .by = c(dn, hr)) |>
    arrange(desc(avg_tip_pct)) |>
    as.data.frame()

# data.table
setDT(taxi_dat)
taxi_dat[total_amount > 0
        ][, let(tip_pct = 100 * tip_amount / total_amount,
                dn = wday(pickup_datetime),
                hr = hour(pickup_datetime))
        ][, .(avg_tip_pct = mean(tip_pct), n = .N), by = c("dn", "hr")
        ][order(avg_tip_pct, decreasing = TRUE)]

Query 2
# duckplyr
taxi_dat |>
    as_duckplyr_df() |>
    filter(total_amount > 0) |>
    mutate(tip_pct = 100 * tip_amount / total_amount) |>
    summarise(med_tip_pct = median(tip_pct), n = n(), .by = passenger_count) |>
    arrange(desc(passenger_count)) |>
    as.data.frame()

# data.table
setDT(taxi_dat)
taxi_dat[total_amount > 0
        ][, tip_pct := 100 * tip_amount / total_amount
        ][, .(med_tip_pct = median(tip_pct), n = .N), by = passenger_count
        ][order(passenger_count, decreasing = TRUE)]

Query 3
# duckplyr
zone_map <- as_duckplyr_df(zone_dat) |> filter(Borough == "Manhattan")

taxi_dat |>
    as_duckplyr_df() |>
    filter(total_amount > 0) |>
    inner_join(zone_map, by = join_by(pickup_location_id == LocationID)) |>
    inner_join(zone_map, by = join_by(dropoff_location_id == LocationID)) |>
    select(start_neighbourhood = Zone.x, end_neighbourhood = Zone.y) |>
    summarise(
        num_trips = n(),
        .by = c(start_neighbourhood, end_neighbourhood)
    ) |>
    arrange(desc(num_trips), start_neighbourhood, end_neighbourhood) |>
    as.data.frame()

# data.table
zone_map <- setDT(zone_dat)[Borough == "Manhattan"]
out <-
    setDT(taxi_dat)[total_amount > 0
    ][zone_map, on = "pickup_location_id == LocationID", nomatch = NULL
    ][zone_map, on = "dropoff_location_id == LocationID", nomatch = NULL
    ][, .(start_neighbourhood = Zone, end_neighbourhood = i.Zone)
    ][, .(num_trips = .N), by = start_neighbourhood:end_neighbourhood]
setorder(out, -num_trips, start_neighbourhood, end_neighbourhood)[]

Query 4
# duckplyr
zone_map <- as_duckplyr_df(zone_dat)

num_trips_per_borough <-
    taxi_dat |>
    as_duckplyr_df() |>
    filter(total_amount > 0) |>
    inner_join(zone_map, by = join_by(pickup_location_id == LocationID)) |>
    inner_join(zone_map, by = join_by(dropoff_location_id == LocationID)) |>
    mutate(pickup_borough = Borough.x, dropoff_borough = Borough.y) |>
    select(pickup_borough, dropoff_borough, tip_amount) |>
    summarise(num_trips = n(), .by = c(pickup_borough, dropoff_borough))

num_trips_per_borough_no_tip <-
    taxi_dat |>
    as_duckplyr_df() |>
    filter(total_amount > 0, tip_amount == 0) |>
    inner_join(zone_map, by = join_by(pickup_location_id == LocationID)) |>
    inner_join(zone_map, by = join_by(dropoff_location_id == LocationID)) |>
    mutate(pickup_borough = Borough.x, dropoff_borough = Borough.y) |>
    summarise(
        num_zero_tip_trips = n(),
        .by = c(pickup_borough, dropoff_borough)
    )

num_trips_per_borough |>
    inner_join(
        num_trips_per_borough_no_tip,
        by = join_by(pickup_borough, dropoff_borough)
    ) |>
    mutate(percent_zero_tips_trips = 100 * num_zero_tip_trips / num_trips) |>
    select(pickup_borough, dropoff_borough, num_trips, percent_zero_tips_trips) |>
    arrange(desc(percent_zero_tips_trips), pickup_borough, dropoff_borough) |>
    as.data.frame()

# data.table
setDT(taxi_dat)
setDT(zone_dat)
out <- taxi_dat[total_amount > 0, .(pickup_location_id, dropoff_location_id, tip_amount)]
ntpb <- out[, .(num_trips = .N), by = pickup_location_id:dropoff_location_id]
out <- out[tip_amount == 0, .(num_zero_tip_trips = .N),
           by = pickup_location_id:dropoff_location_id
          ][ntpb, on = .NATURAL
          ][zone_dat, on = "pickup_location_id == LocationID", nomatch = NULL
          ][zone_dat, on = "dropoff_location_id == LocationID", nomatch = NULL]
setnafill(out, fill = 0, nan = NA, cols = "num_zero_tip_trips")
setnames(out, c("Borough", "i.Borough"), c("pickup_borough", "dropoff_borough"))
out <- out[, .(num_trips = sum(num_trips), num_zero_tip_trips = sum(num_zero_tip_trips)),
           by = c("pickup_borough", "dropoff_borough")]
out[, let(percent_zero_tips_trips = 100 * num_zero_tip_trips / num_trips,
          num_zero_tip_trips = NULL)]
setorder(out, -percent_zero_tips_trips, pickup_borough, dropoff_borough)[]

Final thoughts

I’m excited to see what the future will bring with duckplyr and will continue to kick the tyres over the coming months. It’s still early days and there are inevitably a few rough edges, but the potential is huge.

The package is being maintained/developed by Kirill Müller via a partnership of cynkra, DuckDB and Posit. Poking around GitHub, I’ve also noticed the involvement of Romain François, which is great to see. Hopefully this collaboration continues well into the future.

I tried writing this final sentence a few times, wanting to convey my view that R is the perfect language for creating these elegant interfaces to other software. Eventually I realised I could find no better words than those of Kirill in his recent R Consortium article², so will simply end with some selected paragraphs from there:

R has unique strengths as a powerful interface language. R treats data frames as first-class data structures. Functions and expressions are first-class objects, enabling easy parsing, computing, and emitting code, fostering robust programming practices. Moreover, R’s “pass by value” semantics (to be more accurate, “pass by reference and copy on write”) ensure that functions do not inadvertently alter your data. This eliminates concerns over state management and makes data manipulation both predictable and secure.

And here’s the true power of R: As an interface language, R enables the use of external, high-speed engines—be it DuckDB, Arrow, traditional databases, or data processing tools like data.table and collapse—for computation, while R itself is used to compose the commands for these engines. This integration showcases R’s versatility and efficiency by leveraging the strengths of these specialized engines for heavy lifting, thereby bypassing its performance limitations in certain areas.

Therefore, the focus should not be just on performance in isolation but rather on what can be achieved through R’s integration with other systems. This flexibility allows R to extend its utility well beyond mere data processing, making it a potent tool not only for technical users but also accessible to those with less technical expertise. The intuitive syntax of R, especially with domain-specific languages like dplyr, makes it exceptionally user-friendly, resembling plain English and facilitating a smoother workflow for a wide range of tasks.

Fin.

  1. The current data.table performance is still far from shabby and it’s important to note that, contrary to what many will say, the package has always been about more than just raw speed. With its elegant syntax and Swiss Army knife-like functionality, I’m sure I’ll still be getting plenty of use out of it for many years to come.

  2. Well worth a read. Building Data Highways: Kirill Müller’s Journey in Enhancing R’s Database, April 18, 2024