I've been wanting to demonstrate to a friend the elegance and speed of dplyr's join verbs (e.g. inner_join()) over base R and simple subsetting. I took a big dataset (from the nycflights13 package), started with a simple task, and to my surprise base R with simple subsetting was up to 10 times faster! So I could only really demonstrate the elegance, not the speed.
My question is: what am I missing? When do dplyr's join verbs surpass base R and simple subsetting in performance? Do they ever?
(P.S.: I know about data.table's excellent performance; I'm asking about dplyr.)
My Demo:
library(tidyverse)
library(nycflights13)
library(microbenchmark)
dim(flights)
[1] 336776 19
dim(airports)
[1] 1458 8
The task: get the unique tailnums of all planes in flights whose destination airport's tzone is "America/New_York":
base_no_join <- function() {
  # base R: subset tailnum by destinations whose airport is in the NY time zone
  unique(flights$tailnum[flights$dest %in% airports$faa[airports$tzone == "America/New_York"]])
}
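As an aside, base R can also do a real join via merge(); a sketch (the function name is mine, and I left it out of the checks and benchmark below):
base_join <- function() {
  # base R join: merge airports onto flights, then subset by tzone
  joined <- merge(flights, airports, by.x = "dest", by.y = "faa")
  unique(joined$tailnum[joined$tzone == "America/New_York"])
}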
dplyr_no_join <- function() {
  # dplyr without a join: filter by the same vector of airport codes
  flights %>%
    filter(dest %in% (airports %>%
                        filter(tzone == "America/New_York") %>%
                        pull(faa))) %>%
    pull(tailnum) %>%
    unique()
}
dplyr_join <- function() {
  # dplyr with a join: attach the airports columns, then filter by tzone
  flights %>%
    inner_join(airports, by = c("dest" = "faa")) %>%
    filter(tzone == "America/New_York") %>%
    pull(tailnum) %>%
    unique()
}
Check that all three give the same results:
all.equal(dplyr_join(), dplyr_no_join())
[1] TRUE
all.equal(dplyr_join(), base_no_join())
[1] TRUE
Now benchmark:
microbenchmark(base_no_join(), dplyr_no_join(), dplyr_join(), times = 10)
Unit: milliseconds
            expr     min      lq     mean   median       uq      max neval
  base_no_join()  9.7198 10.1067 13.16934 11.19465  13.4736  24.2831    10
 dplyr_no_join() 21.2810 22.9710 36.04867 26.59595  34.4221 108.0677    10
    dplyr_join() 60.7753 64.5726 93.86220 91.10475 119.1546 137.1721    10
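Two variants I have not benchmarked, which I suspect would close some of the gap: dplyr's filtering join semi_join(), which keeps only the flights columns and never materializes the airports ones, and pre-filtering airports so the right-hand side of the join is tiny. Sketches (the function names are mine):
dplyr_semi_join <- function() {
  # filtering join: keep flights rows that have a match in the filtered airports
  flights %>%
    semi_join(filter(airports, tzone == "America/New_York"),
              by = c("dest" = "faa")) %>%
    pull(tailnum) %>%
    unique()
}
dplyr_join_prefiltered <- function() {
  # shrink airports before the join instead of filtering afterwards
  flights %>%
    inner_join(filter(airports, tzone == "America/New_York"),
               by = c("dest" = "faa")) %>%
    pull(tailnum) %>%
    unique()
}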
Please help me find an example that shows this join's superiority, if one exists.
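For instance, is it a case like the following, where the result needs columns from both tables and no amount of subsetting can substitute for the join? A sketch:
# counting flights per destination time zone needs an airports column
# in the output, not just for filtering
flights %>%
  inner_join(airports, by = c("dest" = "faa")) %>%
  count(tzone)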