I'm in the habit of clumping similar tasks together into a single line. For example, if I need to filter on a, b, and c in a data table, I'll put them together in one [] with ANDs. Yesterday, I noticed that in my particular case this was incredibly slow and tested chaining filters instead. I've included an example below.
First, I seed the random number generator, load data.table, and create a dummy data set.
# Set RNG seed
set.seed(-1)
# Load libraries
library(data.table)
# Create data table
dt <- data.table(a = sample(1:1000, 1e7, replace = TRUE),
                 b = sample(1:1000, 1e7, replace = TRUE),
                 c = sample(1:1000, 1e7, replace = TRUE),
                 d = runif(1e7))
Next, I define my methods. The first approach chains filters together. The second ANDs the filters together.
# Chaining method
chain_filter <- function(){
  dt[a %between% c(1, 10)
     ][b %between% c(100, 110)
       ][c %between% c(750, 760)]
}
# Anding method
and_filter <- function(){
  dt[a %between% c(1, 10) & b %between% c(100, 110) & c %between% c(750, 760)]
}
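For clarity, chaining here is just repeated subsetting: each [] returns a new, smaller data table that the next filter runs on. An unrolled sketch of what chain_filter() does (the tmp names are purely illustrative):
# Unrolled equivalent of chain_filter(); tmp1/tmp2 are illustrative names only
tmp1 <- dt[a %between% c(1, 10)]        # first filter scans the full table
tmp2 <- tmp1[b %between% c(100, 110)]   # second filter scans only the rows that passed the first
res  <- tmp2[c %between% c(750, 760)]   # third filter scans only the rows that passed both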
Here, I check they give the same results.
# Check both give same result
identical(chain_filter(), and_filter())
#> [1] TRUE
Finally, I benchmark them.
# Benchmark
microbenchmark::microbenchmark(chain_filter(), and_filter())
#> Unit: milliseconds
#>            expr       min        lq      mean    median        uq       max neval cld
#>  chain_filter()  25.17734  31.24489  39.44092  37.53919  43.51588  78.12492   100  a
#>    and_filter()  92.66411 112.06136 130.92834 127.64009 149.17320 206.61777   100   b
Created on 2019-10-25 by the reprex package (v0.3.0)
In this case, chaining reduces run time by about 70%. Why is this the case? I mean, what's going on under the hood in data table? I haven't seen any warnings against using &, so I was surprised that the difference is so big. In both cases they evaluate the same conditions, so that shouldn't account for the difference. If anything, I expected the AND case to be faster: & is a quick operator, and the data table only has to be filtered once (i.e., using the logical vector resulting from the ANDs), as opposed to three times in the chaining case.
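One way to make that intuition concrete is to look at how many rows each chained step actually has to scan. Since a is sampled uniformly from 1:1000, only about 1% of the 1e7 rows should survive the first filter on average, so the later filters operate on much smaller tables. A quick check (assuming the same dt as above):
# Row counts seen by each successive filter in the chain (same dt as above)
nrow(dt)                                                  # 10,000,000 rows scanned by the first filter
nrow(dt[a %between% c(1, 10)])                            # rows scanned by the second filter
nrow(dt[a %between% c(1, 10)][b %between% c(100, 110)])   # rows scanned by the third filter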
Bonus question
Does this principle hold for data table operations in general? Is modularising tasks always a better strategy?