Performance benefits of chaining over ANDing when filtering a data table

I'm in the habit of clumping similar tasks together into a single line. For example, if I need to filter on a, b, and c in a data.table, I'll put them together in one [] with ANDs. Yesterday, I noticed that in my particular case this was incredibly slow, so I tested chaining the filters instead. I've included an example below.

First, I seed the random number generator, load data.table, and create a dummy data set.

# Set RNG seed
set.seed(-1)

# Load libraries
library(data.table)

# Create data table
dt <- data.table(a = sample(1:1000, 1e7, replace = TRUE),
                 b = sample(1:1000, 1e7, replace = TRUE),
                 c = sample(1:1000, 1e7, replace = TRUE),
                 d = runif(1e7))

Next, I define my methods. The first approach chains filters together. The second ANDs the filters together.

# Chaining method
chain_filter <- function(){
  dt[a %between% c(1, 10)
     ][b %between% c(100, 110)
       ][c %between% c(750, 760)]
}

# Anding method
and_filter <- function(){
  dt[a %between% c(1, 10) & b %between% c(100, 110) & c %between% c(750, 760)]
}

Here, I check they give the same results.

# Check both give same result
identical(chain_filter(), and_filter())
#> [1] TRUE

Finally, I benchmark them.

# Benchmark
microbenchmark::microbenchmark(chain_filter(), and_filter())
#> Unit: milliseconds
#>            expr      min        lq      mean    median        uq       max
#>  chain_filter() 25.17734  31.24489  39.44092  37.53919  43.51588  78.12492
#>    and_filter() 92.66411 112.06136 130.92834 127.64009 149.17320 206.61777
#>  neval cld
#>    100  a 
#>    100   b

Created on 2019-10-25 by the reprex package (v0.3.0)

In this case, chaining reduces the median run time by about 70% (roughly 38 ms versus 128 ms). Why is this the case? What is going on under the hood in data.table? I haven't seen any warnings against using &, so I was surprised that the difference is so big. Both approaches evaluate the same conditions, so that shouldn't make a difference. If anything, I expected ANDing to be faster: & is a quick operator, and the data.table then only has to be filtered once (using the logical vector that results from the ANDs), as opposed to three times in the chaining case.
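
One way to probe the question, offered as a rough sketch rather than a confirmed explanation: R's & is vectorised and does not short-circuit, so in the AND version each %between% is presumably evaluated over every row, whereas in the chained version the later conditions only see the rows that survived the earlier filters. The snippet below counts how many rows each condition has to scan under each approach. The names dt_small, step1, step2, step3, chain_rows, and and_rows are made up for this illustration, and the table is smaller than the one above to keep it quick.

```r
library(data.table)

set.seed(1)
n <- 1e6  # assumption: smaller than the 1e7 rows above, just to run fast

dt_small <- data.table(a = sample(1:1000, n, replace = TRUE),
                       b = sample(1:1000, n, replace = TRUE),
                       c = sample(1:1000, n, replace = TRUE))

# Chaining: each later condition is evaluated only on the surviving rows
step1 <- dt_small[a %between% c(1, 10)]
step2 <- step1[b %between% c(100, 110)]
step3 <- step2[c %between% c(750, 760)]

# Number of rows each of the three conditions scans when chaining
chain_rows <- c(nrow(dt_small), nrow(step1), nrow(step2))

# ANDing: & is not short-circuiting, so all three conditions scan every row
and_rows <- rep(nrow(dt_small), 3)

print(chain_rows)
print(and_rows)
```

If the chained counts drop sharply after the first filter, as they should with roughly 1% of rows passing each condition, that would be consistent with the benchmark, though it does not rule out other factors such as the cost of allocating and combining full-length logical vectors.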

Bonus question

Does this principle hold for data.table operations in general? Is modularising tasks always a better strategy?