Similar questions have been asked before, but without clear generic answers. (And Joseph Adler's experiments are no longer on the web, and his book just says "write an S4 class.")
Assume a large lookup table with multiple index columns, and a modest-sized set of values to look up. Even an R merge() is very slow. Here is an example:
{
    L <- 100000000  ## only 100M entries for 1GB*4 of int data
    lookuptable <- data.frame( i1=sample(1:L), i2=sample(1:L), v1=rnorm(L), v2=rnorm(L) )
    NLUP <- 10  ## look up only 10+1 values in the large table
    vali <- sample(1:L, NLUP)
    lookmeup <- data.frame( i1=c(lookuptable[vali,1], -1),
                            i2=c(lookuptable[vali,2], -1), vA=rnorm(NLUP+1) )
    rm(vali); rm(L)
}
## I want to speed this up---how?
system.time( merge( lookmeup, lookuptable, by.x=c("i1","i2"), by.y=c("i1","i2"),
                    all.x=TRUE, all.y=FALSE, sort=FALSE ) )
(Try it! It takes about 500 seconds on my 2019 iMac.) So what is the recommended way of doing this?
I could first create unique integer fingerprints from the two index columns (for fast comparisons), and then match on that single column. But this is not easy either, because I need to avoid accidental duplicate fingerprints, or add extra logic to handle collisions. A rough sketch of the one-column match idea is below.
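For example, a collision-free variant of the same idea in base R (a minimal sketch, assuming the example data frames above; it uses character keys instead of integer fingerprints, which rules out collisions but costs time and memory to build the key vectors):

{
    ## build one key column per table, so match() only has to compare one column
    key.big   <- paste(lookuptable$i1, lookuptable$i2, sep=":")
    key.small <- paste(lookmeup$i1, lookmeup$i2, sep=":")
    idx <- match(key.small, key.big)   ## NA for the unmatched (-1,-1) row
    result <- cbind(lookmeup, lookuptable[idx, c("v1","v2")])
}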
Given integer fingerprints, I could then use data.table with setkey on the fingerprint column (or can data.table handle a two-column key directly? I tried but failed, perhaps because I am not a regular user); or I could write a C program that takes the two integer fingerprint columns and returns one. A sketch of a two-column data.table key is below.