
fast R lookup table

Similar questions have been asked before, but without clear generic answers. (Joseph Adler's experiments are no longer on the web, and his book just says "write an S4 class.")

Assume a large lookup table keyed on multiple index columns, and a modest-sized set of values to look up. Even base R's merge() is very slow. Here is an example:

{
    L <- 100000000  ## 100M rows: two int index columns plus two double columns, ~2.4GB
    lookuptable <- data.frame( i1=sample(1:L), i2=sample(1:L), v1=rnorm(L), v2=rnorm(L) )
    NLUP <- 10      ## look up only NLUP+1 values in the large table
    vali <- sample(1:L, NLUP)
    lookmeup <- data.frame( i1=c(lookuptable[vali,1], -1),  ## -1: one deliberately unmatched row
                            i2=c(lookuptable[vali,2], -1), vA=rnorm(NLUP+1) )
    rm(vali); rm(L)
}

## I want to speed this up---how?
system.time( merge( lookmeup, lookuptable, by.x=c("i1","i2"), by.y=c("i1","i2"),
                    all.x=TRUE, all.y=FALSE, sort=FALSE ) )

(Try it! About 500 seconds on my 2019 iMac.) So what is the recommended way of doing this?
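For what it's worth, data.table can key on two columns directly; a sketch using the objects above (I have not timed it on this exact data, but keyed joins of this shape are typically orders of magnitude faster than merge()):

library(data.table)
DT <- as.data.table(lookuptable)   ## one-time conversion (copies the data.frame)
setkey(DT, i1, i2)                 ## sort once on the two index columns (the expensive step)
Q   <- as.data.table(lookmeup)
res <- DT[Q, on = c("i1", "i2")]   ## binary-search join; the unmatched (-1,-1) row gets NA in v1, v2

Without setkey(), the on= join still works by building a temporary index; keying pays off when the same table is probed repeatedly.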

I could first create unique integer fingerprints from the index columns (for fast comparisons), and then match on a single column. But this is not easy either, because I need to avoid accidental fingerprint collisions, or add extra logic to resolve them.
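One way to sidestep collisions entirely is to use string keys, which match() hashes in C; a sketch using the objects above (building 100M strings is slower and more memory-hungry than an integer fingerprint, but it is collision-free by construction):

## Collision-free "fingerprint": paste the two index columns into one string key.
key.big   <- paste(lookuptable$i1, lookuptable$i2)
key.small <- paste(lookmeup$i1, lookmeup$i2)
idx <- match(key.small, key.big)   ## single hashed lookup; NA for the (-1,-1) sentinel row
res <- cbind(lookmeup, lookuptable[idx, c("v1", "v2")])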

Given integer fingerprints, I could then use data.table with setkey() on the fingerprint column (or can setkey() handle two-column indexes directly? I tried but failed, perhaps because I am not a regular user); or I could write a C function that takes two integer fingerprint columns and returns one.
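In place of a hand-written C fingerprint function, one trick is to pack the two integer columns into a single complex vector: match() and unique() work on any atomic type, including complex, and each double component represents integers up to 2^53 exactly, so the packing is collision-free here. A sketch (I believe this works but have not benchmarked it against a true integer key):

## Pack (i1, i2) into one complex "fingerprint" per row, then match once.
fp.big   <- complex(real = lookuptable$i1, imaginary = lookuptable$i2)
fp.small <- complex(real = lookmeup$i1,    imaginary = lookmeup$i2)
idx <- match(fp.small, fp.big)   ## NA for the (-1,-1) sentinel row
res <- cbind(lookmeup, lookuptable[idx, c("v1", "v2")])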

