Quantcast
Channel: Active questions tagged r - Stack Overflow
Viewing all articles
Browse latest Browse all 201839

Fast calculation of CDF / rolling join on multiple columns

$
0
0

I'm trying to measure the empirical cumulative distribution of some data in a multivariate setting. That is, given a dataset like

library(data.table)  # v 1.9.7

set.seed(2016)
dt <- data.table(x=rnorm(1000), y=rnorm(1000), z=rnorm(1000))
dt
             x        y       z
   1: -0.91474  2.07025 -1.7499
   2:  1.00125 -1.80941 -1.3856
   3: -0.05642  1.58499  0.8110
   4:  0.29665 -1.16660  0.3757
   5: -2.79147 -1.75526  1.2851
  ---                          
 996:  0.63423  0.13597 -2.3710
 997:  0.21415  1.03161 -1.5440
 998:  1.15357 -1.63713  0.4191
 999:  0.79205 -0.56119  0.6670
1000:  0.19502 -0.05297 -0.3288

I want to count the number of samples such that (x <= X, y <= Y, z <= Z) for some grid of (X, Y, Z) upper bounds like

bounds <- CJ(X=seq(-2, 2, by=.1), Y=seq(-2, 2, by=.1), Z=seq(-2, 2, by=.1))
bounds
        X  Y    Z
    1: -2 -2 -2.0
    2: -2 -2 -1.9
    3: -2 -2 -1.8
    4: -2 -2 -1.7
    5: -2 -2 -1.6
   ---           
68917:  2  2  1.6
68918:  2  2  1.7
68919:  2  2  1.8
68920:  2  2  1.9
68921:  2  2  2.0

Now, I've figured out that I can elegantly do this (using non-equi joins)

dt[, Count := 1]
result <- dt[bounds, on=c("x<=X", "y<=Y", "z<=Z"), allow.cartesian=TRUE][, list(N.cum = sum(!is.na(Count))), keyby=list(X=x, Y=y, Z=z)]
result[, CDF := N.cum/nrow(dt)]
result
        X  Y    Z N.cum   CDF
    1: -2 -2 -2.0     0 0.000
    2: -2 -2 -1.9     0 0.000
    3: -2 -2 -1.8     0 0.000
    4: -2 -2 -1.7     0 0.000
    5: -2 -2 -1.6     0 0.000
   ---                       
68917:  2  2  1.6   899 0.899
68918:  2  2  1.7   909 0.909
68919:  2  2  1.8   917 0.917
68920:  2  2  1.9   924 0.924
68921:  2  2  2.0   929 0.929

But this method is really inefficient and gets very slow as I start increasing the bin count. I think a multivariate version of data.table's rolling join functionality would do the trick, but that's not possible to my knowledge. Any suggestions to speed this up?


Viewing all articles
Browse latest Browse all 201839

Trending Articles



<script src="https://jsc.adskeeper.com/r/s/rssing.com.1596347.js" async> </script>