I'm trying to measure the empirical cumulative distribution of some data in a multivariate setting. That is, given a dataset like
library(data.table) # v 1.9.7
set.seed(2016)
dt <- data.table(x=rnorm(1000), y=rnorm(1000), z=rnorm(1000))
dt
             x        y       z
   1: -0.91474  2.07025 -1.7499
   2:  1.00125 -1.80941 -1.3856
   3: -0.05642  1.58499  0.8110
   4:  0.29665 -1.16660  0.3757
   5: -2.79147 -1.75526  1.2851
  ---
 996:  0.63423  0.13597 -2.3710
 997:  0.21415  1.03161 -1.5440
 998:  1.15357 -1.63713  0.4191
 999:  0.79205 -0.56119  0.6670
1000:  0.19502 -0.05297 -0.3288
I want to count the number of samples such that (x <= X, y <= Y, z <= Z) for some grid of (X, Y, Z) upper bounds like
bounds <- CJ(X=seq(-2, 2, by=.1), Y=seq(-2, 2, by=.1), Z=seq(-2, 2, by=.1))
bounds
        X  Y    Z
    1: -2 -2 -2.0
    2: -2 -2 -1.9
    3: -2 -2 -1.8
    4: -2 -2 -1.7
    5: -2 -2 -1.6
   ---
68917:  2  2  1.6
68918:  2  2  1.7
68919:  2  2  1.8
68920:  2  2  1.9
68921:  2  2  2.0
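For any single row of this grid, the count I'm after is just a filtered row count. A brute-force check looks like this (the bound (0, 0, 0) is purely illustrative; looping this over all 68,921 rows of bounds would work, but far too slowly):

# direct count of samples below one (X, Y, Z) bound, here (0, 0, 0)
dt[x <= 0 & y <= 0 & z <= 0, .N]
# and the corresponding empirical CDF value
dt[x <= 0 & y <= 0 & z <= 0, .N] / nrow(dt)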
Now, I've figured out that I can do this elegantly using non-equi joins:
dt[, Count := 1]  # dummy flag: bound rows with no match will show NA here after the join
result <- dt[bounds, on = c("x<=X", "y<=Y", "z<=Z"), allow.cartesian = TRUE][
  # after the non-equi join, columns x, y, z hold the bound values from `bounds`
  , list(N.cum = sum(!is.na(Count))), keyby = list(X = x, Y = y, Z = z)]
result[, CDF := N.cum/nrow(dt)]
result
        X  Y    Z N.cum   CDF
    1: -2 -2 -2.0     0 0.000
    2: -2 -2 -1.9     0 0.000
    3: -2 -2 -1.8     0 0.000
    4: -2 -2 -1.7     0 0.000
    5: -2 -2 -1.6     0 0.000
   ---
68917:  2  2  1.6   899 0.899
68918:  2  2  1.7   909 0.909
68919:  2  2  1.8   917 0.917
68920:  2  2  1.9   924 0.924
68921:  2  2  2.0   929 0.929
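As a quick sanity check (an illustrative snippet, not part of the pipeline above), the joined counts agree with the direct definition at any single grid point, e.g. the last one:

tail(result, 1)                   # grid point (2, 2, 2): N.cum = 929, CDF = 0.929
dt[x <= 2 & y <= 2 & z <= 2, .N]  # direct count at the same bound; should also give 929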
But this method is really inefficient and gets very slow as I increase the bin count. I think a multivariate version of data.table's rolling join functionality would do the trick, but to my knowledge that doesn't exist. Any suggestions to speed this up?
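For reference, here's a rough way to time the current approach on this 41 x 41 x 41 grid (an illustrative timing sketch; the numbers will obviously depend on hardware and the data.table version):

# time the non-equi-join approach on the current grid; a finer grid,
# e.g. CJ(X = seq(-2, 2, by = .05), ...), makes this blow up quickly
system.time(
  dt[bounds, on = c("x<=X", "y<=Y", "z<=Z"), allow.cartesian = TRUE][
    , list(N.cum = sum(!is.na(Count))), keyby = list(X = x, Y = y, Z = z)]
)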