Once a function's environment contains large objects, serializing that environment (even when the objects are not needed by the workers) adds substantial overhead to parallelization. Is there an effective way to use parallelization from within a function? I've tried the future package, but I need persistent workers and would rather stick with base R if feasible. Example:
test <- function() {
  clct <- parallel::makeCluster(4)

  # first call: the enclosing environment is still small,
  # so shipping the trivial function is cheap
  a <- Sys.time()
  parallel::clusterCall(clct, function(x) 1)
  print(Sys.time() - a)

  # put a large object into the function environment
  big <- matrix(rnorm(8000000))

  # second call: the same trivial function is now slow to send
  a <- Sys.time()
  parallel::clusterCall(clct, function(x) 1)
  print(Sys.time() - a)

  parallel::stopCluster(clct)
}
test()
Time difference of 0.0009980202 secs
Time difference of 0.8078392 secs
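The overhead seems to track the size of the captured environment. One workaround I've experimented with (a sketch, assuming the slowdown really is caused by serializing the anonymous function's enclosing frame) is to reset the function's environment before shipping it, since the global environment is serialized only by reference:

```r
test_envreset <- function() {
  clct <- parallel::makeCluster(4)
  big <- matrix(rnorm(8000000))  # large object in the function environment

  f <- function(x) 1
  environment(f) <- globalenv()  # drop the reference to this frame

  a <- Sys.time()
  parallel::clusterCall(clct, f) # `big` is no longer captured
  print(Sys.time() - a)

  parallel::stopCluster(clct)
}
test_envreset()
```

The obvious cost is that f can no longer see any variables from the calling frame, which is why this doesn't directly solve the y case below.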
If I simply put the lines that call the cluster into their own function defined in the global environment, this works fine; but as soon as I pass anything from the test function's environment (in this case, y = 4), it's broken again:
f1 <- function(x, y) {
  a <- Sys.time()
  parallel::clusterCall(x, function(x) y)  # captures f1's frame to find y
  print(Sys.time() - a)
}

test2 <- function() {
  clct <- parallel::makeCluster(4)
  f1(clct, 4)
  big <- matrix(rnorm(8000000))
  f1(clct, 4)  # slow again once `big` exists
  parallel::stopCluster(clct)
}
test2()
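One variant I've tried (again a sketch, assuming that arguments passed through clusterCall's ... are serialized on their own rather than via the closure's environment) is to ship y as a plain argument and cut the worker function loose from the calling frame. The helper names f2/test3 are mine, purely for illustration:

```r
f2 <- function(cl, y) {
  g <- function(y) y
  environment(g) <- globalenv()   # g no longer references this frame
  a <- Sys.time()
  parallel::clusterCall(cl, g, y) # y travels as an ordinary argument
  print(Sys.time() - a)
}

test3 <- function() {
  clct <- parallel::makeCluster(4)
  f2(clct, 4)
  big <- matrix(rnorm(8000000))
  f2(clct, 4)                     # large object no longer captured
  parallel::stopCluster(clct)
}
test3()
```

This avoids the slowdown in my tests, but restructuring every call site like this is exactly the kind of boilerplate I was hoping to avoid, hence the question.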