I'd like to get the same performance from Docker that I get in RStudio. I have Docker Desktop installed on Windows 10 and am using Linux containers. The goal is to containerize R scripts for general use. The R script dtbenchmark.R below (adapted from the data.table benchmark script by Matt Dowle) reproduces the problem I'm having:
library(data.table)

K <- 100L                      # number of distinct values in the "large group" columns
rows <- c(1e7L, 1:7 * 1e8L)    # table sizes to try: 1e7 rows, then 1e8 through 7e8 rows

for (i in seq_along(rows)) {
  tme <- proc.time()           # start the timer for this iteration
  N <- rows[i]
  set.seed(1)
  DT <- data.table(
    id1 = sample(sprintf("id%03d", 1:K), N, TRUE),        # large groups (char)
    id2 = sample(sprintf("id%03d", 1:K), N, TRUE),        # large groups (char)
    id3 = sample(sprintf("id%010d", 1:(N/K)), N, TRUE),   # small groups (char)
    id4 = sample(K, N, TRUE),                             # large groups (int)
    id5 = sample(K, N, TRUE),                             # large groups (int)
    id6 = sample(N/K, N, TRUE),                           # small groups (int)
    v1 = sample(5, N, TRUE),                              # int in range [1,5]
    v2 = sample(5, N, TRUE),                              # int in range [1,5]
    v3 = sample(round(runif(100, max = 100), 4), N, TRUE)) # numeric e.g. 23.5749
  GB <- round(sum(gc()[, 2]) / 1024, 3)   # memory in use after gc(): Mb column converted to GB
  rt <- round(proc.time() - tme, 2)       # rt[3] is the elapsed time in seconds
  print(paste0('i = ', i, ' N = ', N, ' K = ', K, ' GB = ', GB, ' seconds = ', rt[3]), quote = FALSE)
  rm(N, DT, GB, rt)
}
The Dockerfile is
FROM rocker/r-ver:3.4.3
RUN Rscript -e "install.packages('https://cran.r-project.org/src/contrib/Archive/data.table/data.table_1.12.0.tar.gz', repos = NULL, type = 'source')"
COPY . /root
WORKDIR /root
CMD ["Rscript", "dtbenchmark.R"]
In RStudio, dtbenchmark.R completes five iterations of the loop before stopping with an error message:
[1] i = 1 N = 10000000 K = 100 GB = 0.532 seconds = 2.64
[1] i = 2 N = 100000000 K = 100 GB = 4.954 seconds = 44.58
[1] i = 3 N = 200000000 K = 100 GB = 9.868 seconds = 170.53
[1] i = 4 N = 300000000 K = 100 GB = 14.778 seconds = 426.42
[1] i = 5 N = 400000000 K = 100 GB = 19.688 seconds = 1013.77
Error: cannot allocate vector of size 3.7 Gb
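As a rough cross-check of the GB column above, the size of DT can be estimated with back-of-envelope arithmetic. This is only a sketch under stated assumptions: 64-bit R, where a character column holds one 8-byte pointer per element, an integer column 4 bytes per element and a double column 8 bytes per element. It ignores R's string cache and any temporaries created while building the table, so it is a lower bound rather than an exact figure.

```r
# Rough per-row footprint of DT, assuming 64-bit R (an assumption, see above).
bytes_per_row <- 3 * 8 +  # id1, id2, id3: character, one 8-byte pointer each
                 5 * 4 +  # id4, id5, id6, v1, v2: integer, 4 bytes each
                 1 * 8    # v3: double, 8 bytes

rows <- c(1e7, 1:7 * 1e8)  # same table sizes as in dtbenchmark.R

# Estimated size of DT in GiB for each loop iteration
# (string cache and temporaries are not counted).
est_gib <- round(rows * bytes_per_row / 1024^3, 3)
print(data.frame(i = seq_along(rows), N = rows, est_GiB = est_gib))
```

Under these assumptions the estimate is about 4.8 GiB for N = 1e8 and about 19.4 GiB for N = 4e8, in line with the 4.954 and 19.688 reported above, and it would reach roughly 34 GiB for N = 7e8. The 3.7 Gb in the error message is also what a single 8-byte-per-element column of 5e8 rows would need.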
With the Dockerfile and dtbenchmark.R in the same folder, I build the image from that folder in Windows PowerShell with
docker build -t dtbenchmark .
I then run the container, also from Windows PowerShell, with
docker run --rm dtbenchmark:latest
In PowerShell, the container gets through only three iterations of the loop before exiting with no message at all:
[1] i = 1 N = 10000000 K = 100 GB = 0.515 seconds = 2.08
[1] i = 2 N = 100000000 K = 100 GB = 4.937 seconds = 41.3
[1] i = 3 N = 200000000 K = 100 GB = 9.851 seconds = 91.81
My laptop runs 64-bit Windows 10 Enterprise and has 48 GB of RAM. I'm not able to run anything as administrator.
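One way to see how much memory R can actually see inside the container is to read /proc/meminfo at the top of dtbenchmark.R. The following is a minimal sketch, not a definitive diagnostic: `mem_total_gib` is a hypothetical helper written for this post, and it assumes the container exposes /proc/meminfo in the usual Linux format. The figure reflects the Linux environment the container runs in, which is not necessarily a per-container limit.

```r
# Parse the MemTotal line of /proc/meminfo-style text and return the value in GiB.
# `mem_total_gib` is a hypothetical helper, not part of any package.
mem_total_gib <- function(lines) {
  hit <- grep("^MemTotal:", lines, value = TRUE)
  if (length(hit) == 0L) return(NA_real_)          # no MemTotal line found
  kib <- as.numeric(gsub("[^0-9]", "", hit[1]))     # the value is reported in kB
  kib / 1024^2
}

# Inside a Linux container this file normally exists; elsewhere it may not.
if (file.exists("/proc/meminfo")) {
  cat("Memory visible to R (GiB):",
      round(mem_total_gib(readLines("/proc/meminfo")), 2), "\n")
}
```

If the reported figure turned out to be well below the laptop's 48 GB, that would be consistent with the container stopping silently once the table no longer fits, but that is only a hypothesis to test.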