Occasionally, we find novice R programmers building data frames in a for loop, usually by initializing an empty data frame and then iteratively calling rbind. To respond to this inefficient approach, we often cite Patrick Burns' R Inferno - Circle 2: Growing Objects, which emphasizes the hazard of this situation.
In Python pandas (the other open-source data science tool), experts have asserted the quadratic copy and O(N^2) logic (@unutbu here, @Alexander here). Additionally, the docs (see section note) stress the copying problem of datasets, and the wiki explains that Python's list.append does not have the copy problem. I wonder if similar constructs apply to R.
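The quadratic claim can be made concrete with a little arithmetic. As a rough sketch, assuming each rbind call copies every row accumulated so far plus the new chunk (a simplification that ignores R's internals), the loop copies 500 * (1 + 2 + ... + n) rows in total, while a single combine at the end copies 500 * n. The function names below are illustrative only.

```r
# Rows copied when growing a data frame one chunk at a time,
# ASSUMING each rbind() copies the accumulated rows plus the new chunk.
rows_copied_loop <- function(n, chunk = 500) {
  chunk * sum(seq_len(n))      # equals chunk * n * (n + 1) / 2
}

# Rows copied when all chunks are combined once at the end.
rows_copied_once <- function(n, chunk = 500) {
  chunk * n
}

n <- c(50, 100, 500)
data.frame(
  n     = n,
  loop  = sapply(n, rows_copied_loop),
  once  = sapply(n, rows_copied_once),
  ratio = sapply(n, rows_copied_loop) / sapply(n, rows_copied_once)
)
```

Under this model the ratio is (n + 1) / 2, so doubling the number of iterations roughly doubles the relative penalty of growing in a loop.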
Specifically, my question:
- Can timing alone illustrate or quantify the growing object in loop problem? See microbenchmark results below. Burns shows timings to illustrate the computational challenge of creating a sequence.
- Or does memory usage illustrate or quantify the growing object in loop problem? See Rprof results below. Burns cites using Rprof to show memory consumption within code.
- Or is the growing object problem context-specific, with a general rule of thumb to avoid loops in building objects?
Consider the following examples of growing a random data frame, 500 rows per iteration, in a loop and using a list:
# Loop approach: grow the result with one rbind() call per iteration.
grow_df_loop <- function(n) {
  final_df <- data.frame()
  for (i in 1:n) {
    # Generate a fresh chunk of 500 random rows.
    df <- data.frame(
      group = sample(c("sas", "stata", "spss", "python", "r", "julia"), 500, replace=TRUE),
      int = sample(1:15, 500, replace=TRUE),
      num = rnorm(500),
      char = replicate(500, paste(sample(c(LETTERS, letters, c(0:9)), 3, replace=TRUE), collapse="")),
      bool = sample(c(TRUE, FALSE), 500, replace=TRUE),
      date = as.Date(sample(10957:as.integer(Sys.Date()), 500, replace=TRUE), origin="1970-01-01")
    )
    # Append the chunk to everything accumulated so far.
    final_df <- rbind(final_df, df)
  }
  return(final_df)
}
# List approach: collect all chunks in a list, then bind them once.
grow_df_list <- function(n) {
  df_list <- lapply(1:n, function(i)
    # Generate a fresh chunk of 500 random rows.
    data.frame(
      group = sample(c("sas", "stata", "spss", "python", "r", "julia"), 500, replace=TRUE),
      int = sample(1:15, 500, replace=TRUE),
      num = rnorm(500),
      char = replicate(500, paste(sample(c(LETTERS, letters, c(0:9)), 3, replace=TRUE), collapse="")),
      bool = sample(c(TRUE, FALSE), 500, replace=TRUE),
      date = as.Date(sample(10957:as.integer(Sys.Date()), 500, replace=TRUE), origin="1970-01-01")
    )
  )
  # Single rbind() over the whole list.
  final_df <- do.call(rbind, df_list)
  return(final_df)
}
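Both functions above spend much of their time generating random data, which may mask the cost of the combine step. One way to separate the two, sketched here under the assumption that reusing a single pre-built chunk is acceptable for benchmarking, is to build the chunk once and time only the combining. make_chunk, combine_loop, and combine_list are hypothetical helper names; the chunk has fewer columns than the original for brevity and keeps strings as character rather than factor.

```r
# Build one chunk up front so data generation is excluded from the timing.
make_chunk <- function(rows = 500) {
  data.frame(
    group = sample(c("sas", "stata", "spss", "python", "r", "julia"), rows, replace = TRUE),
    int   = sample(1:15, rows, replace = TRUE),
    num   = rnorm(rows),
    stringsAsFactors = FALSE
  )
}

# Grow the result inside the loop: one rbind() per iteration.
combine_loop <- function(chunk, n) {
  out <- data.frame()
  for (i in seq_len(n)) out <- rbind(out, chunk)
  out
}

# Put the chunks in a list and bind once at the end.
combine_list <- function(chunk, n) {
  do.call(rbind, rep(list(chunk), n))
}

set.seed(1)
chunk <- make_chunk()
system.time(res_loop <- combine_loop(chunk, 100))
system.time(res_list <- combine_list(chunk, 100))
```

Because both helpers receive the same chunk, any difference in elapsed time comes from how the rows are combined rather than from sample, paste, or rnorm.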
Timing
Benchmarking by timing confirms the list approach is more efficient at every number of iterations tested. But given reproducible, uniform data examples, can timing results capture the difference in object growth?
library(microbenchmark)
microbenchmark(grow_df_loop(50), grow_df_list(50), times = 5L)
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# grow_df_loop(50) 758.2412 762.3489 809.8988 793.3590 806.4191 929.1256 5 b
# grow_df_list(50) 554.3722 562.1949 577.6891 568.7658 589.8565 613.2560 5 a
microbenchmark(grow_df_loop(100), grow_df_list(100), times = 5L)
# Unit: seconds
# expr min lq mean median uq max neval cld
# grow_df_loop(100) 2.223617 2.225441 2.425668 2.233529 2.677309 2.768447 5 b
# grow_df_list(100) 1.211181 1.255191 1.325670 1.287821 1.396905 1.477252 5 a
microbenchmark(grow_df_loop(500), grow_df_list(500), times = 5L)
# Unit: seconds
# expr min lq mean median uq max neval cld
# grow_df_loop(500) 38.78245 39.74367 41.54976 40.10221 44.36565 44.75483 5 b
# grow_df_list(500) 13.37076 13.90227 14.67498 14.53042 15.49942 16.07203 5 a
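One hedged way to quantify growth from timings alone is to fit a straight line to log(time) against log(n); the slope approximates the exponent k in time ~ n^k. The sketch below uses only the median timings reported above (converted to seconds), so with three points per approach it is indicative at best. growth_exponent is an illustrative helper name.

```r
# Median timings from the microbenchmark runs above, in seconds.
timings <- data.frame(
  n    = c(50, 100, 500),
  loop = c(0.7933590, 2.233529, 40.10221),
  list = c(0.5687658, 1.287821, 14.53042)
)

# Slope of log(time) ~ log(n) estimates the exponent k in time ~ n^k.
growth_exponent <- function(n, time) {
  unname(coef(lm(log(time) ~ log(n)))[2])
}

k_loop <- growth_exponent(timings$n, timings$loop)
k_list <- growth_exponent(timings$n, timings$list)
round(c(loop = k_loop, list = k_list), 2)
```

On these numbers the loop exponent comes out at roughly 1.7 and the list exponent at roughly 1.4. Both include the cost of generating the data, so the exponents describe the whole function rather than rbind alone.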
Memory Usage
Additionally, memory profiling shows the "rbind" memory totals growing considerably with the number of iterations, more markedly with the loop approach than with the list approach. Given a reproducible, uniform example, can the mem.total results capture the difference in object growth? Is there any other approach to use?
Loop Approach
n = 50
utils::Rprof(tmp <- tempfile(), memory.profiling = TRUE)
output_df1 <- grow_df_loop(50)
utils::Rprof(NULL)
summaryRprof(tmp, memory="both")
unlink(tmp)
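To compare runs, it can help to keep only the row of interest from $by.total rather than the full table. The helper below is a sketch, not the call used to produce the results in this question: profile_mem_total is a hypothetical name, it assumes the R build supports memory profiling in Rprof, and it returns NA when the function of interest was never sampled.

```r
# Profile a call and return mem.total (from $by.total) for one function.
profile_mem_total <- function(f, ..., fn_name = "rbind") {
  tmp <- tempfile()
  # Always stop profiling and remove the temp file, even on error.
  on.exit({ utils::Rprof(NULL); unlink(tmp) }, add = TRUE)

  utils::Rprof(tmp, memory.profiling = TRUE)
  f(...)
  utils::Rprof(NULL)

  by_total <- tryCatch(
    utils::summaryRprof(tmp, memory = "both")$by.total,
    error = function(e) NULL   # e.g. no samples were recorded
  )
  if (is.null(by_total)) return(NA_real_)

  # Row names in $by.total carry literal quotes, e.g. "\"rbind\"".
  row <- match(paste0('"', fn_name, '"'), rownames(by_total))
  if (is.na(row)) NA_real_ else by_total[row, "mem.total"]
}

# Possible usage with the functions defined in this question:
# sapply(c(50, 100, 500), function(n) profile_mem_total(grow_df_loop, n))
```

Since Rprof samples at fixed intervals, short runs can miss a function entirely, so repeated runs or larger n give more stable numbers.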
# $by.total
# total.time total.pct mem.total self.time self.pct
# "grow_df_loop" 0.58 100.00 349.1 0.00 0.00
# "data.frame" 0.38 65.52 209.4 0.00 0.00
# "paste" 0.28 48.28 186.4 0.06 10.34
# "FUN" 0.26 44.83 150.8 0.02 3.45
# "lapply" 0.26 44.83 150.8 0.00 0.00
# "replicate" 0.26 44.83 150.8 0.00 0.00
# "sapply" 0.26 44.83 150.8 0.00 0.00
# "sample" 0.20 34.48 131.4 0.08 13.79
# "rbind" 0.20 34.48 139.7 0.00 0.00
# "[<-.factor" 0.12 20.69 66.0 0.10 17.24
# "[<-" 0.12 20.69 66.0 0.00 0.00
# "factor" 0.10 17.24 47.8 0.04 6.90
# "as.data.frame" 0.10 17.24 48.5 0.00 0.00
# "as.data.frame.character" 0.10 17.24 48.5 0.00 0.00
# "order" 0.06 10.34 12.9 0.06 10.34
# "as.vector" 0.04 6.90 38.7 0.04 6.90
# "sample.int" 0.04 6.90 18.7 0.02 3.45
# "as.vector.factor" 0.04 6.90 38.7 0.00 0.00
# "deparse" 0.04 6.90 35.6 0.00 0.00
# "!" 0.02 3.45 18.7 0.02 3.45
# ":" 0.02 3.45 0.0 0.02 3.45
# "anyNA" 0.02 3.45 19.0 0.02 3.45
# "as.POSIXlt.POSIXct" 0.02 3.45 10.1 0.02 3.45
# "c" 0.02 3.45 19.8 0.02 3.45
# "is.na" 0.02 3.45 18.9 0.02 3.45
# "length" 0.02 3.45 13.8 0.02 3.45
# "mode" 0.02 3.45 16.6 0.02 3.45
# "%in%" 0.02 3.45 16.6 0.00 0.00
# ".deparseOpts" 0.02 3.45 19.0 0.00 0.00
# "as.Date" 0.02 3.45 10.1 0.00 0.00
# "as.POSIXlt" 0.02 3.45 10.1 0.00 0.00
# "Sys.Date" 0.02 3.45 10.1 0.00 0.00
#
# $sample.interval
# [1] 0.02
#
# $sampling.time
# [1] 0.58
n = 100
# $by.total
# total.time total.pct mem.total self.time self.pct
# "grow_df_loop" 1.74 98.86 963.0 0.00 0.00
# "rbind" 1.06 60.23 599.3 0.06 3.41
# "data.frame" 0.68 38.64 363.7 0.02 1.14
# "lapply" 0.50 28.41 239.0 0.04 2.27
# "replicate" 0.50 28.41 239.0 0.00 0.00
# "sapply" 0.50 28.41 239.0 0.00 0.00
# "paste" 0.46 26.14 218.4 0.06 3.41
# "FUN" 0.46 26.14 218.4 0.00 0.00
# "factor" 0.44 25.00 249.2 0.24 13.64
# "sample" 0.40 22.73 179.2 0.10 5.68
# "[<-" 0.38 21.59 244.3 0.00 0.00
# "[<-.factor" 0.34 19.32 229.5 0.30 17.05
# "c" 0.26 14.77 136.6 0.26 14.77
# "as.vector" 0.24 13.64 101.2 0.24 13.64
# "as.vector.factor" 0.24 13.64 101.2 0.00 0.00
# "order" 0.14 7.95 87.3 0.14 7.95
# "as.data.frame" 0.14 7.95 87.3 0.00 0.00
# "as.data.frame.character" 0.14 7.95 87.3 0.00 0.00
# "sample.int" 0.10 5.68 28.2 0.10 5.68
# "unique" 0.10 5.68 64.9 0.00 0.00
# "is.na" 0.06 3.41 62.4 0.06 3.41
# "unique.default" 0.04 2.27 42.4 0.04 2.27
# "[<-.Date" 0.04 2.27 14.9 0.00 0.00
# ".Call" 0.02 1.14 0.0 0.02 1.14
# "Make.row.names" 0.02 1.14 0.0 0.02 1.14
# "NextMethod" 0.02 1.14 0.0 0.02 1.14
# "structure" 0.02 1.14 10.3 0.02 1.14
# "unclass" 0.02 1.14 14.9 0.02 1.14
# ".Date" 0.02 1.14 0.0 0.00 0.00
# ".rs.enqueClientEvent" 0.02 1.14 0.0 0.00 0.00
# "as.Date" 0.02 1.14 23.2 0.00 0.00
# "as.Date.character" 0.02 1.14 23.2 0.00 0.00
# "as.Date.numeric" 0.02 1.14 23.2 0.00 0.00
# "charToDate" 0.02 1.14 23.2 0.00 0.00
# "hook" 0.02 1.14 0.0 0.00 0.00
# "is.na.POSIXlt" 0.02 1.14 23.2 0.00 0.00
# "utils::Rprof" 0.02 1.14 0.0 0.00 0.00
#
# $sample.interval
# [1] 0.02
#
# $sampling.time
# [1] 1.76
n = 500
# $by.total
# total.time total.pct mem.total self.time self.pct
# "grow_df_loop" 28.12 100.00 15557.7 0.00 0.00
# "rbind" 25.30 89.97 13418.5 3.06 10.88
# "factor" 8.94 31.79 5026.5 6.98 24.82
# "[<-" 8.72 31.01 4486.9 0.02 0.07
# "[<-.factor" 7.62 27.10 3915.5 7.32 26.03
# "unique" 3.06 10.88 2060.9 0.00 0.00
# "as.vector" 2.96 10.53 1250.1 2.96 10.53
# "as.vector.factor" 2.96 10.53 1250.1 0.00 0.00
# "data.frame" 2.82 10.03 2139.1 0.02 0.07
# "unique.default" 2.30 8.18 1657.9 2.30 8.18
# "replicate" 1.88 6.69 1364.7 0.00 0.00
# "sapply" 1.88 6.69 1364.7 0.00 0.00
# "FUN" 1.84 6.54 1367.2 0.18 0.64
# "lapply" 1.84 6.54 1338.8 0.02 0.07
# "paste" 1.70 6.05 1281.3 0.38 1.35
# "sample" 1.36 4.84 1089.2 0.20 0.71
# "[<-.Date" 1.08 3.84 571.4 0.00 0.00
# "c" 1.04 3.70 688.7 1.04 3.70
# ".Date" 0.96 3.41 488.0 0.34 1.21
# "sample.int" 0.76 2.70 584.2 0.74 2.63
# "as.data.frame" 0.70 2.49 533.6 0.00 0.00
# "as.data.frame.character" 0.64 2.28 476.0 0.00 0.00
# "NextMethod" 0.62 2.20 424.7 0.62 2.20
# "order" 0.60 2.13 475.5 0.50 1.78
# "structure" 0.32 1.14 155.5 0.32 1.14
# "is.na" 0.28 1.00 150.5 0.26 0.92
# "Make.row.names" 0.12 0.43 153.8 0.12 0.43
# "unclass" 0.12 0.43 83.3 0.12 0.43
# "as.Date" 0.10 0.36 120.1 0.02 0.07
# "length" 0.06 0.21 79.2 0.06 0.21
# "seq.int" 0.06 0.21 57.0 0.06 0.21
# "vapply" 0.06 0.21 84.6 0.02 0.07
# ":" 0.04 0.14 1.1 0.04 0.14
# "as.POSIXlt.POSIXct" 0.04 0.14 57.7 0.04 0.14
# "is.factor" 0.04 0.14 0.0 0.04 0.14
# "deparse" 0.04 0.14 55.0 0.02 0.07
# "eval" 0.04 0.14 36.2 0.02 0.07
# "match.arg" 0.04 0.14 25.2 0.02 0.07
# "match.fun" 0.04 0.14 32.4 0.02 0.07
# "as.data.frame.integer" 0.04 0.14 55.0 0.00 0.00
# "as.POSIXlt" 0.04 0.14 57.7 0.00 0.00
# "force" 0.04 0.14 55.0 0.00 0.00
# "make.names" 0.04 0.14 42.1 0.00 0.00
# "Sys.Date" 0.04 0.14 57.7 0.00 0.00
# "!" 0.02 0.07 29.6 0.02 0.07
# "$" 0.02 0.07 2.6 0.02 0.07
# "any" 0.02 0.07 18.3 0.02 0.07
# "as.data.frame.numeric" 0.02 0.07 2.6 0.02 0.07
# "as.data.frame.vector" 0.02 0.07 21.6 0.02 0.07
# "as.list" 0.02 0.07 26.6 0.02 0.07
# "baseenv" 0.02 0.07 25.2 0.02 0.07
# "is.ordered" 0.02 0.07 14.5 0.02 0.07
# "lengths" 0.02 0.07 14.9 0.02 0.07
# "levels" 0.02 0.07 0.0 0.02 0.07
# "mode" 0.02 0.07 30.7 0.02 0.07
# "names" 0.02 0.07 0.0 0.02 0.07
# "rnorm" 0.02 0.07 29.6 0.02 0.07
# "%in%" 0.02 0.07 30.7 0.00 0.00
# "as.Date.character" 0.02 0.07 2.6 0.00 0.00
# "as.Date.numeric" 0.02 0.07 2.6 0.00 0.00
# "as.POSIXct" 0.02 0.07 2.6 0.00 0.00
# "as.POSIXct.POSIXlt" 0.02 0.07 2.6 0.00 0.00
# "charToDate" 0.02 0.07 2.6 0.00 0.00
# "eval.parent" 0.02 0.07 11.0 0.00 0.00
# "is.na.POSIXlt" 0.02 0.07 2.6 0.00 0.00
# "simplify2array" 0.02 0.07 14.9 0.00 0.00
#
# $sample.interval
# [1] 0.02
#
# $sampling.time
# [1] 28.12
List Approach
n = 50
# $by.total
# total.time total.pct mem.total self.time self.pct
# "grow_df_list" 0.40 100 257.0 0.00 0
# "data.frame" 0.32 80 175.6 0.02 5
# "lapply" 0.32 80 175.6 0.02 5
# "FUN" 0.32 80 175.6 0.00 0
# "replicate" 0.24 60 129.6 0.00 0
# "sapply" 0.24 60 129.6 0.00 0
# "paste" 0.22 55 119.2 0.10 25
# "sample" 0.12 30 49.4 0.00 0
# "sample.int" 0.08 20 39.1 0.08 20
# "<Anonymous>" 0.08 20 81.4 0.00 0
# "do.call" 0.08 20 81.4 0.00 0
# "rbind" 0.08 20 81.4 0.00 0
# "factor" 0.06 15 29.7 0.02 5
# "as.data.frame" 0.06 15 29.7 0.00 0
# "as.data.frame.character" 0.06 15 29.7 0.00 0
# "c" 0.04 10 10.3 0.04 10
# "order" 0.04 10 17.3 0.04 10
# "unique.default" 0.04 10 31.1 0.04 10
# "[<-" 0.04 10 50.3 0.00 0
# "unique" 0.04 10 31.1 0.00 0
# ".Date" 0.02 5 27.9 0.02 5
# "[<-.factor" 0.02 5 22.4 0.02 5
# "[<-.Date" 0.02 5 27.9 0.00 0
#
# $sample.interval
# [1] 0.02
#
# $sampling.time
# [1] 0.4
n = 100
# $by.total
# total.time total.pct mem.total self.time self.pct
# "grow_df_list" 1.00 100 620.4 0.00 0
# "data.frame" 0.66 66 401.8 0.00 0
# "FUN" 0.66 66 401.8 0.00 0
# "lapply" 0.66 66 401.8 0.00 0
# "paste" 0.42 42 275.3 0.14 14
# "replicate" 0.42 42 275.3 0.00 0
# "sapply" 0.42 42 275.3 0.00 0
# "rbind" 0.34 34 218.6 0.02 2
# "<Anonymous>" 0.34 34 218.6 0.00 0
# "do.call" 0.34 34 218.6 0.00 0
# "sample" 0.28 28 188.6 0.08 8
# "unique.default" 0.20 20 90.1 0.20 20
# "unique" 0.20 20 90.1 0.00 0
# "as.data.frame" 0.18 18 81.2 0.00 0
# "factor" 0.16 16 81.2 0.02 2
# "as.data.frame.character" 0.16 16 81.2 0.00 0
# "[<-.factor" 0.14 14 112.0 0.14 14
# "sample.int" 0.14 14 96.8 0.14 14
# "[<-" 0.14 14 112.0 0.00 0
# "order" 0.12 12 51.2 0.12 12
# "c" 0.06 6 45.8 0.06 6
# "as.Date" 0.04 4 28.3 0.02 2
# "length" 0.02 2 17.0 0.02 2
# "strptime" 0.02 2 11.2 0.02 2
# "structure" 0.02 2 0.0 0.02 2
# "as.data.frame.integer" 0.02 2 0.0 0.00 0
# "as.Date.character" 0.02 2 11.2 0.00 0
# "as.Date.numeric" 0.02 2 11.2 0.00 0
# "charToDate" 0.02 2 11.2 0.00 0
#
# $sample.interval
# [1] 0.02
#
# $sampling.time
# [1] 1
n = 500
# $by.total
# total.time total.pct mem.total self.time self.pct
# "grow_df_list" 9.40 100.00 5621.8 0.00 0.00
# "rbind" 6.12 65.11 3633.5 0.44 4.68
# "<Anonymous>" 6.12 65.11 3633.5 0.00 0.00
# "do.call" 6.12 65.11 3633.5 0.00 0.00
# "lapply" 3.28 34.89 1988.3 0.34 3.62
# "FUN" 3.28 34.89 1988.3 0.10 1.06
# "data.frame" 3.28 34.89 1988.3 0.02 0.21
# "[<-" 3.28 34.89 2118.4 0.00 0.00
# "[<-.factor" 3.00 31.91 1829.1 3.00 31.91
# "replicate" 2.36 25.11 1422.9 0.00 0.00
# "sapply" 2.36 25.11 1422.9 0.00 0.00
# "unique" 2.32 24.68 1189.9 0.00 0.00
# "paste" 1.98 21.06 1194.2 0.70 7.45
# "unique.default" 1.96 20.85 1017.8 1.96 20.85
# "sample" 1.20 12.77 707.4 0.44 4.68
# "as.data.frame" 0.88 9.36 540.5 0.02 0.21
# "as.data.frame.character" 0.78 8.30 496.2 0.00 0.00
# "factor" 0.72 7.66 444.2 0.06 0.64
# "c" 0.68 7.23 379.6 0.68 7.23
# "order" 0.64 6.81 385.1 0.64 6.81
# "sample.int" 0.40 4.26 233.0 0.38 4.04
# ".Date" 0.28 2.98 289.3 0.10 1.06
# "[<-.Date" 0.28 2.98 289.3 0.00 0.00
# "NextMethod" 0.18 1.91 171.2 0.18 1.91
# "deparse" 0.08 0.85 54.6 0.02 0.21
# "%in%" 0.08 0.85 54.6 0.00 0.00
# "mode" 0.08 0.85 54.6 0.00 0.00
# "length" 0.06 0.64 10.4 0.06 0.64
# "structure" 0.06 0.64 30.8 0.04 0.43
# ".deparseOpts" 0.06 0.64 49.1 0.02 0.21
# "[[" 0.06 0.64 34.2 0.02 0.21
# ":" 0.04 0.43 33.6 0.04 0.43
# "[[.data.frame" 0.04 0.43 22.6 0.04 0.43
# "force" 0.04 0.43 20.0 0.00 0.00
# "as.vector" 0.02 0.21 0.0 0.02 0.21
# "is.na" 0.02 0.21 0.0 0.02 0.21
# "levels" 0.02 0.21 14.6 0.02 0.21
# "make.names" 0.02 0.21 9.4 0.02 0.21
# "pmatch" 0.02 0.21 17.3 0.02 0.21
# "as.data.frame.Date" 0.02 0.21 5.5 0.00 0.00
# "as.data.frame.integer" 0.02 0.21 0.0 0.00 0.00
# "as.data.frame.logical" 0.02 0.21 14.5 0.00 0.00
# "as.data.frame.numeric" 0.02 0.21 13.5 0.00 0.00
# "as.data.frame.vector" 0.02 0.21 17.3 0.00 0.00
# "simplify2array" 0.02 0.21 0.0 0.00 0.00
#
# $sample.interval
# [1] 0.02
#
# $sampling.time
# [1] 9.4
Graphs (using a different call to save $by.total results)