Channel: Active questions tagged r - Stack Overflow

Can timing or memory usage illustrate the hazard of growing objects in a loop?

Occasionally, we see novice R programmers build data frames in a for loop, usually by initializing an empty data frame and then calling rbind on each iteration. In response to this inefficient approach, we often cite Patrick Burns' The R Inferno, Circle 2: Growing Objects, which emphasizes the hazard of this pattern.

In Python pandas (the other open-source data science tool), experts have pointed to quadratic copying and O(N^2) behavior (@unutbu here, @Alexander here). Additionally, the docs (see the note in that section) stress the copying cost when appending to datasets, and the wiki explains that Python's list.append does not have the copy problem. I wonder whether similar reasoning applies to R.
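To make the quadratic-copy argument concrete, here is a minimal sketch of the arithmetic in R. It assumes a simplified model in which every rbind call writes out all rows accumulated so far; the function names are hypothetical, and the model ignores whatever else rbind does internally.

```r
# Simplified model (an assumption): binding inside the loop writes a fresh
# result of i * chunk_rows rows at iteration i, so the total is a triangular sum.
rows_written_loop <- function(n, chunk_rows = 500) {
  chunk_rows * n * (n + 1) / 2
}

# Collecting chunks in a list and binding once writes each row a single time.
rows_written_list <- function(n, chunk_rows = 500) {
  chunk_rows * n
}

n <- c(50, 100, 500)
copy_model <- data.frame(
  n     = n,
  loop  = rows_written_loop(n),
  list  = rows_written_list(n),
  ratio = rows_written_loop(n) / rows_written_list(n)  # equals (n + 1) / 2
)
print(copy_model)
```

Under this model the loop writes 637,500 rows for n = 50 while a single bind writes 25,000. The ratio grows linearly with n, which is what the O(N^2) claim amounts to.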

Specifically, my questions:

  • Can timing alone illustrate or quantify the growing-object-in-a-loop problem? See the microbenchmark results below. Burns uses timings to illustrate the computational cost of creating a sequence.
  • Or does memory usage illustrate or quantify the problem? See the Rprof results below. Burns cites using Rprof to show memory consumption within code.
  • Or is the growing-object problem context-specific, with avoiding loops when building objects as only a general rule of thumb?

Consider the following examples, each of which builds a data frame from n randomly generated chunks of 500 rows. The first grows the result in a loop and the second collects the chunks in a list:

grow_df_loop <- function(n) {
  # Start empty and grow the result with rbind on every iteration
  final_df <- data.frame()

  # seq_len(n) avoids the 1:0 trap when n is 0
  for (i in seq_len(n)) {
    # One random 500-row chunk per iteration
    df <- data.frame(
      group = sample(c("sas", "stata", "spss", "python", "r", "julia"), 500, replace=TRUE),
      int = sample(1:15, 500, replace=TRUE),
      num = rnorm(500),
      char = replicate(500, paste(sample(c(LETTERS, letters, c(0:9)), 3, replace=TRUE), collapse="")),
      bool = sample(c(TRUE, FALSE), 500, replace=TRUE),
      date = as.Date(sample(10957:as.integer(Sys.Date()), 500, replace=TRUE), origin="1970-01-01")
    )
    # rbind returns a new data frame holding all rows accumulated so far
    final_df <- rbind(final_df, df)
  }

  return(final_df)
}

grow_df_list <- function(n) {
  # Build every 500-row chunk first and keep them in a list
  df_list <- lapply(seq_len(n), function(i)
    data.frame(
      group = sample(c("sas", "stata", "spss", "python", "r", "julia"), 500, replace=TRUE),
      int = sample(1:15, 500, replace=TRUE),
      num = rnorm(500),
      char = replicate(500, paste(sample(c(LETTERS, letters, c(0:9)), 3, replace=TRUE), collapse="")),
      bool = sample(c(TRUE, FALSE), 500, replace=TRUE),
      date = as.Date(sample(10957:as.integer(Sys.Date()), 500, replace=TRUE), origin="1970-01-01")
    )
  )

  # Bind all chunks in a single rbind call
  final_df <- do.call(rbind, df_list)
  return(final_df)
}
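Both functions above spend part of their time generating random data, which is the same work in either approach. One way to isolate the cost of growth itself might be to generate the chunks once and time only the combining step. The sketch below does that with base R; make_chunk, combine_loop and combine_list are hypothetical helpers, and the chunk has fewer columns than the examples above to keep it short.

```r
# Hypothetical helper: one small random chunk (simplified columns).
make_chunk <- function(rows = 500) {
  data.frame(
    int  = sample(1:15, rows, replace = TRUE),
    num  = rnorm(rows),
    bool = sample(c(TRUE, FALSE), rows, replace = TRUE)
  )
}

# Growing: rbind is called once per chunk on the accumulated result.
combine_loop <- function(chunks) {
  out <- data.frame()
  for (chunk in chunks) {
    out <- rbind(out, chunk)
  }
  out
}

# Collect-then-bind: a single rbind call over the whole list.
combine_list <- function(chunks) {
  do.call(rbind, chunks)
}

set.seed(1)
chunks <- lapply(seq_len(20), function(i) make_chunk())

# Time only the combining step; the chunks already exist.
t_loop <- system.time(res_loop <- combine_loop(chunks))[["elapsed"]]
t_list <- system.time(res_list <- combine_list(chunks))[["elapsed"]]

print(c(loop = t_loop, list = t_list))
```

Because the same chunks feed both functions, any timing difference comes from how they are combined rather than from how they are generated. With only 20 chunks the elapsed times will be small and noisy, so a larger list or repeated runs would be needed for a meaningful comparison.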

Timing

Benchmarking by timing confirms that the list approach is faster at every iteration count tested, and the gap widens as n grows. But given reproducible, uniform example data, can timing results alone capture the cost of object growth?

library(microbenchmark)

microbenchmark(grow_df_loop(50), grow_df_list(50), times = 5L)
# Unit: milliseconds
#              expr      min       lq     mean   median       uq      max neval cld
#  grow_df_loop(50) 758.2412 762.3489 809.8988 793.3590 806.4191 929.1256     5   b
#  grow_df_list(50) 554.3722 562.1949 577.6891 568.7658 589.8565 613.2560     5  a 

microbenchmark(grow_df_loop(100), grow_df_list(100), times = 5L)
# Unit: seconds
#               expr      min       lq     mean   median       uq      max neval cld
#  grow_df_loop(100) 2.223617 2.225441 2.425668 2.233529 2.677309 2.768447     5   b
#  grow_df_list(100) 1.211181 1.255191 1.325670 1.287821 1.396905 1.477252     5  a 

microbenchmark(grow_df_loop(500), grow_df_list(500), times = 5L)
# Unit: seconds
#               expr      min       lq     mean   median       uq      max neval cld
#  grow_df_loop(500) 38.78245 39.74367 41.54976 40.10221 44.36565 44.75483     5   b
#  grow_df_list(500) 13.37076 13.90227 14.67498 14.53042 15.49942 16.07203     5  a
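
One rough way to read these timings is to estimate how run time scales with n by fitting a line on the log-log scale. The sketch below does this with the median values reported above, converted to seconds; scaling_exponent is a hypothetical helper, and three points are far too few for a firm estimate.

```r
# Median timings copied from the microbenchmark output above, in seconds.
n           <- c(50, 100, 500)
median_loop <- c(0.7933590, 2.233529, 40.10221)
median_list <- c(0.5687658, 1.287821, 14.53042)

# Slope of log(time) against log(n): about 1 suggests linear scaling,
# about 2 suggests quadratic scaling.
scaling_exponent <- function(n, seconds) {
  fit <- lm(log(seconds) ~ log(n))
  unname(coef(fit)[2])
}

slope_loop <- scaling_exponent(n, median_loop)
slope_list <- scaling_exponent(n, median_list)

print(c(loop = slope_loop, list = slope_list))
```

On these numbers the fitted exponent is roughly 1.7 for the loop and roughly 1.4 for the list. Neither is cleanly linear or quadratic, probably because each timing mixes data generation with binding.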

Memory Usage

Additionally, memory profiling shows the mem.total attributed to "rbind" growing substantially with the number of iterations, and more markedly in the loop approach than in the list approach. Given a reproducible, uniform example, can mem.total results capture the difference in object growth? Is there another approach I should use?

Loop Approach

n = 50

utils::Rprof(tmp <- tempfile(), memory.profiling = TRUE)
output_df1 <- grow_df_loop(50)
utils::Rprof(NULL)
summaryRprof(tmp, memory="both")
unlink(tmp)

# $by.total
#                           total.time total.pct mem.total self.time self.pct
# "grow_df_loop"                  0.58    100.00     349.1      0.00     0.00
# "data.frame"                    0.38     65.52     209.4      0.00     0.00
# "paste"                         0.28     48.28     186.4      0.06    10.34
# "FUN"                           0.26     44.83     150.8      0.02     3.45
# "lapply"                        0.26     44.83     150.8      0.00     0.00
# "replicate"                     0.26     44.83     150.8      0.00     0.00
# "sapply"                        0.26     44.83     150.8      0.00     0.00
# "sample"                        0.20     34.48     131.4      0.08    13.79
# "rbind"                         0.20     34.48     139.7      0.00     0.00
# "[<-.factor"                    0.12     20.69      66.0      0.10    17.24
# "[<-"                           0.12     20.69      66.0      0.00     0.00
# "factor"                        0.10     17.24      47.8      0.04     6.90
# "as.data.frame"                 0.10     17.24      48.5      0.00     0.00
# "as.data.frame.character"       0.10     17.24      48.5      0.00     0.00
# "order"                         0.06     10.34      12.9      0.06    10.34
# "as.vector"                     0.04      6.90      38.7      0.04     6.90
# "sample.int"                    0.04      6.90      18.7      0.02     3.45
# "as.vector.factor"              0.04      6.90      38.7      0.00     0.00
# "deparse"                       0.04      6.90      35.6      0.00     0.00
# "!"                             0.02      3.45      18.7      0.02     3.45
# ":"                             0.02      3.45       0.0      0.02     3.45
# "anyNA"                         0.02      3.45      19.0      0.02     3.45
# "as.POSIXlt.POSIXct"            0.02      3.45      10.1      0.02     3.45
# "c"                             0.02      3.45      19.8      0.02     3.45
# "is.na"                         0.02      3.45      18.9      0.02     3.45
# "length"                        0.02      3.45      13.8      0.02     3.45
# "mode"                          0.02      3.45      16.6      0.02     3.45
# "%in%"                          0.02      3.45      16.6      0.00     0.00
# ".deparseOpts"                  0.02      3.45      19.0      0.00     0.00
# "as.Date"                       0.02      3.45      10.1      0.00     0.00
# "as.POSIXlt"                    0.02      3.45      10.1      0.00     0.00
# "Sys.Date"                      0.02      3.45      10.1      0.00     0.00
# 
# $sample.interval
# [1] 0.02
# 
# $sampling.time
# [1] 0.58

n = 100

# $by.total
#                           total.time total.pct mem.total self.time self.pct
# "grow_df_loop"                  1.74     98.86     963.0      0.00     0.00
# "rbind"                         1.06     60.23     599.3      0.06     3.41
# "data.frame"                    0.68     38.64     363.7      0.02     1.14
# "lapply"                        0.50     28.41     239.0      0.04     2.27
# "replicate"                     0.50     28.41     239.0      0.00     0.00
# "sapply"                        0.50     28.41     239.0      0.00     0.00
# "paste"                         0.46     26.14     218.4      0.06     3.41
# "FUN"                           0.46     26.14     218.4      0.00     0.00
# "factor"                        0.44     25.00     249.2      0.24    13.64
# "sample"                        0.40     22.73     179.2      0.10     5.68
# "[<-"                           0.38     21.59     244.3      0.00     0.00
# "[<-.factor"                    0.34     19.32     229.5      0.30    17.05
# "c"                             0.26     14.77     136.6      0.26    14.77
# "as.vector"                     0.24     13.64     101.2      0.24    13.64
# "as.vector.factor"              0.24     13.64     101.2      0.00     0.00
# "order"                         0.14      7.95      87.3      0.14     7.95
# "as.data.frame"                 0.14      7.95      87.3      0.00     0.00
# "as.data.frame.character"       0.14      7.95      87.3      0.00     0.00
# "sample.int"                    0.10      5.68      28.2      0.10     5.68
# "unique"                        0.10      5.68      64.9      0.00     0.00
# "is.na"                         0.06      3.41      62.4      0.06     3.41
# "unique.default"                0.04      2.27      42.4      0.04     2.27
# "[<-.Date"                      0.04      2.27      14.9      0.00     0.00
# ".Call"                         0.02      1.14       0.0      0.02     1.14
# "Make.row.names"                0.02      1.14       0.0      0.02     1.14
# "NextMethod"                    0.02      1.14       0.0      0.02     1.14
# "structure"                     0.02      1.14      10.3      0.02     1.14
# "unclass"                       0.02      1.14      14.9      0.02     1.14
# ".Date"                         0.02      1.14       0.0      0.00     0.00
# ".rs.enqueClientEvent"          0.02      1.14       0.0      0.00     0.00
# "as.Date"                       0.02      1.14      23.2      0.00     0.00
# "as.Date.character"             0.02      1.14      23.2      0.00     0.00
# "as.Date.numeric"               0.02      1.14      23.2      0.00     0.00
# "charToDate"                    0.02      1.14      23.2      0.00     0.00
# "hook"                          0.02      1.14       0.0      0.00     0.00
# "is.na.POSIXlt"                 0.02      1.14      23.2      0.00     0.00
# "utils::Rprof"                  0.02      1.14       0.0      0.00     0.00
# 
# $sample.interval
# [1] 0.02
# 
# $sampling.time
# [1] 1.76

n = 500

# $by.total
#                           total.time total.pct mem.total self.time self.pct
# "grow_df_loop"                 28.12    100.00   15557.7      0.00     0.00
# "rbind"                        25.30     89.97   13418.5      3.06    10.88
# "factor"                        8.94     31.79    5026.5      6.98    24.82
# "[<-"                           8.72     31.01    4486.9      0.02     0.07
# "[<-.factor"                    7.62     27.10    3915.5      7.32    26.03
# "unique"                        3.06     10.88    2060.9      0.00     0.00
# "as.vector"                     2.96     10.53    1250.1      2.96    10.53
# "as.vector.factor"              2.96     10.53    1250.1      0.00     0.00
# "data.frame"                    2.82     10.03    2139.1      0.02     0.07
# "unique.default"                2.30      8.18    1657.9      2.30     8.18
# "replicate"                     1.88      6.69    1364.7      0.00     0.00
# "sapply"                        1.88      6.69    1364.7      0.00     0.00
# "FUN"                           1.84      6.54    1367.2      0.18     0.64
# "lapply"                        1.84      6.54    1338.8      0.02     0.07
# "paste"                         1.70      6.05    1281.3      0.38     1.35
# "sample"                        1.36      4.84    1089.2      0.20     0.71
# "[<-.Date"                      1.08      3.84     571.4      0.00     0.00
# "c"                             1.04      3.70     688.7      1.04     3.70
# ".Date"                         0.96      3.41     488.0      0.34     1.21
# "sample.int"                    0.76      2.70     584.2      0.74     2.63
# "as.data.frame"                 0.70      2.49     533.6      0.00     0.00
# "as.data.frame.character"       0.64      2.28     476.0      0.00     0.00
# "NextMethod"                    0.62      2.20     424.7      0.62     2.20
# "order"                         0.60      2.13     475.5      0.50     1.78
# "structure"                     0.32      1.14     155.5      0.32     1.14
# "is.na"                         0.28      1.00     150.5      0.26     0.92
# "Make.row.names"                0.12      0.43     153.8      0.12     0.43
# "unclass"                       0.12      0.43      83.3      0.12     0.43
# "as.Date"                       0.10      0.36     120.1      0.02     0.07
# "length"                        0.06      0.21      79.2      0.06     0.21
# "seq.int"                       0.06      0.21      57.0      0.06     0.21
# "vapply"                        0.06      0.21      84.6      0.02     0.07
# ":"                             0.04      0.14       1.1      0.04     0.14
# "as.POSIXlt.POSIXct"            0.04      0.14      57.7      0.04     0.14
# "is.factor"                     0.04      0.14       0.0      0.04     0.14
# "deparse"                       0.04      0.14      55.0      0.02     0.07
# "eval"                          0.04      0.14      36.2      0.02     0.07
# "match.arg"                     0.04      0.14      25.2      0.02     0.07
# "match.fun"                     0.04      0.14      32.4      0.02     0.07
# "as.data.frame.integer"         0.04      0.14      55.0      0.00     0.00
# "as.POSIXlt"                    0.04      0.14      57.7      0.00     0.00
# "force"                         0.04      0.14      55.0      0.00     0.00
# "make.names"                    0.04      0.14      42.1      0.00     0.00
# "Sys.Date"                      0.04      0.14      57.7      0.00     0.00
# "!"                             0.02      0.07      29.6      0.02     0.07
# "$"                             0.02      0.07       2.6      0.02     0.07
# "any"                           0.02      0.07      18.3      0.02     0.07
# "as.data.frame.numeric"         0.02      0.07       2.6      0.02     0.07
# "as.data.frame.vector"          0.02      0.07      21.6      0.02     0.07
# "as.list"                       0.02      0.07      26.6      0.02     0.07
# "baseenv"                       0.02      0.07      25.2      0.02     0.07
# "is.ordered"                    0.02      0.07      14.5      0.02     0.07
# "lengths"                       0.02      0.07      14.9      0.02     0.07
# "levels"                        0.02      0.07       0.0      0.02     0.07
# "mode"                          0.02      0.07      30.7      0.02     0.07
# "names"                         0.02      0.07       0.0      0.02     0.07
# "rnorm"                         0.02      0.07      29.6      0.02     0.07
# "%in%"                          0.02      0.07      30.7      0.00     0.00
# "as.Date.character"             0.02      0.07       2.6      0.00     0.00
# "as.Date.numeric"               0.02      0.07       2.6      0.00     0.00
# "as.POSIXct"                    0.02      0.07       2.6      0.00     0.00
# "as.POSIXct.POSIXlt"            0.02      0.07       2.6      0.00     0.00
# "charToDate"                    0.02      0.07       2.6      0.00     0.00
# "eval.parent"                   0.02      0.07      11.0      0.00     0.00
# "is.na.POSIXlt"                 0.02      0.07       2.6      0.00     0.00
# "simplify2array"                0.02      0.07      14.9      0.00     0.00
# 
# $sample.interval
# [1] 0.02
# 
# $sampling.time
# [1] 28.12

List Approach

n = 50

# $by.total
#                           total.time total.pct mem.total self.time self.pct
# "grow_df_list"                  0.40       100     257.0      0.00        0
# "data.frame"                    0.32        80     175.6      0.02        5
# "lapply"                        0.32        80     175.6      0.02        5
# "FUN"                           0.32        80     175.6      0.00        0
# "replicate"                     0.24        60     129.6      0.00        0
# "sapply"                        0.24        60     129.6      0.00        0
# "paste"                         0.22        55     119.2      0.10       25
# "sample"                        0.12        30      49.4      0.00        0
# "sample.int"                    0.08        20      39.1      0.08       20
# "<Anonymous>"                   0.08        20      81.4      0.00        0
# "do.call"                       0.08        20      81.4      0.00        0
# "rbind"                         0.08        20      81.4      0.00        0
# "factor"                        0.06        15      29.7      0.02        5
# "as.data.frame"                 0.06        15      29.7      0.00        0
# "as.data.frame.character"       0.06        15      29.7      0.00        0
# "c"                             0.04        10      10.3      0.04       10
# "order"                         0.04        10      17.3      0.04       10
# "unique.default"                0.04        10      31.1      0.04       10
# "[<-"                           0.04        10      50.3      0.00        0
# "unique"                        0.04        10      31.1      0.00        0
# ".Date"                         0.02         5      27.9      0.02        5
# "[<-.factor"                    0.02         5      22.4      0.02        5
# "[<-.Date"                      0.02         5      27.9      0.00        0
# 
# $sample.interval
# [1] 0.02
# 
# $sampling.time
# [1] 0.4

n = 100

# $by.total
#                           total.time total.pct mem.total self.time self.pct
# "grow_df_list"                  1.00       100     620.4      0.00        0
# "data.frame"                    0.66        66     401.8      0.00        0
# "FUN"                           0.66        66     401.8      0.00        0
# "lapply"                        0.66        66     401.8      0.00        0
# "paste"                         0.42        42     275.3      0.14       14
# "replicate"                     0.42        42     275.3      0.00        0
# "sapply"                        0.42        42     275.3      0.00        0
# "rbind"                         0.34        34     218.6      0.02        2
# "<Anonymous>"                   0.34        34     218.6      0.00        0
# "do.call"                       0.34        34     218.6      0.00        0
# "sample"                        0.28        28     188.6      0.08        8
# "unique.default"                0.20        20      90.1      0.20       20
# "unique"                        0.20        20      90.1      0.00        0
# "as.data.frame"                 0.18        18      81.2      0.00        0
# "factor"                        0.16        16      81.2      0.02        2
# "as.data.frame.character"       0.16        16      81.2      0.00        0
# "[<-.factor"                    0.14        14     112.0      0.14       14
# "sample.int"                    0.14        14      96.8      0.14       14
# "[<-"                           0.14        14     112.0      0.00        0
# "order"                         0.12        12      51.2      0.12       12
# "c"                             0.06         6      45.8      0.06        6
# "as.Date"                       0.04         4      28.3      0.02        2
# "length"                        0.02         2      17.0      0.02        2
# "strptime"                      0.02         2      11.2      0.02        2
# "structure"                     0.02         2       0.0      0.02        2
# "as.data.frame.integer"         0.02         2       0.0      0.00        0
# "as.Date.character"             0.02         2      11.2      0.00        0
# "as.Date.numeric"               0.02         2      11.2      0.00        0
# "charToDate"                    0.02         2      11.2      0.00        0
# 
# $sample.interval
# [1] 0.02
# 
# $sampling.time
# [1] 1

n = 500

# $by.total
#                           total.time total.pct mem.total self.time self.pct
# "grow_df_list"                  9.40    100.00    5621.8      0.00     0.00
# "rbind"                         6.12     65.11    3633.5      0.44     4.68
# "<Anonymous>"                   6.12     65.11    3633.5      0.00     0.00
# "do.call"                       6.12     65.11    3633.5      0.00     0.00
# "lapply"                        3.28     34.89    1988.3      0.34     3.62
# "FUN"                           3.28     34.89    1988.3      0.10     1.06
# "data.frame"                    3.28     34.89    1988.3      0.02     0.21
# "[<-"                           3.28     34.89    2118.4      0.00     0.00
# "[<-.factor"                    3.00     31.91    1829.1      3.00    31.91
# "replicate"                     2.36     25.11    1422.9      0.00     0.00
# "sapply"                        2.36     25.11    1422.9      0.00     0.00
# "unique"                        2.32     24.68    1189.9      0.00     0.00
# "paste"                         1.98     21.06    1194.2      0.70     7.45
# "unique.default"                1.96     20.85    1017.8      1.96    20.85
# "sample"                        1.20     12.77     707.4      0.44     4.68
# "as.data.frame"                 0.88      9.36     540.5      0.02     0.21
# "as.data.frame.character"       0.78      8.30     496.2      0.00     0.00
# "factor"                        0.72      7.66     444.2      0.06     0.64
# "c"                             0.68      7.23     379.6      0.68     7.23
# "order"                         0.64      6.81     385.1      0.64     6.81
# "sample.int"                    0.40      4.26     233.0      0.38     4.04
# ".Date"                         0.28      2.98     289.3      0.10     1.06
# "[<-.Date"                      0.28      2.98     289.3      0.00     0.00
# "NextMethod"                    0.18      1.91     171.2      0.18     1.91
# "deparse"                       0.08      0.85      54.6      0.02     0.21
# "%in%"                          0.08      0.85      54.6      0.00     0.00
# "mode"                          0.08      0.85      54.6      0.00     0.00
# "length"                        0.06      0.64      10.4      0.06     0.64
# "structure"                     0.06      0.64      30.8      0.04     0.43
# ".deparseOpts"                  0.06      0.64      49.1      0.02     0.21
# "[["                            0.06      0.64      34.2      0.02     0.21
# ":"                             0.04      0.43      33.6      0.04     0.43
# "[[.data.frame"                 0.04      0.43      22.6      0.04     0.43
# "force"                         0.04      0.43      20.0      0.00     0.00
# "as.vector"                     0.02      0.21       0.0      0.02     0.21
# "is.na"                         0.02      0.21       0.0      0.02     0.21
# "levels"                        0.02      0.21      14.6      0.02     0.21
# "make.names"                    0.02      0.21       9.4      0.02     0.21
# "pmatch"                        0.02      0.21      17.3      0.02     0.21
# "as.data.frame.Date"            0.02      0.21       5.5      0.00     0.00
# "as.data.frame.integer"         0.02      0.21       0.0      0.00     0.00
# "as.data.frame.logical"         0.02      0.21      14.5      0.00     0.00
# "as.data.frame.numeric"         0.02      0.21      13.5      0.00     0.00
# "as.data.frame.vector"          0.02      0.21      17.3      0.00     0.00
# "simplify2array"                0.02      0.21       0.0      0.00     0.00
# 
# $sample.interval
# [1] 0.02
# 
# $sampling.time
# [1] 9.4
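
To compare the two approaches side by side, the mem.total values for "rbind" and for the top-level function can be pulled out of the profiles above and put into one table. The sketch below copies those values by hand; the column names are my own, and my understanding is that summaryRprof reports mem.total in MB.

```r
# mem.total values copied from the $by.total tables above.
rbind_mem <- data.frame(
  n          = c(50, 100, 500),
  loop_rbind = c(139.7, 599.3, 13418.5),
  list_rbind = c(81.4, 218.6, 3633.5),
  loop_total = c(349.1, 963.0, 15557.7),
  list_total = c(257.0, 620.4, 5621.8)
)

# How much more memory rbind accounts for in the loop than in the list approach.
rbind_mem$loop_to_list <- rbind_mem$loop_rbind / rbind_mem$list_rbind

# Share of each function's total that is attributed to rbind.
rbind_mem$loop_share <- rbind_mem$loop_rbind / rbind_mem$loop_total
rbind_mem$list_share <- rbind_mem$list_rbind / rbind_mem$list_total

print(round(rbind_mem, 2))
```

The loop-to-list ratio for rbind rises across the three values of n, and rbind takes a larger share of the total in the loop approach at each n. That is consistent with the growth cost being concentrated in rbind, though it comes from a single profiling run per setting.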

Graphs (using a different call to save $by.total results)

[Figures: Profile Results Rbind Plot; Profile Results Data.Frame Plot; Profile Results Full Function Plot]

