We have a process that currently handles a large amount of data per day: it performs a map-reduce style set of functions and keeps only the output of those functions. The code we currently run looks like this:
lapply(start_times, function(start_time) {
  <get_data>
  <setofoperations>
})
So currently we loop through the start times, which lets us fetch the data for a particular day, analyse it, and output dataframes of results for each output, per day. The set of operations is a series of functions that successively operate on dataframes and return dataframes.
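For concreteness, here is a minimal sketch of the shape of the pipeline; get_data() and summarise_day() are simplified stand-ins for our real functions, not the actual implementation:

get_data <- function(start_time) {
  # stand-in for the real fetch: one day's raw data (large)
  data.frame(ts = start_time, value = rnorm(1e6))
}

summarise_day <- function(day_data) {
  # stand-in for the set of operations: returns a small result dataframe
  data.frame(ts = day_data$ts[1], mean_value = mean(day_data$value))
}

results <- lapply(start_times, function(start_time) {
  day_data <- get_data(start_time)
  summarise_day(day_data)  # only the small summary dataframe is kept
})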
While running this in a Docker container with a memory limit, we often see the process run out of memory when it is dealing with large data (around 250-500 MB per day) over a period of several days, and R isn't able to garbage-collect effectively.
I'm trying an approach of monitoring the process using cAdvisor, and I can see memory spikes, but I'm not really able to work out from that alone where the memory is going.
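To line the container-level graphs up with what R itself reports, one thing I can do is log gc()'s statistics after each day's work (a sketch; gc(full = TRUE) needs R >= 3.5.0 and the log format is just illustrative):

results <- lapply(start_times, function(start_time) {
  out <- summarise_day(get_data(start_time))
  mem <- gc(full = TRUE)  # force a full collection and capture the stats matrix
  message(sprintf("%s: Ncells %.1f Mb used, Vcells %.1f Mb used",
                  format(start_time), mem[1, 2], mem[2, 2]))
  out
})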
If R does lazy garbage collection, then ideally the process should be able to reuse the same memory over and over, so is there something that is not being captured through the gc process?
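For reference, the explicit per-iteration cleanup that I would expect to let R reuse its allocations looks like the sketch below, assuming nothing else keeps a reference to the day's raw data:

lapply(start_times, function(start_time) {
  day_data <- get_data(start_time)
  out <- summarise_day(day_data)
  rm(day_data)      # drop the only reference to the raw data
  invisible(gc())   # collect now rather than lazily
  out
})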
How can an R process reclaim more memory when it is the only (primary) process running in the Docker container?