I have a dataframe with a bunch of start and end dates. I am looping through a list of dates and counting how many rows in my dataframe are 'open' on each date in the list (i.e. the start date has happened but the end date hasn't).
I am currently doing this using lapply, but I was wondering if it could be done in dplyr instead, and if there is any benefit in terms of memory and speed (the actual dataframe is 1.5M rows).
RollingDateRange <- seq(Sys.Date()-15, Sys.Date(), by="days")
temp <- data.frame(RollingDateRange)
dat <- data.frame(
  Order = c(1,1,1,2,2,2,3,3,3),
  Code = c("Green","Yellow","Blue","Yellow","Yellow","Red","Purple","Green","Blue"),
  Start.Date = as.Date(c("2020-02-01","2020-02-02","2020-02-03","2020-02-01","2020-02-02","2020-02-03","2020-02-01","2020-02-02","2020-02-03")),
  End.Date = as.Date(c("2020-02-02","2020-02-08",NA,"2020-02-07","2020-02-06",NA,"2020-02-03","2020-02-08","2020-02-06")),
  Count = c(1,1,1,1,1,1,1,1,1),
  stringsAsFactors = FALSE)
# For each date d, keep the rows that have started and have not yet ended
# (a missing End.Date means the row is still open), then sum their Count
temp$Count <- lapply(temp$RollingDateRange, function(d){
  b <- dat[((dat$Start.Date <= d) & (dat$End.Date >= d)) | ((dat$Start.Date <= d) & (is.na(dat$End.Date))),]
  total <- sum(b$Count, na.rm = TRUE)
})
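Side note on the lapply call above: lapply returns a list, so temp$Count ends up as a list column rather than a numeric one. A minimal sketch of the same loop with vapply, which returns a plain numeric vector, is below. The fixed date range (instead of Sys.Date()) and the trimmed-down dat are only there to make the sketch reproducible.

```r
# Fixed range instead of Sys.Date() so the result is reproducible (assumption)
RollingDateRange <- seq(as.Date("2020-01-25"), as.Date("2020-02-09"), by = "days")
temp <- data.frame(RollingDateRange)

# Only the columns the count actually needs
dat <- data.frame(
  Start.Date = as.Date(c("2020-02-01","2020-02-02","2020-02-03","2020-02-01","2020-02-02","2020-02-03","2020-02-01","2020-02-02","2020-02-03")),
  End.Date = as.Date(c("2020-02-02","2020-02-08",NA,"2020-02-07","2020-02-06",NA,"2020-02-03","2020-02-08","2020-02-06")),
  Count = rep(1, 9))

# vapply checks that every result is a single number and returns a numeric vector
temp$Count <- vapply(temp$RollingDateRange, function(d) {
  # open = started on or before d, and either no end date or end date on or after d
  is_open <- dat$Start.Date <= d & (is.na(dat$End.Date) | dat$End.Date >= d)
  sum(dat$Count[is_open], na.rm = TRUE)
}, numeric(1))
```

This still scans every row of dat once per date, so I would not expect it to be any faster than the lapply version; it only changes the type of the result.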
Output:
> temp
   RollingDateRange Count
1        2020-01-25     0
2        2020-01-26     0
3        2020-01-27     0
4        2020-01-28     0
5        2020-01-29     0
6        2020-01-30     0
7        2020-01-31     0
8        2020-02-01     3
9        2020-02-02     6
10       2020-02-03     8
11       2020-02-04     7
12       2020-02-05     7
13       2020-02-06     7
14       2020-02-07     5
15       2020-02-08     4
16       2020-02-09     2
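For comparison, here is one possible way to get the same counts without subsetting the dataframe once per date. It is a sketch, not a benchmark: the number open on date d is the total Count started on or before d minus the total Count that ended strictly before d. The helper cum_count is a made-up name, the date range is fixed for reproducibility, and it assumes Start.Date is never missing and End.Date is never earlier than Start.Date.

```r
# Fixed range instead of Sys.Date() so the result is reproducible (assumption)
RollingDateRange <- seq(as.Date("2020-01-25"), as.Date("2020-02-09"), by = "days")

dat <- data.frame(
  Start.Date = as.Date(c("2020-02-01","2020-02-02","2020-02-03","2020-02-01","2020-02-02","2020-02-03","2020-02-01","2020-02-02","2020-02-03")),
  End.Date = as.Date(c("2020-02-02","2020-02-08",NA,"2020-02-07","2020-02-06",NA,"2020-02-03","2020-02-08","2020-02-06")),
  Count = rep(1, 9))

# Hypothetical helper: for each d, the summed weights of all dates <= d
# (or < d when strict = TRUE). Missing dates are dropped and never counted.
cum_count <- function(dates, weights, d, strict = FALSE) {
  keep <- !is.na(dates)
  ord <- order(dates[keep])
  sorted <- as.numeric(dates[keep][ord])
  running <- c(0, cumsum(weights[keep][ord]))
  # findInterval gives how many sorted values are <= d (or < d with left.open)
  idx <- findInterval(as.numeric(d), sorted, left.open = strict)
  running[idx + 1]
}

started <- cum_count(dat$Start.Date, dat$Count, RollingDateRange)
ended <- cum_count(dat$End.Date, dat$Count, RollingDateRange, strict = TRUE)

# Assumes End.Date >= Start.Date on every row, otherwise counts can go negative
temp <- data.frame(RollingDateRange, Count = started - ended)
```

On the sample data this reproduces the Count column shown above. It sorts each date column once rather than scanning all rows for every date, which is where any speed gain on 1.5M rows would come from. The same two vectors could presumably be computed inside a dplyr mutate() call, though the work would still be done by the base R functions.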