I am trying to implement a sliding-window aggregation. I tried something using tidyr
functions, but I am sure there are better and faster ways to do it.
Let me explain what I want to achieve:
I have an input dataframe dat:
library(dplyr)
library(tidyr)
dat <- tibble(timestamp = seq.POSIXt(as.POSIXct("2019-01-01 00:00:00"), as.POSIXct("2019-01-01 02:00:00"), by = "15 min"))
set.seed(42)
dat$value <- sample(1:5, nrow(dat), replace = TRUE)
dat
# A tibble: 9 x 2
  timestamp           value
  <dttm>              <int>
1 2019-01-01 00:00:00     5
2 2019-01-01 00:15:00     5
3 2019-01-01 00:30:00     2
4 2019-01-01 00:45:00     5
5 2019-01-01 01:00:00     4
6 2019-01-01 01:15:00     3
7 2019-01-01 01:30:00     4
8 2019-01-01 01:45:00     1
9 2019-01-01 02:00:00     4
For every row, I want to find the list of unique values from the value field (ignoring the row's own value if present) that appear in the next 60 minutes. Let's call that list nextvalue. Then expand each row to generate pairs between value and nextvalue. Finally, group_by value and nextvalue, summarise the counts, and sort in descending order of count.
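To make the intended result concrete, the steps above can be sketched in base R. This is only a minimal reference sketch, not a tuned solution: count_next_pairs is a hypothetical helper name, and it assumes timestamps are unique so that "the next 60 minutes" means rows with a strictly later timestamp up to and including 60 minutes ahead. Because it deduplicates values within each window, its counts can be lower than those of a pipeline that counts repeated values separately.

```r
# Hypothetical helper (not part of any package): count (value, nextvalue) pairs,
# where nextvalue ranges over the distinct values seen in the `window_secs`
# seconds after each row, excluding the row's own value.
count_next_pairs <- function(timestamp, value, window_secs = 60 * 60) {
  ts <- as.numeric(timestamp)  # POSIXct -> seconds since the epoch
  pairs <- lapply(seq_along(value), function(i) {
    # Rows strictly after row i and at most `window_secs` later.
    in_window <- ts > ts[i] & ts <= ts[i] + window_secs
    # setdiff() returns distinct values and drops the row's own value.
    nxt <- setdiff(value[in_window], value[i])
    data.frame(value = rep(value[i], length(nxt)), nextvalue = nxt)
  })
  pairs <- do.call(rbind, pairs)
  # No pairs at all (e.g. empty input or a single row): return an empty result.
  if (is.null(pairs) || nrow(pairs) == 0L) {
    return(data.frame(value = value[0], nextvalue = value[0], count = integer(0)))
  }
  # Count occurrences of each (value, nextvalue) pair.
  counts <- aggregate(count ~ value + nextvalue,
                      data = cbind(pairs, count = 1L), FUN = sum)
  # Sort by descending count, breaking ties by value and nextvalue.
  counts <- counts[order(-counts$count, counts$value, counts$nextvalue), ]
  rownames(counts) <- NULL
  counts
}

# Usage with the nine values shown in the table above, entered explicitly so
# the example does not depend on the random number generator.
ts <- seq(as.POSIXct("2019-01-01 00:00:00", tz = "UTC"), by = "15 min", length.out = 9)
vals <- c(5L, 5L, 2L, 5L, 4L, 3L, 4L, 1L, 4L)
res <- count_next_pairs(ts, vals)
head(res, 3)
```

The per-row loop is quadratic in the number of rows, which is fine for illustration but is exactly the part a faster or parallel approach would need to replace.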
I read the docs and put together the code below.
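t <- dat$timestamp
value <- dat$value
# Collapse the values in rows (start + 1)..end into a "|"-separated string.
# Caveat: when start == end (the last row), (start + 1):end counts down, so
# value[start + 1] is NA and paste() emits the string "NA".
getCI <- function(start, end) {
  paste(value[(start + 1):end], collapse = "|")
}
# Column names for separate(); a new name avoids masking the built-in LETTERS.
cols <- LETTERS[1:(length(unique(value)) - 1)]
dat %>%
  mutate(time_next = timestamp + 60 * 60) %>%
  rowwise() %>%
  mutate(flag = max(which(time_next >= t))) %>%  # last row inside the 60-minute window
  ungroup() %>%
  mutate(row = row_number()) %>%
  rowwise() %>%
  mutate(nextvalue = getCI(row, flag)) %>%
  select(value, nextvalue) %>%
  separate(nextvalue, cols, extra = "warn", fill = "right") %>%
  pivot_longer(all_of(cols), names_to = "Letter", values_to = "nextvalue") %>%
  filter(!is.na(nextvalue)) %>%
  filter(value != nextvalue) %>%  # drop pairs of a value with itself
  select(value, nextvalue) %>%
  group_by(value, nextvalue) %>%
  summarise(count = n()) %>%
  arrange(desc(count))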
# A tibble: 13 x 3
# Groups:   value [5]
   value nextvalue count
   <int> <chr>     <int>
 1     5 4             4
 2     2 4             2
 3     3 4             2
 4     4 1             2
 5     5 2             2
 6     5 3             2
 7     1 4             1
 8     2 3             1
 9     2 5             1
10     3 1             1
11     4 3             1
12     4 NA            1
13     5 1             1
However, I would like to see ways to achieve this with much less code and in a much simpler way. I would also be interested in seeing how multicore approaches could be applied to this problem to speed up the computation. Please comment.