This question is nearly identical to: Create new group based on cumulative sum and group
However, when I apply the accepted solution to my data, it doesn't have the expected result.
In a nutshell, I have a data with two variables: domain
and value
. Domain
is a group variable with multiple observations and value
is some continuous value that I would like to accumulate by domain
and great a new group variable, newgroup
. There are three main rules:
- I accumulate only within each
domain
. If I reach the end of thedomain
, then the accumulation is reset. - If the accumulated sum is at least 1.0 then the observations whose values added up to at least 1.0 are assigned to a different value for
group1
. Please note that this rule can be satisfied by a single observation. - If the last group in a
domain
has an accumulated sum less than 1.0, then merge that with the second to last group within the samedomain
. This is reflected in the variablegroup2
The data below has been simplified. The data will usually consist of 10^5 - 10^6 rows, so a vectorized solution would be ideal.
Example Data
domain <- c(rep(1,5),rep(2,8))
value <- c(1,0,2,2.5,0.1,0.1,0.5,0,0.2,0.6,0,0,0.1)
df_raw <- data.frame(domain,value)
domain value
1 1.0
1 0.0
1 2.0
1 2.5
1 0.1
2 0.1
2 0.5
2 0.0
2 0.2
2 0.6
2 0.0
2 0.0
2 0.1
Desired Output
cumsum_val <- c(1,0,2,2.5,0.1,0.1,0.6,0.6,0.8,1.4,0,0,0.1)
group1 <- c(1,2,2,3,4,5,5,5,5,5,6,6,6)
group2 <- c(1,2,2,3,3,4,4,4,4,4,4,4,4) #Satisfies Rule #3
df_want <- data.frame(domain,value,cumsum_val,group1,group2)
domain value cumsum_val group1 group2
1 1.0 1.0 1 1
1 0.0 0.0 2 2
1 2.0 2.0 2 2
1 2.5 2.5 3 3
1 0.1 0.1 4 3
2 0.1 0.1 5 4
2 0.5 0.6 5 4
2 0.0 0.6 5 4
2 0.2 0.8 5 4
2 0.6 1.4 5 4
2 0.0 0.0 6 4
2 0.0 0.0 6 4
2 0.1 0.1 6 4
I used the following code:
sum0 <- function(x, y) { if (x + y >= 1.0) 0 else x + y }
is_start <- function(x) head(c(TRUE, Reduce(sum0, init=0, x, acc = TRUE)[-1] == 0), -1)
cumsum(ave(df_raw$value, df_raw$domain, FUN = is_start))
## 1 2 3 4 5 6 6 6 6 6 7 8 9
but the last line does not produce the same values as group1
above. Generating group1
output is what is mainly causing me issues. Can someone help me understand the function is_start
and how that is supposed to produce the groupings?
EDITakrun
provided some working code in the comments for the simplified example above. However, there are still some situations where it doesn't work. For example,
domain <- c(rep(1,7),rep(2,8))
value <- c(1,0,1,0,2,2.5,0.1,0.1,0.5,0,0.2,0.6,0,0,0.1)
df_raw <- data.frame(domain,value)
The output is show below with new
coming from akrun
's code and group1
and group2
are the desired groupings based on rules #2 and #3. The discrepancy between new
and group2
occurs mainly in the first 3 rows.
domain value new group1 group2
1 1.0 1 1 1
1 0.0 2 2 2
1 1.0 3 2 2
1 0.0 4 3 3
1 2.0 4 3 3
1 2.5 5 4 4
1 0.1 5 5 4
2 0.1 6 6 5
2 0.5 6 6 5
2 0.0 6 6 5
2 0.2 6 6 5
2 0.6 6 6 5
2 0.0 6 7 5
2 0.0 6 7 5
2 0.1 6 7 5