I have the following data.table:
dt
#    unique_id group_id primary_id  ph1  ph2  ph3
# 1:         1        1       TRUE   07   03 <NA>
# 2:         2        1      FALSE   07   03   84
# 3:         3        2      FALSE   10 <NA> <NA>
# 4:         4        2       TRUE <NA>   10 <NA>
# 5:         5        2      FALSE <NA> <NA>   10
# 6:         6        3      FALSE   22   03 <NA>
# 7:         7        3       TRUE <NA>   13   03
unique_ids are grouped by phone numbers (ph1, ph2, ph3) that are shared across rows. For example, in the first group "07" and "03" are common to both rows; in the second group "10" is shared but sits in a different column on each row; and in the third group "03" is shared, again not in the same column.
Each group has exactly one primary_id.
Within each group, I want to remove the shared phone number(s) from the non-primary rows and keep them only on the primary row, so that the rows are no longer linked.
I can do this easily with a for loop, but with millions of groups it is extremely slow.
I'm looking for a faster method.
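To make the rule concrete, here is a rough sketch of one vectorised direction. It assumes that "common" means "also present anywhere on the group's primary row"; numbers shared only between non-primary rows are left untouched. The names `ph_cols`, `primary_nums`, `cand`, `hit` and `out` are made up for illustration, and I have not checked how this scales to millions of groups.

```r
library(data.table)

# Same sample data as in the question, rebuilt so the sketch is self-contained
dt <- data.table(
  unique_id  = c(1, 2, 3, 4, 5, 6, 7),
  group_id   = c(1, 1, 2, 2, 2, 3, 3),
  primary_id = c(TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE),
  ph1 = c("07", "07", "10", NA, NA, "22", NA),
  ph2 = c("03", "03", NA, "10", NA, "03", "13"),
  ph3 = c(NA, "84", NA, NA, "10", NA, "03")
)

ph_cols <- c("ph1", "ph2", "ph3")  # hypothetical helper: the phone columns

# Long table of the (group_id, ph) pairs held by each group's primary row
primary_nums <- melt(dt[primary_id == TRUE], id.vars = "group_id",
                     measure.vars = ph_cols, value.name = "ph", na.rm = TRUE)
primary_nums <- unique(primary_nums[, .(group_id, ph)])

out <- copy(dt)  # leave the original table untouched
for (col in ph_cols) {
  # Rows whose value in this column also appears on their group's primary row
  cand <- data.table(group_id = out$group_id, ph = out[[col]])
  hit  <- cand[primary_nums, on = .(group_id, ph), which = TRUE, nomatch = NULL]
  # Keep the number on the primary row itself; blank it everywhere else
  hit  <- hit[!out$primary_id[hit]]
  if (length(hit)) set(out, i = hit, j = col, value = NA_character_)
}
out
```

The only loop here runs over the three phone columns rather than over groups; the per-group work is done by the join on `group_id` and `ph`. On the sample data this gives the desired output shown further down.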
Data:
library(data.table)
dt <- data.table(structure(list(unique_id = c(1, 2, 3, 4, 5, 6, 7), group_id = c(1,
1, 2, 2, 2, 3, 3), primary_id = c(TRUE, FALSE, FALSE, TRUE, FALSE,
FALSE, TRUE), ph1 = c("07", "07", "10", NA, NA, "22", NA), ph2 = c("03",
"03", NA, "10", NA, "03", "13"), ph3 = c(NA, "84", NA, NA, "10",
NA, "03")), class = "data.frame", row.names = c(NA, -7L))
)
Desired output is:
output <- data.table(structure(list(unique_id = c(1, 2, 3, 4, 5, 6, 7), group_id = c(1,
1, 2, 2, 2, 3, 3), primary_id = c(TRUE, FALSE, FALSE, TRUE, FALSE,
FALSE, TRUE), ph1 = c("07", NA, NA, NA, NA, "22", NA), ph2 = c("03",
NA, NA, "10", NA, NA, "13"), ph3 = c(NA, "84", NA, NA, NA, NA,
"03")), class = "data.frame", row.names = c(NA, -7L)))
output
#    unique_id group_id primary_id  ph1  ph2  ph3
# 1:         1        1       TRUE   07   03 <NA>
# 2:         2        1      FALSE <NA> <NA>   84
# 3:         3        2      FALSE <NA> <NA> <NA>
# 4:         4        2       TRUE <NA>   10 <NA>
# 5:         5        2      FALSE <NA> <NA> <NA>
# 6:         6        3      FALSE   22 <NA> <NA>
# 7:         7        3       TRUE <NA>   13   03
If still unclear, it may be easier to visualize it like this:
![enter image description here]()