Problem
I work on a data.table where each row is a medical observation. The problem is that there are some errors in my data, and I need to correct them before pursuing my analysis. For example, a male patient can have an observation where he is coded as female.
Solution
My solution is to select the mode (most frequent value) of a variable per patient. If a patient has 10 observations as male and one as female, it is safe to assume that he is male.
I found a clever way to do this with data.table:
DATA[j = .N,
     by = .(ID, SEX)][i = base::order(-N),
                      j = .(SEX = SEX[1L]),
                      keyby = ID]
The problem is that when a patient has multiple modes, it just keeps one of them. So a patient who is 50% male and 50% female will be counted as male, which will bias the final results. I would like to code such patients as NA.
The only way I have found to correct this is by using dplyr:
DATA[j = .N,
     by = .(ID, SEX)] %>%
  group_by(ID) %>%
  filter(N == max(N))
and then replacing the SEX value with NA if the ID is duplicated. But this takes much longer than data.table, it is not well optimized, and I have a big data set with many other variables that need the same correction.
Summary
How do I take the mode of a variable per patient and replace it with NA if it is not unique?
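For the many-variables case, one possible direction is a small helper applied to several columns at once through `.SD`. This is only a sketch under assumptions: `mode_or_na` and the `SMOKER` column are made-up names for illustration, the columns are assumed to be character, and the approach has not been timed on large data.

```r
library(data.table)

# Hypothetical helper: return the single most frequent value of x,
# or NA when several values share the top count (or when x is all NA)
mode_or_na <- function(x) {
  tab <- table(x)                       # NA values are dropped by default
  if (length(tab) == 0L) return(NA_character_)
  top <- names(tab)[tab == max(tab)]    # every value that reaches the top count
  if (length(top) == 1L) top else NA_character_
}

# Toy data; SMOKER is an invented second variable
DATA <- data.table(
  ID     = c("1", "1", "1", "2", "2", "2", "2"),
  SEX    = c("M", "M", "F", "M", "F", "F", "M"),
  SMOKER = c("Y", "N", "N", "Y", "Y", "Y", "N")
)

# Apply the helper to each listed column, one row per patient
CLEAN <- DATA[, lapply(.SD, mode_or_na), keyby = ID, .SDcols = c("SEX", "SMOKER")]
```

Because `table()` returns names as character, non-character columns would need to be converted back afterwards.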
Example
ID <- c(rep(x = "1", 6), rep(x = "2", 6))
SEX <- c("M","M","M","M","F","M","M","F","M","F","F","M")
require(data.table)
DATA <- data.table(ID, SEX)
# First method (doesn't work: a tie is silently resolved to a single value)
DATA[j = .N,
     by = .(ID, SEX)][i = base::order(-N),
                      j = .(SEX = SEX[1L]),
                      keyby = ID]
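# Second method (works with dplyr)
require(dplyr)
DATA[j = .N,
     by = .(ID, SEX)] %>%
  group_by(ID) %>%
  # keep only the value(s) with the highest count for each patient
  filter(N == max(N)) %>%
  # more than one row left for an ID means a tie: mark it as a real missing value
  mutate(SEX = if_else(condition = duplicated(ID),
                       true = NA_character_,
                       false = SEX)) %>%
  # keep the last row per patient (the NA row when there is a tie)
  filter(row_number() == n())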
# Applied to my data it took 84.288 seconds
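For comparison, the desired behaviour can probably be expressed in data.table alone. The following is a minimal sketch on the same toy data, not a benchmarked solution: it counts each (ID, SEX) pair, then keeps the top value per patient only when exactly one value reaches the maximum count. The names `counts` and `RESULT` are chosen here for illustration.

```r
library(data.table)

# Same toy data as above: patient 1 is mostly "M", patient 2 is a 3/3 tie
DATA <- data.table(
  ID  = c(rep("1", 6), rep("2", 6)),
  SEX = c("M", "M", "M", "M", "F", "M", "M", "F", "M", "F", "F", "M")
)

# Step 1: number of observations per (ID, SEX) pair
counts <- DATA[, .N, by = .(ID, SEX)]

# Step 2: per patient, return the most frequent value if it is unique, else NA
RESULT <- counts[
  ,
  .(SEX = if (sum(N == max(N)) == 1L) SEX[which.max(N)] else NA_character_),
  keyby = ID
]
```

On this example, patient 1 should come out as "M" and patient 2 as NA, since "M" and "F" each occur three times.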