Channel: Active questions tagged r - Stack Overflow

How to filter my data.table by condition and by group?


Problem

I am working with a data.table in which each row is a medical observation. Some of the data contain errors, and I need to correct them before pursuing my analysis. For example, a male patient can have an observation in which he is coded as female.

Solution

My solution is to take the mode (most frequent value) of a variable for each patient. If a patient has 10 observations coded as male and one as female, it is safe to assume that he is male.

I have found this clever way to do it with data.table:

DATA[j  = .N, 
     by = .(ID, SEX)][i = base::order(-N), 
     j = .(SEX = SEX[1L]), 
     keyby = ID]

The problem is that when a patient has multiple modes, it keeps only one. So a patient who is 50% male and 50% female will be counted as male, which will bias the results. I would like to code such patients as NA.

The only way I have found to correct this is by using dplyr:

DATA[j  = .N, 
     by = .(ID, SEX)] %>% 
     group_by(ID) %>% 
     filter(N == max(N))

and then replace the SEX value with NA if the ID is duplicated. But this takes much longer than data.table and is not well optimized, and I have a big data set with many variables that would need to be corrected as well.

Summary

How do I take the mode of a variable for each patient and replace it with NA if it is not unique?

Example

ID <- c(rep(x = "1", 6), rep(x = "2", 6))
SEX <- c("M","M","M","M","F","M","M","F","M","F","F","M")

require(data.table)
DATA <- data.table(ID, SEX)

# First method (doesn't work: keeps only one mode when there is a tie)
DATA[j  = .N, 
     by = .(ID, SEX)][i = base::order(-N), 
     j = .(SEX = SEX[1L]), 
     keyby = ID]

# Second method (works with dplyr)
require(dplyr)
DATA[j  = .N, 
     by = .(ID, SEX)] %>% 
     group_by(ID) %>% 
     filter(N == max(N)) %>%
     mutate(SEX = if_else(condition = duplicated(ID),
                          true = NA_character_,  # a real missing value, not the string "NA"
                          false = SEX)) %>%
     filter(row_number() == n())

# Applied to my data it took 84.288 seconds
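
For comparison, here is a minimal sketch of how the same tie-aware mode could be computed with data.table alone, using the toy data from the example. It is only one possible approach, not a benchmarked solution: the intermediate names `counts`, `modes` and `top` are made up for illustration, and I have not measured how it performs on a large data set.

```r
library(data.table)

# Toy data from the example: patient 1 is mostly "M", patient 2 is tied.
DATA <- data.table(
  ID  = c(rep("1", 6), rep("2", 6)),
  SEX = c("M","M","M","M","F","M","M","F","M","F","F","M")
)

# Step 1: count observations per (ID, SEX) pair.
counts <- DATA[, .N, by = .(ID, SEX)]

# Step 2: per patient, keep the value with the highest count,
# or NA when more than one value reaches that maximum.
modes <- counts[, {
  top <- SEX[N == max(N)]  # hypothetical helper: all values tied for first place
  list(SEX = if (length(top) == 1L) top else NA_character_)
}, keyby = ID]

modes
```

Returning `NA_character_` rather than a bare `NA` keeps the SEX column character in every group, so the per-group results combine without a type mismatch.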
