I need to fill in missing values in all columns of a data frame, within each combination of ID and time point, for the subgroup of rows that have data from several sources. If it is not too complicated, values from source B should take priority (e.g., for variable Y of id 2 in the data below).
Using the code below, this currently works (without prioritizing) for one column at a time, but since it's a large data frame with millions of rows, it needs to be automated further. I would also like to stay within the data.table framework if possible. Any advice?
# Data
id time X Y Source
1 2005 67 NA A
1 2005 NA 1.1 B
1 2005 NA 1.1 B
2 2003 85 NA B
2 2003 NA 0.4 A
2 2003 85 0.5 B
# Desired output
id time X Y Source
1 2005 67 1.1 A
1 2005 67 1.1 B
1 2005 67 1.1 B
2 2003 85 0.5 B
2 2003 85 0.4 A
2 2003 85 0.5 B
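library(data.table)
# dat must be a data.table for := to work (e.g. after setDT(dat))
# Find duplicates: flag rows whose (id, time) combination occurs more than once
dup <- duplicated(dat, by = c("id", "time")) | duplicated(dat, by = c("id", "time"), fromLast = TRUE)
# Replace NA in column X with the last non-missing X from the same id and time
dat[dup & is.na(X), X := dat[!is.na(X)][.SD, on = .(id, time), mult = "last", X]]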
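
For what it's worth, here is a rough sketch of how the single-column update above might be extended to every value column at once while preferring source B. It is only one possible approach, not a tested solution for the full data: the helper name `fill_prefer_b` and the vector `value_cols` are made up for illustration, the sample data is rebuilt by hand from the table above, and it takes the first non-missing value (B rows first) rather than the last.

```r
library(data.table)

# Sample data rebuilt from the table in the question
dat <- data.table(
  id     = c(1, 1, 1, 2, 2, 2),
  time   = c(2005, 2005, 2005, 2003, 2003, 2003),
  X      = c(67, NA, NA, 85, NA, 85),
  Y      = c(NA, 1.1, 1.1, NA, 0.4, 0.5),
  Source = c("A", "B", "B", "B", "A", "B")
)

# Assumption: every column except the keys and Source should be filled
value_cols <- setdiff(names(dat), c("id", "time", "Source"))

# Hypothetical helper: fill the NAs in v with the first non-missing value
# of the group, looking at rows from source "B" before any other source
fill_prefer_b <- function(v, src) {
  ord  <- order(src != "B")        # FALSE sorts before TRUE, so B rows come first
  cand <- v[ord]
  cand <- cand[!is.na(cand)]
  if (length(cand) > 0L) v[is.na(v)] <- cand[1L]
  v                                # existing non-missing values are left untouched
}

# Apply the helper to all value columns within each id/time group, by reference
dat[, (value_cols) := lapply(.SD, fill_prefer_b, src = Source),
    by = .(id, time), .SDcols = value_cols]

print(dat)
```

Groups with a single row pass through unchanged, so the separate `dup` flag is not needed in this version. With millions of rows and many small groups, the per-group function call may become the bottleneck, so it would be worth timing this on a subset first.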