I have two tables with a many-to-many relationship on two keys.
One of the tables contains NA values in one of the keys. These NA values typically appear when the other table has only one existing value for that key.
I would like to join on both keys when the second key is not NA, and on the first key alone when the second key is NA.
At the moment I use a two-step approach, but I'm wondering if there is a better way to do it.
Here is my reproducible example:
library(data.table)
set.seed(14)
dt1 <-
  data.table(
    key1 = c("A", "A", "B", "B", "C"),
    key2 = c("A-opt1", "A-opt2", "B-opt1", "B-opt1", "C-opt1"),
    measure_1 = rpois(5, 2)
  )
print(dt1)
#>    key1   key2 measure_1
#> 1:    A A-opt1         1
#> 2:    A A-opt2         2
#> 3:    B B-opt1         5
#> 4:    B B-opt1         2
#> 5:    C C-opt1         5
dt2 <-
  data.table(
    id = c(1:5),
    key1 = c("A", "A", "A", "B", "C"),
    key2 = c("A-opt1", "A-opt2", "A-opt2", NA, NA),
    measure_2 = rnorm(5)
  )
print(dt2)
#>    id key1   key2  measure_2
#> 1:  1    A A-opt1  0.0287647
#> 2:  2    A A-opt2 -0.1803785
#> 3:  3    A A-opt2 -0.3011443
#> 4:  4    B   <NA> -0.9790001
#> 5:  5    C   <NA>  1.0416423
# This is my current two step approach
result <- dt1[dt2, on = .(key1 == key1, key2 == key2), nomatch = 0L]
result <- rbind(
  result,
  dt1[dt2[is.na(key2)], on = .(key1 == key1), nomatch = 0L][, .SD, .SDcols = names(result)]
)
print(result)
#>    key1   key2 measure_1 id  measure_2
#> 1:    A A-opt1         1  1  0.0287647
#> 2:    A A-opt2         2  2 -0.1803785
#> 3:    A A-opt2         2  3 -0.3011443
#> 4:    B B-opt1         5  4 -0.9790001
#> 5:    B B-opt1         2  4 -0.9790001
#> 6:    C C-opt1         5  5  1.0416423
Created on 2019-11-13 by the reprex package (v0.3.0)
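
For comparison, one possible way to express the same rule as a single join followed by a filter is sketched below. This is only a sketch, not a recommended replacement: it assumes key1 is never NA in either table, and the names candidates and one_step are placeholders introduced for illustration. The idea is to join on key1 alone and then keep a row when the key2 coming from dt2 is NA or equals the key2 from dt1.

```r
library(data.table)

# Same example data as in the question
set.seed(14)
dt1 <- data.table(
  key1 = c("A", "A", "B", "B", "C"),
  key2 = c("A-opt1", "A-opt2", "B-opt1", "B-opt1", "C-opt1"),
  measure_1 = rpois(5, 2)
)
dt2 <- data.table(
  id = 1:5,
  key1 = c("A", "A", "A", "B", "C"),
  key2 = c("A-opt1", "A-opt2", "A-opt2", NA, NA),
  measure_2 = rnorm(5)
)

# Join on key1 only. dt2's key2 is not a join column here, so it is carried
# along as i.key2 because its name clashes with dt1's key2.
candidates <- dt1[dt2, on = .(key1), nomatch = 0L, allow.cartesian = TRUE]

# Keep a candidate row when dt2's key2 is NA (treated as "any key2")
# or when it equals dt1's key2.
one_step <- candidates[is.na(i.key2) | key2 == i.key2]

# Drop the helper column so the layout matches the two-step result
one_step[, i.key2 := NULL]
print(one_step)
```

On the example data this should give the same six rows, in the same order, as the two-step result. The trade-off is that the intermediate join materializes every key1 match before filtering, so it may use noticeably more memory than the two-step approach when key1 groups are large.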