I'd like to find the closest match (smallest difference) of a variable between two groups, but if the closest match has already been made, move on to the next closest match, until n number of matches have been made.
I used the code from this answer (below) to find the closest match of a value between Samples for each pairwise grouping of all groups (i.e. Location by VAR).
However, there are many repeats, and the top match for Sample.x 1, 2, and 3 might all be Sample.y 1.
What I'd like to do instead is find the next closest match for Sample.x 2, then 3, etc., until a specified number of distinct (Sample.x-Sample.y) matches have been made. The order of Sample.x is not important; I'm just looking for the top n matches between Sample.x and Sample.y for a given grouping.
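To make the desired behaviour concrete, here is a minimal base-R sketch of what I mean: walk through the candidate pairs from smallest to largest difference and skip any pair whose Sample.x or Sample.y has already been used. The helper name greedy_match and the toy values are hypothetical, made up purely for illustration, and a greedy pass like this is only one possible way to pick the pairs.

```r
# Hypothetical helper (not from any package): greedy one-to-one matching.
# `pairs` needs columns Sample.x, Sample.y and DIFF; `n` is the number of matches wanted.
greedy_match <- function(pairs, n) {
  # Consider candidate pairs from smallest to largest difference
  pairs <- pairs[order(pairs$DIFF), , drop = FALSE]
  used_x <- c()
  used_y <- c()
  keep <- integer(0)
  for (i in seq_len(nrow(pairs))) {
    if (length(keep) >= n) break
    x <- pairs$Sample.x[i]
    y <- pairs$Sample.y[i]
    # Skip the pair if either sample has already been matched
    if (x %in% used_x || y %in% used_y) next
    used_x <- c(used_x, x)
    used_y <- c(used_y, y)
    keep <- c(keep, i)
  }
  pairs[keep, , drop = FALSE]
}

# Toy data (made up): every pairing of three x samples with three y samples
x_vals <- c(10, 13, 50)
y_vals <- c(11, 40, 90)
toy <- expand.grid(Sample.x = 1:3, Sample.y = 1:3)
toy$DIFF <- abs(x_vals[toy$Sample.x] - y_vals[toy$Sample.y])

matches <- greedy_match(toy, n = 2)
print(matches)
```

In this toy case Sample.x 2 is closest to Sample.y 1, but Sample.y 1 has already been taken by Sample.x 1, so that pair is skipped and the next unused pair is chosen instead. In the real data this would have to run separately for each Location pair and VAR.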
I attempted to do this with dplyr::distinct as shown below, but I am unsure how to use the distinct entries for Sample.y to filter the data frame and then filter again by smallest DIFF. Even then, this won't necessarily result in unique Sample pairings.
Is there a smart way to accomplish this in R with dplyr? Is there a name for this type of operation?
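library(data.table)  # rbindlist(); loaded first so dplyr's first()/last() take precedence
library(dplyr)
library(tidyr)       # gather()
library(purrr)       # map()

set.seed(1)  # make the random Var1 values reproducible

df01 <- data.frame(Location = rep(c("A", "C"), each = 10),
                   Sample = rep(c(1:10), times = 2),
                   Var1 = signif(runif(20, 55, 58), digits = 4),
                   Var2 = rep(c(1:10), times = 2))
df001 <- data.frame(Location = rep(c("B"), each = 10),
                    Sample = rep(c(1:10), times = 1),
                    Var1 = c(1.2, 1.3, 1.4, 1.6, 56, 110.1, 111.6, 111.7, 111.8, 120.5),
                    Var2 = c(1.5, 10.1, 10.2, 11.7, 12.5, 13.6, 14.4, 18.1, 20.9, 21.3))
df <- rbind(df01, df001)
# reshape to long format: one row per Location, Sample and VAR
dfl <- df %>% gather(VAR, value, 3:4)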
df.result <- df %>%
  # get the unique elements of Location
  distinct(Location) %>%
  # pull the column as a vector
  pull %>%
  # it is factor, so convert it to character
  as.character %>%
  # get the pairwise combinations in a list
  combn(m = 2, simplify = FALSE) %>%
  # loop through the list with map and do the full_join
  # with the long format data dfl
  map(~ full_join(dfl %>%
                    filter(Location == first(.x)),
                  dfl %>%
                    filter(Location == last(.x)), by = "VAR") %>%
        # create a column of absolute difference
        mutate(DIFF = abs(value.x - value.y)) %>%
        # grouped by VAR, Sample.x
        group_by(VAR, Sample.x) %>%
        # apply the top_n with wt as DIFF
        # here I choose 5,
        # and then hope that this is enough to get a smaller n of final matches
        top_n(-5, DIFF) %>%
        mutate(GG = paste(Location.x, Location.y, sep="-")))
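res1 <- rbindlist(df.result)
# sort by DIFF so the closest row is kept for each distinct Sample.y;
# .keep_all = TRUE retains DIFF, which the next step needs
res2 <- res1 %>% arrange(DIFF) %>% group_by(GG, VAR) %>% distinct(Sample.y, .keep_all = TRUE)
# take the 2 smallest differences per group (Sample.x may still repeat)
res3 <- res2 %>% group_by(GG, VAR) %>% top_n(-2, DIFF)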