Channel: Active questions tagged r - Stack Overflow

Find closest match, then next closest, between groups until a specified number of matches has been made

I'd like to find the closest match (smallest difference) of a variable between two groups, but if the closest match has already been used, move on to the next closest match, until n matches have been made.

I used the code from this answer (below) to find the closest match of a value between Samples for each pairwise grouping of all groups (i.e. Location by VAR).

However, there are many repeats: the top matches for Sample.x 1, 2, and 3 might all be Sample.y 1.

What I'd like to do instead is find the next closest match for Sample.x 2, then 3, etc., until a specified number of distinct (Sample.x-Sample.y) matches have been made. The order of Sample.x is not important; I'm just looking for the top n matches between Sample.x and Sample.y for a given grouping.

I attempted to do this with dplyr::distinct, as shown below, but I am unsure how to use the distinct entries for Sample.y to filter the data frame and then filter again by smallest DIFF. In any case, this won't necessarily result in unique Sample pairings.

Is there a smart way to accomplish this in R with dplyr? Is there a name for this type of operation?

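library(dplyr)       # %>%, distinct, full_join, group_by, top_n
library(tidyr)       # gather
library(purrr)       # map
library(data.table)  # rbindlist

set.seed(1)  # added so the runif() values are reproducible

df01 <- data.frame(Location = rep(c("A", "C"), each = 10),
                   Sample = rep(1:10, times = 2),
                   Var1 = signif(runif(20, 55, 58), digits = 4),
                   Var2 = rep(1:10, times = 2))
df001 <- data.frame(Location = rep("B", each = 10),
                    Sample = rep(1:10, times = 1),
                    Var1 = c(1.2, 1.3, 1.4, 1.6, 56, 110.1, 111.6, 111.7, 111.8, 120.5),
                    Var2 = c(1.5, 10.1, 10.2, 11.7, 12.5, 13.6, 14.4, 18.1, 20.9, 21.3))
df <- rbind(df01, df001)
# long format: one row per Location, Sample and VAR (Var1 or Var2)
dfl <- df %>% gather(VAR, value, 3:4)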

df.result <- df %>% 
  # get the unique elements of Location
  distinct(Location) %>% 
  # pull the column as a vector
  pull %>% 
  # convert to character in case it is a factor
  as.character %>% 
  # get the pairwise combinations in a list
  combn(m = 2, simplify = FALSE) %>%
  # loop through the list with map and do the full_join
  # with the long format data dfl
  map(~ full_join(dfl %>% 
                    filter(Location == first(.x)), 
                  dfl %>% 
                    filter(Location == last(.x)), by = "VAR") %>% 
        # create a column of absolute difference
        mutate(DIFF = abs(value.x - value.y)) %>%
        # grouped by VAR, Sample.x
        group_by(VAR, Sample.x) %>%
        # apply the top_n with wt as DIFF
        # here I choose 5, 
        # and then hope that this is enough to get a smaller n of final matches
        top_n(-5, DIFF) %>%
        mutate(GG = paste(Location.x, Location.y, sep="-")))

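res1 <- rbindlist(df.result)
# .keep_all = TRUE keeps DIFF (and the other columns) so the next step can use it;
# note that distinct() keeps the first row per Sample.y, not the one with the smallest DIFF
res2 <- res1 %>% group_by(GG, VAR) %>% distinct(Sample.y, .keep_all = TRUE)
res3 <- res2 %>% group_by(GG, VAR) %>% top_n(-2, DIFF)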
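
The operation described above (take the closest pair, then the next closest pair whose Sample.x and Sample.y have not been used yet, until n pairs are found) is usually called greedy matching without replacement. Below is a minimal sketch of that idea, not a definitive implementation. The helper `greedy_top_n` and the `toy` data are hypothetical names made up for illustration; the sketch only assumes a table of candidate pairs with columns Sample.x, Sample.y and DIFF, like the one built by the full_join step.

```r
library(dplyr)

# Hypothetical helper: keep up to n rows with the smallest DIFF such that
# no Sample.x and no Sample.y appears more than once.
greedy_top_n <- function(pairs, n) {
  pairs <- pairs[order(pairs$DIFF), , drop = FALSE]  # closest pairs first
  keep <- logical(nrow(pairs))
  used_x <- c()
  used_y <- c()
  for (i in seq_len(nrow(pairs))) {
    if (sum(keep) >= n) break
    x <- pairs$Sample.x[i]
    y <- pairs$Sample.y[i]
    # accept the pair only if both samples are still unmatched
    if (!(x %in% used_x) && !(y %in% used_y)) {
      keep[i] <- TRUE
      used_x <- c(used_x, x)
      used_y <- c(used_y, y)
    }
  }
  pairs[keep, , drop = FALSE]
}

# Toy data (made up): Sample.x 1 and 2 are both closest to Sample.y 1
val_x <- c(10, 11, 12)
val_y <- c(10.4, 13, 20)
toy <- expand.grid(Sample.x = 1:3, Sample.y = 1:3)
toy$DIFF <- abs(val_x[toy$Sample.x] - val_y[toy$Sample.y])

top2 <- greedy_top_n(toy, n = 2)  # keeps pairs (1, 1) and (3, 2)
top3 <- greedy_top_n(toy, n = 3)  # additionally keeps (2, 3)

# Per-group use with dplyr: apply the helper within each GG/VAR grouping
toy_grouped <- bind_rows(
  mutate(toy, GG = "A-B", VAR = "Var1"),
  mutate(toy, GG = "A-C", VAR = "Var1")
)
per_group <- toy_grouped %>%
  group_by(GG, VAR) %>%
  group_modify(~ greedy_top_n(.x, n = 2)) %>%
  ungroup()
```

The greedy choice does not necessarily minimise the total DIFF over all selected pairs; if that matters, the problem becomes an assignment problem, which needs a different algorithm.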
