Channel: Active questions tagged r - Stack Overflow

How to find overlapping regions between two data frames based on conditions

I have two data frames, strain_1 and strain_2. Each has four columns (st_A, ed_A, st_B, ed_B), but the two data frames have different numbers of rows. st_A, ed_A and st_B, ed_B are the "start" and "end" positions of block_A and block_B, respectively (see image 1 and the example below).

I am looking to identify the common overlapping blocks between strain_1 and strain_2.

Taking an example from image 1:

strain_1:
st_A    ed_A    st_B        ed_B
7       9       123         127
25      28      97          98
35      38      140         145


strain_2:
st_A    ed_A    st_B        ed_B
5       8       124         129
20      25      95          100
36      39      141         147
..      ..      ..          .. 
..      ..      ..          ..

From this example, we see three overlapping regions (image 1).

For block_A and block_B respectively, the overlapping region is defined by the minimum of the two st_A (or st_B) values and the maximum of the two ed_A (or ed_B) values (see image 2: green box = common region).
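To make the definition concrete, here is a minimal sketch for a single pair of blocks. The helper names `intervals_overlap` and `common_region` are hypothetical, and the overlap test assumes closed intervals (both end positions inclusive), which the post does not state explicitly.

```r
# Hypothetical helper names; they do not come from the original post.
# Two closed intervals [st1, ed1] and [st2, ed2] overlap when each one
# starts no later than the other one ends.
intervals_overlap <- function(st1, ed1, st2, ed2) {
  st1 <= ed2 & st2 <= ed1
}

# Common region as defined above: smallest start and largest end.
common_region <- function(st1, ed1, st2, ed2) {
  c(start = min(st1, st2), end = max(ed1, ed2))
}

# block_A of the first rows of strain_1 and strain_2: 7-9 versus 5-8
intervals_overlap(7, 9, 5, 8)
common_region(7, 9, 5, 8)
```

For the first rows of the example, this gives a block_A region from 5 to 9, which matches the first row of the desired result.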

The objective is to create a new data frame with these common regions (pairs of blocks):

    result_desired:
    st_A    ed_A     st_B     ed_B
    5       9        123      129
    20      28       95       100
    35      39       140      147

There are 16 possible combinations (see image 3), depending on the size of each block.
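If the 16 combinations come from the two possible orderings of each of the four coordinates (an assumption, since image 3 is not reproduced here), they may not need to be enumerated: `pmin()` and `pmax()` pick the smaller start and the larger end whatever the ordering. The sketch below is one possible approach, not a definitive one. It compares every row of strain_1 with every row of strain_2, so the row counts need not match, and it assumes a pair counts only when both block_A and block_B overlap.

```r
strain_1 <- data.frame(st_A = c(7, 25, 35), ed_A = c(9, 28, 38),
                       st_B = c(123, 97, 140), ed_B = c(127, 98, 145))
strain_2 <- data.frame(st_A = c(5, 20, 36), ed_A = c(8, 25, 39),
                       st_B = c(124, 95, 141), ed_B = c(129, 100, 147))

# Every (row of strain_1, row of strain_2) index pair.
idx <- expand.grid(i = seq_len(nrow(strain_1)), j = seq_len(nrow(strain_2)))
s1 <- strain_1[idx$i, ]
s2 <- strain_2[idx$j, ]

# Assumption: keep a pair only if block_A and block_B both overlap
# (closed intervals, end positions inclusive).
hit <- s1$st_A <= s2$ed_A & s2$st_A <= s1$ed_A &
       s1$st_B <= s2$ed_B & s2$st_B <= s1$ed_B

# Smallest start and largest end for each block, without any branching.
result <- data.frame(
  st_A = pmin(s1$st_A, s2$st_A)[hit],
  ed_A = pmax(s1$ed_A, s2$ed_A)[hit],
  st_B = pmin(s1$st_B, s2$st_B)[hit],
  ed_B = pmax(s1$ed_B, s2$ed_B)[hit]
)
result
```

On the three example rows this reproduces result_desired. The intermediate tables hold nrow(strain_1) * nrow(strain_2) rows, so with several thousand rows per data frame memory use grows quickly; an interval-join function such as `data.table::foverlaps` might scale better, though that has not been tried here.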

Is there a fast way to do this, given that my data has several thousand rows?

I'm testing with an if/else chain inside a loop (based on image 3), but the two data frames do not have the same number of rows:

    # res is not defined in the snippet; assumed here to be a one-column data frame
    res <- data.frame(region = character(nrow(strain_1)), stringsAsFactors = FALSE)
    # Loop over rows; fails once i exceeds nrow(strain_2), because the comparisons become NA
    for (i in seq_len(nrow(strain_1))) {
        if (strain_1[i,1] <= strain_2[i,1] & strain_1[i,2] <= strain_2[i,2] & strain_1[i,3] <= strain_2[i,3] & strain_1[i,4] <= strain_2[i,4]) {
            res[i,1] <- paste("start_b1:", strain_1[i,1], "end_b1:", strain_2[i,2], "start_b2 :", strain_1[i,3], "end_b2 :", strain_2[i,4])
        } else if (strain_1[i,1] <= strain_2[i,1] & strain_1[i,2] <= strain_2[i,2] & strain_1[i,3] >= strain_2[i,3] & strain_1[i,4] <= strain_2[i,4]) {
            res[i,1] <- paste("start_b1:", strain_1[i,1], "end_b1:", strain_2[i,2], "start_b2 :", strain_2[i,3], "end_b2 :", strain_1[i,4])
        # . case 3
        # . case 4
        # .
        # .
        # .
        # else if (case 16) { res[i,1] <- paste("start_b1:", strain_2[i,1], "end_b1:", strain_2[i,2], "start_b2:", strain_2[i,3], "end_b2:", strain_2[i,4]) }
        } else {
            res[i,1] <- ""
        }
    }

Thanks for your help.

