Subset only those rows whose intervals does not fa

2019-08-08 01:08发布

问题:

How can i compare two data frames (test and control) of unequal length, and remove the row from test based on three criteria, i) if the test$chr == control$chr ii) test$start and test$end lies with in the range of control$start and control$end iii) test$CNA and control$CNA are same.

    test = 
        R_level  logp   chr start   end     CNA    Gene
        2     7.079     11  1159    1360    gain   Recl,Bcl
        11    2.4       12  6335    6345    loss   Pekg
        3     19        13  7180    7229    loss   Sox1

control =

  R_level    logp   chr  start  end     CNA    Gene
        2     5.9     11  1100  1400    gain   Recl,Bcl 
        2     3.46    11  1002  1345    gain    Trp1
        2     6.4     12  6705  6845    gain    Pekg
        4     7       13  6480  8129    loss    Sox1

The result should look something like this

result =
     R_level     logp   chr start   end     CNA     Gene
          11      2.4    12  6335   6345    loss   Pekg

回答1:

Here's one way using foverlaps() from data.table.

require(data.table) # v1.9.4+
dt1 <- as.data.table(test)
dt2 <- as.data.table(control)
setkey(dt2, chr, CNA, start, end)

olaps = foverlaps(dt1, dt2, nomatch=0L, which=TRUE, type="within")
#    xid yid
# 1:   1   2
# 2:   3   4

dt1[!olaps$xid]
#    R_level logp chr start  end  CNA Gene
# 1:      11  2.4  12  6335 6345 loss Pekg

Read ?foverlaps and see the examples section for more info.

Alternatively, you can also use GenomicRanges package. However, you might have to filter based on CNA after merging by overlapping regions (AFAICT).



回答2:

When you say "exclude the variable", I assume you mean you want to remove the rows that satisfies those criteria.

If so, you are nearly there. The following should work:

exclude_bool <- data1[,3] == data2[,3] &
data1[,4] > data2[,5] &
data1[,5] < data2[,4] &
data1[,6] == data2[,6] 

data1 <- data1[!exclude_bool , ]