I would like to infer the genomic intervals shared between different samples.
My input:
sample chr start end
NE001 1 100 200
NE001 2 100 200
NE002 1 50 150
NE002 2 50 150
NE003 2 250 300
My expected output:
chr start end freq
1 100 150 2
2 100 150 2
Here "freq" is the number of samples that contributed to the shared region. In the above example freq = 2 (NE001 and NE002).
Cheers!
This is certainly very long (and likely very inefficient on large data.frames, given the expand.grid.df); however, I hope it gives you a starting point. As a caveat, I have no background in genomics (which I'm sure comes through), so I had no idea of the common packages for this. Surely those are the best way to go. I just thought it would be fun to attempt a solution.
If your data is in a data.frame (see below), I use the Bioconductor GenomicRanges package to create a GRanges instance, keeping the non-range columns too
The discrete ranges represented by the data are given by the disjoin function, and the overlap between the disjoint ranges ('query') and your original ('subject') are
Split the sample information associated with each overlapping subject with the corresponding query, and associate it with the disjoint GRanges as
leading to for instance
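The steps above could be sketched roughly as follows. This is a minimal illustration rather than the exact code of this answer: the variable names (`df`, `gr`, `d`, `olaps`, `shared`) and the `freq >= 2` filter are assumptions chosen to match the question's expected output, and the number of distinct samples per disjoint range is counted with base R's `tapply`.

```r
library(GenomicRanges)

# Example data from the question (chr kept as character on purpose)
df <- data.frame(
  sample = c("NE001", "NE001", "NE002", "NE002", "NE003"),
  chr    = c("1", "2", "1", "2", "2"),
  start  = c(100L, 100L, 50L, 50L, 250L),
  end    = c(200L, 200L, 150L, 150L, 300L),
  stringsAsFactors = FALSE
)

# Build a GRanges, keeping the non-range column ('sample') as metadata
gr <- makeGRangesFromDataFrame(df, keep.extra.columns = TRUE)

# Non-overlapping pieces covering the same positions as the input ranges
d <- disjoin(gr)

# Which original ranges ('subject') overlap each disjoint piece ('query')
olaps <- findOverlaps(d, gr)

# Count the distinct samples contributing to each disjoint piece
d$freq <- as.integer(tapply(
  gr$sample[subjectHits(olaps)],
  queryHits(olaps),
  function(s) length(unique(s))
))

# Keep the pieces shared by at least two samples (assumed threshold)
shared <- d[d$freq >= 2]
res <- as.data.frame(shared)[, c("seqnames", "start", "end", "freq")]
res
```

Raising the threshold (for example to the total number of samples) would restrict the result to regions present in every sample.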
Here's how I input your data:
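One plausible way to enter the example table as a data.frame is shown below. The name `df` and the use of `read.table(text = ...)` are assumptions for illustration; any method that yields the same four columns would do.

```r
# Read the question's table from an inline string.
# 'chr' is read as character so chromosome names such as "X" would also work.
df <- read.table(
  text = "sample chr start end
NE001 1 100 200
NE001 2 100 200
NE002 1 50 150
NE002 2 50 150
NE003 2 250 300",
  header = TRUE,
  colClasses = c("character", "character", "integer", "integer")
)
df
```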
Given the context behind this question, I suspect it's going to be worthwhile for you to learn the GenomicRanges package from Bioconductor.
The approach being: find all self-overlaps, remove the trivial ones where an interval is being compared to itself (4th line), and then find the intersection between each pair of remaining intervals. You can then tabulate the results however you wish.
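A rough sketch of that self-overlap approach, under some assumptions: the object names (`gr`, `hits`, `shared`) and the `pair` column are illustrative, and the filter keeps only hits where the query index is smaller than the subject index, which drops self-comparisons and also avoids reporting each pair twice.

```r
library(GenomicRanges)

# Example data from the question
gr <- GRanges(
  seqnames = c("1", "2", "1", "2", "2"),
  ranges   = IRanges(start = c(100, 100, 50, 50, 250),
                     end   = c(200, 200, 150, 150, 300)),
  sample   = c("NE001", "NE001", "NE002", "NE002", "NE003")
)

# All overlaps of the intervals against themselves
hits <- findOverlaps(gr, gr)

# Drop self-comparisons and mirrored duplicates (assumed filtering choice)
hits <- hits[queryHits(hits) < subjectHits(hits)]

# Parallel intersection of each remaining pair of intervals
shared <- pintersect(gr[queryHits(hits)], gr[subjectHits(hits)])

# Record which two samples produced each shared interval
shared$pair <- paste(gr$sample[queryHits(hits)],
                     gr$sample[subjectHits(hits)],
                     sep = "/")
shared
```

Note that this reports pairwise intersections, so a region shared by three samples would appear as several pairwise rows that still need to be tabulated.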