Merge overlapping numeric ranges into continuous r

2020-06-06 02:32发布

问题:

I am trying to merge a range of genomic coordinates into continuous ranges, with an additional option for merging across gaps.

For example, if I had the genomic ranges [[0, 1000], [5, 1100]] I would want the result to be [0, 1100]. If the offset option was set to 100, and the input was [[0, 1000], [1090, 1000]] I would once again want the result to be [0, 1100].

I have implemented a way of doing this that steps through the alignments sequentially and tries to merge on the previous ending position and next starting position, but it fails because the actual results have varying lengths. For example I have the results [[138, 821],[177, 1158], [224, 905], [401, 1169]] in my list sorted by the start positions. The answer to that should be [138, 1169] but I instead get [[138, 1158], [177, 905], [224, 1169]]. Obviously I need to take more into account than just the previous ending and the next start, but I haven't found a good solution (preferably one that isn't a huge nest of if statements). Anyone have any suggestions?

def overlap_alignments(align, gene, overlap):
    #make sure alignments are sorted first by chromosome then by start pos on chrom
    align = sorted(align, key = lambda x: (x[0], x[1]))
    merged = list()
    for i in xrange(1, len(align)):
        prv, nxt = align[i-1], align[i]
        if prv[0] == nxt[0] and prv[2] + overlap >= nxt[1]:
            start, end = prv[1], nxt[2]
            chrom = prv[0]
            merged.append([chrom, start, end, gene])
    return merged

回答1:

Well, how about keeping track of every start and end and the number of ranges where each position belongs to?

def overlap_alignments(align, overlap):
    # create a list of starts and ends
    stends = [ (a[0], 1) for a in align ]
    stends += [ (a[1] + overlap, -1) for a in align ]
    stends.sort(key=lambda x: x[0])

    # now we should have a list of starts and ends ordered by position,
    # e.g. if the ranges are 5..10, 8..15, and 12..13, we have
    # (5,1), (8,1), (10,-1), (12,1), (13,-1), (15,-1)

    # next, we form a cumulative sum of this
    s = 0
    cs = []
    for se in stends:
        s += se[1]
        cs.append((se[0], s))
    # this is, with the numbers above, (5,1), (8,2), (10,1), (12,2), (13,1), (15,0)
    # so, 5..8 belongs to one range, 8..10 belongs to two overlapping range,
    # 10..12 belongs to one range, etc

    # now we'll find all contiguous ranges
    # when we traverse through the list of depths (number of overlapping ranges), a new
    # range starts when the earlier number of overlapping ranges has been 0
    # a range ends when the new number of overlapping ranges is zero 
    prevdepth = 0
    start = 0
    combined = []
    for pos, depth in cs:
        if prevdepth == 0:
            start = pos
        elif depth == 0
            combined.append((start, pos-overlap))
        prevdepth = depth

    return combined

This would be easier to draw than to explain. (And yes, the cumulative sum could be written in a shorter space, but i find it clearer this way.)

To explain this graphically, lets take input ([5,10],[8,15],[12,13],[16,20]) and overlap=1.

.....XXXXXo.............. (5-10)
........XXXXXXXo......... (8-15)
............Xo........... (12-13)
................XXXXo.... (16-20)
.....1112221221111111.... number of ranges at each position
.....----------------.... number of ranges > 0
.....---------------..... overlap corrected (5-20)


回答2:

Python comes batteries included:

from itertools import chain

flatten = chain.from_iterable

LEFT, RIGHT = 1, -1

def join_ranges(data, offset=0):
    data = sorted(flatten(((start, LEFT), (stop + offset, RIGHT))
            for start, stop in data))
    c = 0
    for value, label in data:
        if c == 0:
            x = value
        c += label
        if c == 0:
            yield x, value - offset

if __name__ == '__main__':
    print list(join_ranges([[138, 821], [900, 910], [905, 915]]))
    print list(join_ranges([[138, 821], [900, 910], [905, 915]], 80))

Result:

[(138, 821), (900, 915)]
[(138, 915)]

How it works: we label every start and end point as such, then we sort and afterwards we simply count up for every start point and down for every end point. If we have visited the same number of start and end points, we have a closed (joined) range.