Increasing the performance of visualising overlapp

2019-04-10 07:55发布

问题:

I have a set of pairs of x points to draw segments along the x axis to create a custom read map in R:

Half the task of plotting these segments is deciding their y positions so that no two segments that overlap are on the same y level. For each segment, I iterate over y levels from the first position until I get to a position that does not yet contain a segment that will overlap the current one. I then record the end position of the current segment and move to the next one.

The actual code is a function as follows:

# Dummy data
# A list of start and end positions for each segment along the X axis. Sorted by start.
# Passing the function few.reads draws a map in half a second. Passing it many.reads takes about half an hour to complete.
few.reads <- data.frame( start=c(rep(10,150), rep(16,100), rep(43,50)), end=c(rep(30,150), rep(34,100), rep(57,50)) );
many.reads <- data.frame( start=c(rep(10,15000), rep(16,10000), rep(43,5000)), end=c(rep(30,15000), rep(34,10000), rep(57,5000)) );

#---
# A function to draw a series of overlapping segments (or "reads" in my along
# The x-axis. Where reads overlap, they are "stacked" down the y axis
#---
drawReads <- function(reads){

    # sort the reads by their start positions
    reads <- reads[order(reads$start),];

    # minimum and maximum for x axis
    minstart <- min(reads$start);
    maxend <- max(reads$end);

    # initialise yread: a list to keep track of used y levels
    yread <- c(minstart - 1);
    ypos <- c(); #holds the y position of the ith segment

    #---
    # This iteration step is the bottleneck. Worst case, when all reads are stacked on top
    # of each other, it has to iterate over many y levels to find the correct position for
    # the later reads
    #---
    # iterate over segments
    for (r in 1:nrow(reads)){
        read <- reads[r,];
        start <- read$start;
        placed <- FALSE;

        # iterate through yread to find the next availible
        # y pos at this x pos (start)
        y <- 1;
        while(!placed){

            if(yread[y] < start){
                ypos[r] <- y;
                yread[y] <- read$end;
                placed <- TRUE;
            } 

            # current y pos is used by another segment, increment
            y <- y + 1;
            # initialize another y pos if we're at the end of the list
            if(y > length(yread)){
                yread[y] <- minstart-1;
            }
        }
    }

    #---
    # This is the plotting step
    # Once we are here the rest of the process is very quick
    #---
    # find the maximum y pos that is used to size up the plot
    maxy <- length(yread);
    miny = 1;


    reads$ypos <- ypos + miny;

    print("New Plot...")
    # Now we have all the information, start the plot
    plot.new();
    plot.window(xlim=c(minstart, maxend+((maxend-minstart)/10)), ylim=c(1,maxy));

    axis(3,xaxp=c(minstart,maxend,(maxend-minstart)/10));
    axis(2, yaxp=c(miny,maxy,3),tick=FALSE,labels=FALSE);

    print("Draw the reads...");
    maxy <- max(reads$ypos);
    segments(reads$start, maxy-reads$ypos, reads$end, maxy-reads$ypos, col="blue");   
}

My actual dataset is very large, and contains regions that can have up to 600000 reads as far as I can tell. The reads will naturally stack on top of each other, so it is very easy to realise the worst-case scenario, where all reads are overlapping each other. The time it takes to plot large numbers of reads is unacceptable for me, so I'm looking for a way to make the process more efficient. Can I replace my loops with something quicker? Is there an algorithm that can arrange the reads quicker? I really can't think of a better way of doing this at the moment.

Thanks for your help.

回答1:

Fill each y-level in a greedy fashion. After a level is filled, go one level down and never go back up.

Pseudocode:

 y <- 1
 while segment-list.not-empty
   i <- 1
   current <- segment-list[i]
   current.plot(y)
   segment-list.remove(i)
   i <- segment-list.find_first_greater(current.end)
   while (i > 0)
     current <- segment-list[i]
     current.plot(y)
     segment-list.remove(i)
   y <- y + 1

This does not necessarily produce an "optimal" plot in any sense, but at least it's O(n log n).



回答2:

Can you not sort on the start value? Then you walk through the list from front to back. For each item, plot it, then do a binary search on the remainder of the list for the first item greater than the end coordinate of the item just plotted. If none is found, increase Y. Remove each item as you plot it.

The sort is O(N lg N) and the binary search is O(lg N) for each item, so the total is O(N lg N).