Quickest way to determine range overlap in Perl

2020-06-23 06:23发布

问题:

I have two sets of ranges. Each range is a pair of integers (start and end) representing some sub-range of a single larger range. The two sets of ranges are in a structure similar to this (of course the ...s would be replaced with actual numbers).

$a_ranges =
{
  a_1 =>
  {
    start => ...,
    end   => ...,
  },
  a_2 =>
  {
    start => ...,
    end   => ...,
  },
  a_3 =>
  {
    start => ...,
    end   => ...,
  },
  # and so on
};

$b_ranges =
{
  b_1 =>
  {
    start => ...,
    end   => ...,
  },
  b_2 =>
  {
    start => ...,
    end   => ...,
  },
  b_3 =>
  {
    start => ...,
    end   => ...,
  },
  # and so on
};

I need to determine which ranges from set A overlap with which ranges from set B. Given two ranges, it's easy to determine whether they overlap. I've simply been using a double loop to do this--loop through all elements in set A in the outer loop, loop through all elements of set B in the inner loop, and keep track of which ones overlap.

I'm having two problems with this approach. First is that the overlap space is extremely sparse--even if there are thousands of ranges in each set, I expect each range from set A to overlap with maybe 1 or 2 ranges from set B. My approach enumerates every single possibility, which is overkill. This leads to my second problem--the fact that it scales very poorly. The code finishes very quickly (sub-minute) when there are hundreds of ranges in each set, but takes a very long time (+/- 30 minutes) when there are thousands of ranges in each set.

Is there a better way I can index these ranges so that I'm not doing so many unnecessary checks for overlap?

Update: The output I'm looking for is two hashes (one for each set of ranges) where the keys are range IDs and the values are the IDs of the ranges from the other set that overlap with the given range in this set.

回答1:

This sounds like the perfect use case for an interval tree, which is a data structure specifically designed to support this operation. If you have two sets of intervals of sizes m and n, then you can build one of them into an interval tree in time O(m lg m) and then do n intersection queries in time O(n lg m + k), where k is the total number of intersections you find. This gives a net runtime of O((m + n) lg m + k). Remember that in the worst case k = O(nm), so this isn't any better than what you have, but for cases where the number of intersections is sparse this can be substantially better than the O(mn) runtime you have right now.

I don't have much experience working with interval trees (and zero experience in Perl, sorry!), but from the description it seems like they shouldn't be that hard to build. I'd be pretty surprised if one didn't exist already.

Hope this helps!



回答2:

In case you are not exclusively tied to perl; The IRanges package in R deals with interval arithmetic. It has very powerful primitives, it would probably be easy to code a solution with them.

A second remark is that the problem could possibly become very easy if the intervals have additional structure; for example, if within each set of ranges there is no overlap (in that case a linear approach sifting through the two ordered sets simultaneously is possible). Even in the absence of such structure, the least you can do is to sort one set of ranges by start point, and the other set by end point, then break out of the inner loop once a match is no longer possible. Of course, existing and general algorithms and data structures such as the interval tree mentioned earlier are the most powerful.



回答3:

There are Several existing CPAN modules that solve this issue, I have developed 2 of them: Data::Range::Compare and Data::Range::Compare::Stream

Data::Range::Compare only works with arrays in memory, but supports generic range types.

Data::Range::Compare::Stream Works with streams of data via iterators, but it requires understanding OO Perl to extend to generic data types. Data::Range::Compare::Stream is recommended if you are processing very very large sets of data.

Here is an Excerpt form the Examples folder of Data::Range::Compare::Stream.

Given these 3 sets of data:

Numeric Range set: A contained in file: source_a.src
+----------+
| 1 - 11   |
| 13 - 44  |
| 17 - 23  |
| 55 - 66  |
+----------+

Numeric Range set: B contained in file: source_b.src
+----------+
| 0  - 1   |
| 2  - 29  |
| 88 - 133 |
+----------+

Numeric Range set: C contained in file: source_c.src
+-----------+
| 17  - 29  |
| 220 - 240 |
| 241 - 250 |
+-----------+

The expected output would be:

+--------------------------------------------------------------------+
| Common Range | Numeric Range A | Numeric Range B | Numeric Range C |
+--------------------------------------------------------------------+
| 0 - 0        |   No Data       |   0 - 1         |   No Data       |
| 1 - 1        |   1 - 11        |   0 - 1         |   No Data       |
| 2 - 11       |   1 - 11        |   2 - 29        |   No Data       |
| 12 - 12      |   No Data       |   2 - 29        |   No Data       |
| 13 - 16      |   13 - 44       |   2 - 29        |   No Data       |
| 17 - 29      |   13 - 44       |   2 - 29        |   17 - 29       |
| 30 - 44      |   13 - 44       |   No Data       |   No Data       |
| 55 - 66      |   55 - 66       |   No Data       |   No Data       |
| 88 - 133     |   No Data       |   88 - 133      |   No Data       |
| 220 - 240    |   No Data       |   No Data       |   220 - 240     |
| 241 - 250    |   No Data       |   No Data       |   241 - 250     |
+--------------------------------------------------------------------+

The Source code can be found here.

#!/usr/bin/perl

use strict;
use warnings;
use Data::Dumper;
use lib qw(./ ../lib);

# custom package from FILE_EXAMPLE.pl
use Data::Range::Compare::Stream::Iterator::File;


use Data::Range::Compare::Stream;
use Data::Range::Compare::Stream::Iterator::Consolidate;
use Data::Range::Compare::Stream::Iterator::Compare::Asc;

my $source_a=Data::Range::Compare::Stream::Iterator::File->new(filename=>'source_a.src');
my $source_b=Data::Range::Compare::Stream::Iterator::File->new(filename=>'source_b.src');
my $source_c=Data::Range::Compare::Stream::Iterator::File->new(filename=>'source_c.src');

my $consolidator_a=new Data::Range::Compare::Stream::Iterator::Consolidate($source_a);
my $consolidator_b=new Data::Range::Compare::Stream::Iterator::Consolidate($source_b);
my $consolidator_c=new Data::Range::Compare::Stream::Iterator::Consolidate($source_c);


my $compare=new  Data::Range::Compare::Stream::Iterator::Compare::Asc();

my $src_id_a=$compare->add_consolidator($consolidator_a);
my $src_id_b=$compare->add_consolidator($consolidator_b);
my $src_id_c=$compare->add_consolidator($consolidator_c);



print "  +--------------------------------------------------------------------+
  | Common Range | Numeric Range A | Numeric Range B | Numeric Range C |
  +--------------------------------------------------------------------+\n";

my $format='  | %-12s |   %-13s |   %-13s |   %-13s |'."\n";
while($compare->has_next) {
  my $result=$compare->get_next;
  my $string=$result->to_string;
  my @data=($result->get_common);
  next if $result->is_empty;
  for(0 .. 2) {
     my $column=$result->get_column_by_id($_);
     unless(defined($column)) {
      $column="No Data";
    } else {
      $column=$column->get_common->to_string;
    }
     push @data,$column;


   }
  printf $format,@data;

  } 
 print "  +--------------------------------------------------------------------+\n";


回答4:

Try Tree::RB but to find mutually exclusive ranges, no overlaps

The performance is not bad, if I had about 10000 segments and had to find the segment for each discrete number. My input had 300 million records. I neaded to put them into separate buckets. Like partitioning the data. Tree::RB worked out great.

$var = [
[0,90],
[91,2930],
[2950,8293]
.
.
.
]

my lookup value were 10, 99, 991 ...

and basically I needed the position of the range for the given number

With this below comparison function, mine uses something like this:

my $cmp = sub
{
    my ($a1, $b1) = @_;

    if(ref($b1) && ref($a1))
    {
        return ($$a1[1]) <=> ($$b1[0]);
    }

    my $ret = 0;

    if(ref($a1) eq 'ARRAY')
    {
        #
        if($$a1[0] <= $b1 && $b1 >= $$a1[1])
        {
            $ret =  0;
        }
        if($$a1[0] < $b1)
        {
            $ret =  -1;
        }
        if($$a1[1] > $b1)
        {
            $ret =  1;
        }
    }
    else
    {
        if($$b1[0] <= $a1 && $a1 >= $$b1[1])
        {
            $ret =  0;
        }
        if($$b1[0] > $a1)
        {
            $ret =  -1;
        }
        if($$b1[1] < $a1)
        {
            $ret =  1;
        }
    }

    return $ret;
}


回答5:

I should check time in order to know if its the fastest way, but according to the structure of your data you should try this:

use strict;

my $fromA = 12;
my $toA = 15;
my $fromB = 7;
my $toB = 35;

my @common_range = get_common_range($fromA, $toA, $fromB, $toB);
my $common_range = $common_range[0]."-".$common_range[-1];

sub get_common_range {
  my @A = $_[0]..$_[1];
  my %B = map {$_ => 1} $_[2]..$_[3];
  my @common = ();

  foreach my $i (@A) {
    if (defined $B{$i}) {
      push (@common, $i);
    } 
  }
  return sort {$a <=> $b} @common;
}