How can I efficiently count ranges that cover a gi

2019-05-10 06:21发布

I have a database of some 30k ranges, each is given as a pair of start and end points:

[12,80],[34,60],[34,9000],[76,743],...

I would like to write a Perl subroutine that a range (not from the database), and returns the number of ranges in the database which fully 'include' the given range.

For example, if we had only those 4 ranges in the database and the query range is [38,70], the subroutine should return 2, since the first and third ranges both fully contain the query range.

The problem: I wish to make the queries as "cheap" as possible, I don't mind doing lots of pre-processing, if it helps.

A couple of notes:

  1. I used the word "database" freely, I don't mean an actual database (e.g. SQL); it's just a long list of ranges.

  2. My world is circular... There is a given max_length (e.g. 9999) and ranges like [8541,6] are legal (you can think of it as a single range that is the union of [8541,9999] and [1,6]).

Thanks, Dave

UPDATE This was my original code:

use strict;
use warnings;

my $max_length = 200;
my @ranges     = (
    { START => 10,   END => 100 },
    { START => 30,   END => 90 },
    { START => 50, END => 80 },
    { START => 180,  END => 30 }
);

sub n_covering_ranges($) {
    my ($query_h) = shift;
    my $start     = $query_h->{START};
    my $end       = $query_h->{END};
    my $count     = 0;
    if ( $end >= $start ) {

        # query range is normal
        foreach my $range_h (@ranges) {
            if (( $start >= $range_h->{START} and $end <= $range_h->{END} )
                or (    $range_h->{END} <= $range_h->{START} and  $range_h->{START} <= $end )
                or ( $range_h->{END} <= $range_h->{START} and  $range_h->{END} >= $end)
                )
            {
                $count++;
            }
        }

    }

    else {

        # query range is hanging over edge
        # only other hanging over edges can contain it
        foreach my $range_h (@ranges) {
            if ( $start >= $range_h->{START} and $end <= $range_h->{END} ) {
                $count++;
            }
        }

    }

    return $count;
}

print n_covering_ranges( { START => 1, END => 10 } ), "\n";
print n_covering_ranges( { START => 30, END => 70 } ), "\n";

and, yes, I know the ifs are ugly and can be made much nicer and more efficient.

UPDATE 2 - BENCHMARKING SUGGESTED SOLUTIONS

I've don some benchmarking for the two purposed solutions so far: the naive one, suggested by cjm, which is similar to my original solutions, and the memory-demanding one, suggested by Aristotle Pagaltzis Thanks again for both of you!

To compare the two, I created the following packages which use the same interface:

use strict;
use warnings;

package RangeMap;

sub new {
    my $class      = shift;
    my $max_length = shift;
    my @lookup;
    for (@_) {
        my ( $start, $end ) = @$_;
        my @idx
            = $end >= $start
            ? $start .. $end
            : ( $start .. $max_length, 0 .. $end );
        for my $i (@idx) { $lookup[$i] .= pack 'L', $end }
    }
    bless \@lookup, $class;
}

sub num_ranges_containing {
    my $self = shift;
    my ( $start, $end ) = @_;
    return 0 unless defined $self->[$start];
    return 0 + grep { $end <= $_ } unpack 'L*', $self->[$start];
}

1;

and:

use strict;
use warnings;

package cjm;

sub new {
    my $class      = shift;
    my $max_length = shift;

    my $self = {};
    bless $self, $class;

    $self->{MAX_LENGTH} = $max_length;

    my @normal  = ();
    my @wrapped = ();

    foreach my $r (@_) {
        if ( $r->[0] <= $r->[1] ) {
            push @normal, $r;
        }
        else {
            push @wrapped, $r;
        }
    }

    $self->{NORMAL}  = \@normal;
    $self->{WRAPPED} = \@wrapped;
    return $self;
}

sub num_ranges_containing {
    my $self = shift;
    my ( $start, $end ) = @_;

    if ( $start <= $end ) {

        # This is a normal range
        return ( grep { $_->[0] <= $start and $_->[1] >= $end }
                @{ $self->{NORMAL} } )
            + ( grep { $end <= $_->[1] or $_->[0] <= $start }
                @{ $self->{WRAPPED} } );
    }
    else {

        # This is a wrapped range
        return ( grep { $_->[0] <= $start and $_->[1] >= $end }
                @{ $self->{WRAPPED} } )

            # This part should probably be calculated only once:
            + ( grep { $_->[0] == 1 and $_->[1] == $self->{MAX_LENGTH} }
                @{ $self->{NORMAL} } );
    }
}

1;

I then used some real data: $max_length=3150000, about 17000 ranges with an average size of a few thousands, and finally queried the objects with some 10000 queries. I timed the creation of the object (adding all the ranges) and the querying. The results:

cjm creation done in 0.0082 seconds
cjm querying done in 21.209857 seconds
RangeMap creation done in 45.840982 seconds
RangeMap querying done in 0.04941 seconds

Congratulations Aristotle Pagaltzis! Your implementation is super-fast! To use this solution, however, I will obviously like to do the pre-processing (creation) of the object once. Can I store (nstore) this object after its creation? I've Never done this before. And how should I retrieve it? Anything special? Hopefully the retrieval will be fast so it won't effect the overall performance of this great data structure.

UPDATE 3

I tried a simple nstore and retrieve for the RangeMap object. This seems to work fine. The only problem is the resulting file is around 1GB, and I will have some 1000 such file. I could live with a TB of storage for this, but I wonder if there's anyway to store it more efficiently without significantly effecting retrieval performance too much. Also see here: http://www.perlmonks.org/?node_id=861961.

UPDATE 4 - RangeMap bug

Unfortunately, RangeMap has a bug. Thanks to BrowserUK from PerlMonks for pointing that out. For example, create an object with $max_lenght=10 and as single range [6,2]. Then query for [7,8]. The answer should be 1, not 0.

I think this updated package should do the work:

use strict;
use warnings;

package FastRanges;

sub new($$$) {
    my $class      = shift;
    my $max_length = shift;
    my $ranges_a   = shift;
    my @lookup;
    for ( @{$ranges_a} ) {
        my ( $start, $end ) = @$_;
        my @idx
            = $end >= $start
            ? $start .. $end
            : ( $start .. $max_length, 1 .. $end );
        for my $i (@idx) { $lookup[$i] .= pack 'L', $end }
    }
    bless \@lookup, $class;
}

sub num_ranges_containing($$$) {
    my $self = shift;
    my ( $start, $end ) = @_;    # query range coordinates

    return 0
        unless ( defined $self->[$start] )
        ;    # no ranges overlap the start position of the query

    if ( $end >= $start ) {

        # query range is simple
        # any inverted range in {LOOKUP}[$start] must contain it,
        # and so does any simple range which ends at or after $end
        return 0 + grep { $_ < $start or $end <= $_ } unpack 'L*',
            $self->[$start];
    }
    else {

        # query range is inverted
        # only inverted ranges in {LOOKUP}[$start] which also end
        # at of after $end contain it. simple ranges can't contain
        # the query range
        return 0 + grep { $_ < $start and $end <= $_ } unpack 'L*',
            $self->[$start];
    }
}

1;

Your comments will be welcomed.

6条回答
霸刀☆藐视天下
2楼-- · 2019-05-10 06:59

Do you have a lot of memory available?

my $max_length = 9999;
my @range = ( [12,80],[34,60],[34,9000] );

my @lookup;

for ( @range ) {
    my ( $start, $end ) = @$_;
    my @idx = $end >= $start ? $start .. $end : ( $start .. $max_length, 0 .. $end );
    for my $i ( @idx ) { $lookup[$i] .= pack "L", $end }
}

Now you have an array of packed number lists in @lookup where the packed list at each index contains the ends of all ranges which include that point. So to check how many ranges contain another range, you look up its starting index in the array, and then count the number of entries from the packed list at that index, which are smaller or equal that the ending index. This algorithm is O(n) with respect to the maximum number of ranges covering any one point (with the limit being the total number of ranges), with a very small overhead per iteration.

sub num_ranges_containing {
    my ( $start, $end ) = @_;

    return 0 unless defined $lookup[$start];

    # simple ranges can be contained in inverted ranges,
    # but inverted ranges can only be contained in inverted ranges
    my $counter = ( $start <= $end )
        ? sub { 0 + grep { $_ < $start or  $end <= $_ } }
        : sub { 0 + grep { $_ < $start and $end <= $_ } };

    return $counter->( unpack 'L*', $lookup[$start] );
}

Untested.

For extra neatness,

package RangeMap;

sub new {
    my $class = shift;
    my $max_length = shift;
    my @lookup;
    for ( @_ ) {
        my ( $start, $end ) = @$_;
        my @idx = $end >= $start ? $start .. $end : ( $start .. $max_length, 0 .. $end );
        for my $i ( @idx ) { $lookup[$i] .= pack 'L', $end }
    }
    bless \@lookup, $class;
}

sub num_ranges_containing {
    my $self = shift;
    my ( $start, $end ) = @_;

    return 0 unless defined $self->[$start];

    # simple ranges can be contained in inverted ranges,
    # but inverted ranges can only be contained in inverted ranges
    my $counter = ( $start <= $end )
        ? sub { 0 + grep { $_ < $start or  $end <= $_ } }
        : sub { 0 + grep { $_ < $start and $end <= $_ } };

    return $counter->( unpack 'L*', $self->[$start] );
}

package main;
my $rm = RangeMap->new( 9999, [12,80],[34,60],[34,9000] );

That way you can have any number of ranges.

Also untested.

查看更多
The star\"
3楼-- · 2019-05-10 07:07

I think problems like this illustrate the maintainability benefits of breaking a job down into to small, easily grasped pieces (admittedly, with one cost being more lines of code).

The simplest idea is that of an ordinary, non-wrapping range.

package SimpleRange;

sub new {
    my $class = shift;
    my ($m, $n) = @_;
    bless { start => $m, end => $n }, $class;
}

sub start { shift->{start} }
sub end   { shift->{end}   }

sub covers {
    # Returns true if the range covers some other range.
    my ($self, $other) = @_;
    return 1 if $self->start <= $other->start
            and $self->end   >= $other->end;
    return;
}

Using that building block, we can create wrapping range class, which consists of either 1 or 2 simple ranges (2 if the range wraps around the edge of the universe). Like the class for simple ranges, this class defines a covers method. The logic in that method is fairly intuitive because we can use the covers method provided by our SimpleRange objects.

package WrappingRange;

sub new {
    my $class = shift;
    my ($raw_range, $MIN, $MAX) = @_;
    my ($m, $n) = @$raw_range;

    # Handle special case: a range that wraps all the way around.
    ($m, $n) = ($MIN, $MAX) if $m == $n + 1;

    my $self = {min => $MIN, max => $MAX};
    if ($m <= $n){
        $self->{top}  = SimpleRange->new($m, $n);
        $self->{wrap} = undef;
    }
    else {
        $self->{top}  = SimpleRange->new($m, $MAX);
        $self->{wrap} = SimpleRange->new($MIN, $n);    
    }
    bless $self, $class;
}

sub top  { shift->{top}  }
sub wrap { shift->{wrap} }
sub is_simple { ! shift->{wrap} }

sub simple_ranges {
    my $self = shift;
    return $self->is_simple ? $self->top : ($self->top, $self->wrap);
}

sub covers {
    my @selfR  = shift->simple_ranges;
    my @otherR = shift->simple_ranges;
    while (@selfR and @otherR){
        if ( $selfR[0]->covers($otherR[0]) ){
            shift @otherR;
        }
        else {
            shift @selfR;
        }
    }
    return if @otherR;
    return 1;
}

Run some tests:

package main;
main();

sub main {
    my ($MIN, $MAX) = (0, 200);

    my @raw_ranges = (
        [10, 100], [30, 90], [50, 80], [$MIN, $MAX],
        [180, 30], 
        [$MAX, $MAX - 1], [$MAX, $MAX - 2],
        [50, 49], [50, 48],
    );
    my @wrapping_ranges = map WrappingRange->new($_, $MIN, $MAX), @raw_ranges;

    my @tests = ( [1, 10], [30, 70], [160, 10], [190, 5] );
    for my $t (@tests){
        $t = WrappingRange->new($t, $MIN, $MAX);

        my @covers = map $_->covers($t) ? 1 : 0, @wrapping_ranges;

        my $n;
        $n += $_ for @covers;
        print "@covers  N=$n\n";
    }
}

Output:

0 0 0 1 1 1 1 1 1  N=6
1 1 0 1 0 1 1 1 0  N=6
0 0 0 1 0 1 0 1 1  N=4
0 0 0 1 1 1 0 1 1  N=5
查看更多
Animai°情兽
4楼-- · 2019-05-10 07:09

Pretty sure there's a better way to do this, but here's a starting point:

Preprocessing:

  • Create two lists, one sorted by the start value of the ranges, one by the end.

Once you get your range:

  • Use a binary search to match it's start in the start-sorted list
  • Use another binary search to match it's end in the end-sorted list
  • Find the ranges that appear in both lists (@start[0..$start_index] and @end[$end_index..$#end]).
查看更多
乱世女痞
5楼-- · 2019-05-10 07:12

Here's one approach to the brute force solution:

use strict;
use warnings;

my @ranges = ([12,80],[34,60],[34,9000],[76,743]);

# Split ranges between normal & wrapped:
my (@normal, @wrapped);

foreach my $r (@ranges) {
  if ($r->[0] <= $r->[1]) {
    push @normal, $r;
  } else {
    push @wrapped, $r;
  }
}

sub count_matches
{
  my ($start, $end, $max_length, $normal, $wrapped) = @_;

  if ($start <= $end) {
    # This is a normal range
    return (grep { $_->[0] <= $start and $_->[1] >= $end } @$normal)
        +  (grep { $end <= $_->[1] or $_->[0] <= $start } @$wrapped);
  } else {
    # This is a wrapped range
    return (grep { $_->[0] <= $start and $_->[1] >= $end } @$wrapped)
        # This part should probably be calculated only once:
        +  (grep { $_->[0] == 1 and $_->[1] == $max_length } @$normal);
  }
} # end count_matches

print count_matches(38,70, 9999, \@normal, \@wrapped)."\n";
查看更多
萌系小妹纸
6楼-- · 2019-05-10 07:12

There's an easier way than rolling your own ranges: use Number::Interval:

my @ranges     = (
    { START => 10,   END => 100 },
    { START => 30,   END => 90 },
    { START => 50, END => 80 },
    { START => 180,  END => 30 }
);
my @intervals;
for my $range ( @ranges ) {
  my $int = new Number::Interval( Min => $range->{START},
                                  Max => $range->{END} );
  push @intervals, $int;
}

Then you can use the intersection() method to find out if two ranges overlap:

my $num_overlap = 0;
my $checkinterval = new Number::Interval( Min => $min, Max => $max );
for my $int ( @intervals ) {
  $num_overlap++ if $checkinterval->intersection( $int );
}

I'm not quite sure what it will do with your "circular" ranges (they'd be classified as "inverted" intervals by Number::Interval) so you'd have to do a little experimentation.

But using a module really beats rolling your own range comparison methods.

Edit: Actually, looking at the documentation a little more closely, intersection() won't do what you want (in fact, it modifies one of the interval objects). You probably want to use contains() on the start and end values, and if both of those are contained within another interval, then that first interval is contained by the second.

Of course, you could update Number::Interval to add this functionality... :-)

查看更多
相关推荐>>
7楼-- · 2019-05-10 07:13

Which part are you having problems with? What have you tried so far? It's a fairly simple task:

  * Iterate through the ranges
  * Foreach range, check if the test range is in it.
  * Profile and benchmark

It's fairly simple Perl:

 my $test = [ $n, $m ];
 my @contains = map { 
      $test->[0] >= $_->[0] 
         and 
      $test->[1] <= $_->[1]
      } @ranges

For the wrap-around ranges, the trick is to decompose those into separate ranges before you look at them. It's brute force work.

And, just as a social note, the rate of your question asking is pretty high: higher than I would expect of someone who is genuinely trying to solve their own problems. I think you're running to Stackoverflow too quickly and instead of getting help, you're really outsourcing your job. That's not really that nice. We don't get paid at all, and especially not paid to do the work assigned to you. This might be much different if you at least tried an implementation of your problem, but a lot of your questions seem to indicate that you didn't even try.

查看更多
登录 后发表回答