A range intersection algorithm better than O(n)?

2020-01-30 04:52发布

问题:

Range intersection is a simple, but non-trivial problem.

Its has been answered twice already:

  • Find number range intersection
  • Comparing date ranges

The first solutions is O(n) and the second solution is for a database (which is less than O(n) of course).

I have the same problem, but for a large n and I am not within a database.

This problem seems to be very similar to Store 2D points for quick retrieval of those inside a rectangle but I don't see how it maps.

So what data structure would you store the set of ranges in, such that a search on a range costs less than O(n)? (Extra credit for using libraries available for Java)

EDIT:

I want to get a subset of all intersecting ranges, meaning the search range could intersect multiple ranges.

The method that needs to be less than O(n) in Java is:

public class RangeSet {
    ....
    public Set<Range> intersects(Range range);
    ....
}

Where Range is just a class containing a pair of int start and end.

This is not an impossible question, I already have the solution, I just wanted to see if there was a more standard/simpler way of doing it

回答1:

The standard approach is to use an interval tree.

In computer science, an interval tree is a tree data structure to hold intervals. Specifically, it allows one to efficiently find all intervals that overlap with any given interval or point. It is often used for windowing queries, for instance, to find all roads on a computerized map inside a rectangular viewport, or to find all visible elements inside a three-dimensional scene. A similar data structure is the segment tree.

The trivial solution is to visit each interval and test whether it intersects the given point or interval, which requires O(n) time, where n is the number of intervals in the collection. Since a query may return all intervals, for example if the query is a large interval intersecting all intervals in the collection, this is asymptotically optimal; however, we can do better by considering output-sensitive algorithms, where the runtime is expressed in terms of m, the number of intervals produced by the query. Interval trees have a query time of O(log n + m) and an initial creation time of O(n log n), while limiting memory consumption to O(n). After creation, interval trees may be dynamic, allowing efficient insertion and deletion of an interval in O(log n). If the endpoints of intervals are within a small integer range (e.g., in the range [1,...,O(n)]), faster data structures exist[1] with preprocessing time O(n) and query time O(1+m) for reporting m intervals containing a given query point.



回答2:

Edit: It sounds like this solution is more or less an Interval Tree. A more complete implementation of an Interval Tree can be found here.

class TreeNode
{
public:
    long pivot;
    List<Range> leaves;  //Any ranges that intersect the pivot
    TreeNode left;        //Tree nodes that fall to the left of the pivot
    TreeNode right;       //Tree nodes that fall to the right of the pivot
};

Prep O(n log n):

  1. Create the list of ranges
  2. Pick the pivot points (possibly by using a sorted list of the end dates.) ??
  3. Build your tree.

Search:

  1. Use binary search to find the first pivot that is >= TestRange.End
  2. Traverse the tree until the pivot > TestRange.Start

    2a. Add the leaves to your result.


Example:

Ranges:

  • 0 - 2
  • 1 - 2
  • 2 - 3
  • 1 - 4
  • 2 - 4
  • 0 - 5
  • 4 - 5
  • 2 - 6
  • 3 - 7

Tree:

                             4
               --------------+------------------
               3             |                 7
               |            1-4                |
               |            2-4                |
               |            0-5                |
               |            4-5                |
      ---------+------                 --------+--------
      2        |    null              6        |       null
 -----+----   2-3                 ----+----   3-7
null  |  null                   null  |  null    
     0-2                             2-6
     1-2


回答3:

Non Overlapping Ranges:

Prep O(n log n):

  1. Make a array / vector of the ranges.
  2. Sort the vector by the end of the range (break ties by sorting by the start of the range)

Search:

  1. Use binary search to find the first range with an End value of >= TestRange.Start
  2. Iterator starting at the binary search until you find an Start > TestRange.End:

    2a. If the range if the current range is within the TestRange, add it to your result.



回答4:

This depends on your exact problem, in the linked question, the ranges where distinct, no common part, and the searched ranged could span multiple ranges. If your problem is the same, its is really easy: Take an array of the ranges, sort them by their lowest values (since they do not overlap, this would be also the same order as sorted by their upper values).

Now just make a binsearch for the your target lower value (or smaller if not exact) and one for the target upper value(or bigger if not exact). The resulting indexes are the the ranges which are coverd. You have to check wheter the ranges at the indexes itself are in- or excluded, but that are just 2 checks. Overall complexity O(log n).



回答5:

Overlapping Ranges:

Prep O(n log n):

  1. Make a array / vector of the ranges.
  2. Sort the vector by the end of the range (break ties by sorting by the start of the range)
  3. Make a second vector of ints. This represents the point at which you can stop searching.

    int stop[size];
    stop[size-1] = Ranges[size - 1].start;
    for (int i = size - 2; i >= 0; i--)
    {
        stop[i] = min(Ranges[i].start, stop[i+1]);
    }
    

Search:

  1. Use binary search to find the first range with an End value of >= TestRange.Start
  2. Iterator starting at the binary search until stop[i] > TestRange.End:

    2a. If the range if the current range is within the TestRange, add it to your result.



回答6:

If ranges overlap, and one wants to retrieve all the ranges that overlap (or contain) a given target range, most of the solutions above don't appear to work.

As some have pointed out, if (worst-case) all the ranges happen to intersect the target range (for example, if the target range is {0..MAXINT} or similar) then of course it takes O(n) to return the n ranges.

But isn't the interesting and typical/average case, where only a very small % of the n total ranges do intersect the target range? Call the number that do intersect "m" -- in that case, you might conceivably be able to do as well as O(m). And if n=10^9 and m=10, that's a make-or-break difference.

Consider the simple case of a text document that has various regions marked up for their "type" -- perhaps you want to find all the marked-up units that contain or intersect a given contiguous range of text (for example, a paragraph). In HTML, XML, or similar those can only be ancestors of the text-node(s) containing at least some characters of the target range. In typical representations with parent pointers in each node, that's O(m) -- way better than O(n), especially because m is (for short or synchronous target ranges) merely the tree nesting depth, which tends to be even lower than ln(n) because big XML documents in practice get bushier not deeper.

The interesting case is harder: what if your "elements" do not form a tree as in XML, but can overlap as in MECS, CLIX, LMNL, and some other systems? You still want to find all the regions/"elements" that overlap your target, but they aren't so easily organized.

On the other hand, you should be able to do very well because marked-up ranges in many applications are most frequently small -- there are far more words, sentences, and paragraphs in a book, than there are chapters. So even though there may be a huge number of ranges that start before the target and a huge number that end after it, the intersection will be very small on average.

I think that's what the original questioner was getting at, and I'm afraid i didn't see an answer that addresses that problem. If it's not what the original question was about, then I'd like to pose it as a new question.



回答7:

Sounds like you need a class that implements the SortedSet interface. TreeSet is the implementation that ships with the core API.

Have one set holding the ranges sorted by lowest value, and one sorted by highest value.

You can then implement the equivalent of the database algorithm using the in-memory sets.

As for whether this is actually faster than O(n), I couldn't say.



回答8:

Just as a quad tree works for a set of 2d points, a simple binary tree should work for this case. Build a tree with your ranges.

To explain further: Each node in the tree contains two integers, the beginning and end of the range, and the two children if it's not a leaf node. To find the ranges that your input range spans, then starting at the top of the tree

  - if the node range intersects the input range:
     - if it's a leaf node, then add the range to your result list
     - if it's not a leaf node, then traverse down to the child nodes and repeat this process.

It should be O(logN)

Further detail: The binary tree would be structured like a 1-d version of a quad tree. Each node would have three integers (sorry I said two above, but now I realize you need three), the lowest representing the lowest value of the lowest range that's below this node, the highest representing the highest value of the highest range that's below this node, and the pivot. The left child would span from this node's lowest to its pivot. The right child would span from this node's pivot to this node's highest. If there is only one range that goes from "lowest" to "highest", you wouldn't have a pivot and this would be a leaf. Ideally you'd pick the pivots for each node to keep the tree balanced.



回答9:

When I had this problem, I used a sorted array of ranges and a binary search to look for intersections. This is (I believe) O(log n) performance, with a little bit of overhead to deal with overlapping ranges.

The answer to your question is, I think, derivable from the code below, but stopping short of the insertion. I present the entire code to avoid confusion by the differing context - I needed to insert a range of Unicode codepoints into a list of codepoint ranges.

-- EDIT --

Adapting the code below to determine intersections of multiple ranges involves a trivial forward search from the insertion point until a range is found which no longer intersects.

-- END EDIT --

The Range class contains:

final int                               lower;                                  // lower end of range
final int                               upper;                                  // upper end of range

public int compareTo(Object obj) {
    if(obj==null) { return -1; }

    Range                           oth=(Range)obj;

    if(lower<oth.lower) { return -1; }
    if(lower>oth.lower) { return  1; }
    if(upper<oth.upper) { return -1; }
    if(upper>oth.upper) { return  1; }
    return 0;
    }

Range Insertion:

public Builder addRange(int fir, int las) {
    if(fir!=-1) { fir&=0x001FFFFF; }
    if(las!=-1) { las&=0x001FFFFF; }

    if(codepoints==null || codepoints.length==0) {
        codepoints=new Range[]{new Range(fir,las)};
        }
    else {
        int                         idx=Range.findChar(codepoints,fir);
        int                         ins=(idx<0 ? -(idx+1) : idx);

        if(idx<0) {
            if     (ins>0                 && fir==(codepoints[ins-1].upper+1)) { idx=(ins-1); }  // new range adjoins the following range (can't overlap or idx would be >=0)
            else if(ins<codepoints.length && las>=(codepoints[ins  ].lower-1)) { idx=ins;     }  // new range overlaps or adjoins the following range
            }

        if(idx<0) {
            codepoints=(Range[])Util.arrayInsert(codepoints,ins,new Range(fir,las));
            }
        else {
            boolean                 rmv=false;

            for(int xa=(idx+1); xa<codepoints.length && codepoints[xa].lower<=las; xa++) {
                if(las<codepoints[xa].upper) { las=codepoints[xa].upper; }
                codepoints[xa]=null;
                rmv=true;
                }
            if(codepoints[idx].lower>fir || codepoints[idx].upper<las) {
                codepoints[idx]=new Range((codepoints[idx].lower < fir ? codepoints[idx].lower : fir),(codepoints[idx].upper>las ? codepoints[idx].upper : las));
                }
            if(rmv) { codepoints=Range.removeNulls(codepoints); }
            }
        }
    return this;
    }

Binary Search:

static int findChar(Range[] arr, int val) {
    if(arr.length==1) {
        if     (val< arr[0].lower) { return -1; }                             // value too low
        else if(val<=arr[0].upper) { return  0; }                             // value found
        else                       { return -2; }                             // value too high
        }
    else {
        int                             lowidx=0;                               // low index
        int                             hghidx=(arr.length-1);                  // high index
        int                             mididx;                                 // middle index
        Range                           midval;                                 // middle value

        while(lowidx<=hghidx) {
            mididx=((lowidx+hghidx)>>>1);
            midval=arr[mididx];
            if     (val< midval.lower) { hghidx=(mididx-1); }                   // value too low
            else if(val<=midval.upper) { return mididx;     }                   // value found
            else                       { lowidx=(mididx+1); }                   // value too high
            }
        return -(lowidx+1);                                                     // value not found.
        }
    }