Data structure for range query

Posted 2020-05-23 15:53

I was recently asked a coding question on the problem below. I have some solutions, but I am not sure they are the most efficient.


Problem:

Write a program to track a set of text ranges. The start point and end point of each range are strings.

Text range example : [AbA-Ef]
 Aa would fall before this range
 AB would fall inside this range
 etc.

String comparison would be like 'A' < 'a' < 'B' < 'b' ... 'Z' < 'z'
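
The ordering above can be expressed as a sort key. What follows is only a minimal sketch in Python: it assumes the strings contain only ASCII letters and that a string sorts before any longer string it is a prefix of (the problem statement specifies neither), and `range_key` is a hypothetical name.

```python
def range_key(s):
    """Map a string to a tuple that sorts as 'A' < 'a' < 'B' < 'b' < ... < 'Z' < 'z'."""
    key = []
    for ch in s:
        if not (ch.isascii() and ch.isalpha()):
            raise ValueError(f"unsupported character: {ch!r}")
        # The letter decides first; within one letter, upper case sorts before lower case.
        key.append(2 * (ord(ch.lower()) - ord("a")) + (1 if ch.islower() else 0))
    return tuple(key)


words = ["b", "Ef", "AbA", "a", "Aa", "B"]
print(sorted(words, key=range_key))  # ['Aa', 'AbA', 'a', 'B', 'b', 'Ef']
```

Any sorted container can then store `range_key(s)` instead of `s`, so the ordinary `<` comparison follows the custom order.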

We need to support the following operations on these ranges:

  • Add range - this should merge ranges where applicable
  • Delete range - this deletes a range from the tracked ranges and recomputes the ranges
  • Query range - given a character, the function should return whether or not it is part of any tracked range.

Note that the tracked ranges can be discontinuous.


My solutions:

I came up with two approaches.

  1. Store the ranges as a doubly linked list, or
  2. Store the ranges in some sort of balanced tree whose leaf nodes hold the actual data and are connected to each other as a linked list.

Do you think these solutions are good enough, or can you think of a better way of doing this so that all three APIs perform as well as possible?

4 answers
一纸荒年 Trace。
#2 · 2020-05-23 16:34

I think you should go for a B+ tree, which is the same structure you describe in your second approach.

Here are some properties of B+ tree:

  1. All data is stored in the leaf nodes.
  2. Every leaf is at the same level.
  3. All leaf nodes have links to other leaf nodes.

Here are a few advantages and applications of a B+ tree:

  1. It reduces the number of I/O operations required to find an element in the tree.
  2. Often used in the implementation of database indexes.
  3. The primary value of a B+ tree is in storing data for efficient retrieval in a block-oriented storage context — in particular, file systems.
  4. NTFS uses B+ trees for directory indexing.

Basically, it helps with range-query lookups and minimizes tree traversal.

太酷不给撩
#3 · 2020-05-23 16:37

I'm not clear on what the "delete range" operation is supposed to do. Does it:

  • Delete a previously inserted range, and recompute the merge of the remaining ranges?

  • Stop tracking the deleted range, regardless of how many times parts of it have been added?

That doesn't make a huge difference algorithmically; it's just bookkeeping. But it's important to clarify. Also, are the ranges closed or half-open? (Another detail which doesn't affect the algorithm but does affect the implementation).

The basic approach to this problem is to merge the tracked set into a sorted list of disjoint (non-overlapping) ranges; either as a vector or a binary search tree, or basically any structure which supports O(log n) searching.

One approach is to put both endpoints of every disjoint range into the data structure. To find out if a target value is in a range, find the index of the smallest endpoint greater than the target. If the index (counting from zero) is odd, the target is in some range; even means it's outside.
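
The parity check can be sketched with a sorted list and binary search. This is a minimal sketch, not a definitive implementation: it assumes half-open ranges [start, end), zero-based indexing, and integer endpoints for brevity (string endpoints would first be mapped to keys that follow the custom ordering). `contains_by_parity` is a hypothetical name.

```python
from bisect import bisect_right


def contains_by_parity(endpoints, target):
    """endpoints is a flat sorted list [s0, e0, s1, e1, ...] of disjoint ranges [s, e)."""
    # Index of the smallest endpoint strictly greater than target.
    i = bisect_right(endpoints, target)
    # Odd index: that endpoint is an end point, so target lies inside a range.
    return i % 2 == 1


endpoints = [10, 20, 30, 40]  # tracks [10, 20) and [30, 40)
print(contains_by_parity(endpoints, 10))  # True
print(contains_by_parity(endpoints, 20))  # False
print(contains_by_parity(endpoints, 35))  # True
```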

Alternatively, index all the disjoint ranges by their start points; find the target by searching for the largest start-point not greater than the target, and then compare the target with the associated end-point.
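
A minimal sketch of this start-point lookup, again assuming half-open ranges and integer endpoints for brevity. Two parallel sorted lists stand in for the search tree here, and `contains_by_start` is a hypothetical name.

```python
from bisect import bisect_right


def contains_by_start(starts, ends, target):
    """starts[i] and ends[i] describe the i-th disjoint range [start, end), sorted by start."""
    # Position of the largest start point that is not greater than target.
    i = bisect_right(starts, target) - 1
    # No such start point means target precedes every range.
    return i >= 0 and target < ends[i]


starts, ends = [10, 30], [20, 40]  # tracks [10, 20) and [30, 40)
print(contains_by_start(starts, ends, 15))  # True
print(contains_by_start(starts, ends, 25))  # False
```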

I usually use the first approach with sorted vectors, which are plausible if (a) space utilization is important and (b) insert and merge are relatively rare. With binary search trees, I go for the second approach. But they differ only in details and constants.

Merging and deleting are not difficult, but there are an annoying number of cases. You start by finding the ranges corresponding to the endpoints of the range to be inserted/deleted (using the standard find operation), remove all the ranges in between the two, and fiddle with the endpoints to correct the partially overlapping ranges. While the find operation is always O(log n), the tree/vector manipulation is O(n) in the worst case (if the inserted/deleted range is large, anyway).
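
The find, remove and fix-up steps might look like the sketch below on sorted vectors. It makes several assumptions: ranges are half-open, touching ranges are merged on insert, "delete" means "stop tracking this span" (the second reading above), and endpoints are integers for brevity. `RangeSet` and its method names are hypothetical.

```python
from bisect import bisect_left, bisect_right


class RangeSet:
    """Sorted, disjoint, half-open ranges [start, end) kept in two parallel lists."""

    def __init__(self):
        self.starts = []
        self.ends = []

    def add(self, start, end):
        if start >= end:
            return
        lo = bisect_left(self.ends, start)    # first range whose end >= start
        hi = bisect_right(self.starts, end)   # one past the last range whose start <= end
        if lo < hi:
            # Ranges lo..hi-1 overlap or touch the new one: absorb their outer endpoints.
            start = min(start, self.starts[lo])
            end = max(end, self.ends[hi - 1])
        self.starts[lo:hi] = [start]
        self.ends[lo:hi] = [end]

    def delete(self, start, end):
        if start >= end:
            return
        lo = bisect_right(self.ends, start)   # first range whose end > start
        hi = bisect_left(self.starts, end)    # one past the last range whose start < end
        if lo >= hi:
            return                            # nothing tracked inside [start, end)
        keep_starts, keep_ends = [], []
        if self.starts[lo] < start:           # left part of the first range survives
            keep_starts.append(self.starts[lo])
            keep_ends.append(start)
        if self.ends[hi - 1] > end:           # right part of the last range survives
            keep_starts.append(end)
            keep_ends.append(self.ends[hi - 1])
        self.starts[lo:hi] = keep_starts
        self.ends[lo:hi] = keep_ends

    def contains(self, x):
        i = bisect_right(self.starts, x) - 1  # largest start <= x
        return i >= 0 and x < self.ends[i]


rs = RangeSet()
rs.add(10, 20)
rs.add(30, 40)
rs.add(15, 35)                 # bridges the gap: one range [10, 40)
rs.delete(20, 30)              # splits it again: [10, 20) and [30, 40)
print(rs.starts, rs.ends)      # [10, 30] [20, 40]
```

The slice assignments are where the linear cost comes from: the binary searches are logarithmic, but replacing a slice of a Python list shifts every element after it.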

▲ chillily
#4 · 2020-05-23 16:46

You are probably looking for an interval tree.

Use that data structure with a custom comparator that defines what counts as "in range", and you will be able to do the required operations efficiently.

Note that an interval tree is actually an efficient way to implement your second idea (storing the ranges in some sort of balanced tree).

兄弟一词,经得起流年.
#5 · 2020-05-23 16:50

Most languages, including Java and C++, have some sort of ordered map or ordered set in which you can both look up a value and find the next value after or the first value before a given value. You could use this as a building block. If it contains a set of disjoint ranges, then it will hold the least element of a range, followed by the greatest element of that range, followed by the least element of the next range, followed by its greatest element, and so on. When you add a range, you can check whether you have preserved this property. If not, you need to merge ranges. Similarly, you want to preserve this property when you delete. You can then query by checking whether there is a least element just before your query point and a greatest element just after it.

If you want to create your own data structure from scratch, I would think about some sort of radix trie structure, because this avoids doing lots of repeated string comparisons.
