I need a way to quickly check if an IP address falls into one of many forbidden IP ranges.
I currently use iptables to check if the IP falls into a specified range. This works fine for several thousand ranges, but that number is about to increase drastically to several hundred thousand, and will continue to grow.
The other issue with my current method of simply adding new rules to iptables is the increasing number of duplicates.
I need an efficient method for checking if an IP or range falls into an existing (larger) range before it's added to the ruleset.
Ruby is the language I am most familiar with, but what data structures would be the best options for the ever growing number of ranges?
One solution I came up with is to use Redis sets or perhaps MongoDB to store individual IPs as integers, then simply check if the IP exists in the set... but my gut tells me there has to be a smarter way.
If I were to convert IPs into integers and store the ranges, what would be the optimal way to run through the ranges to see if a new IP or range might already be encompassed by an existing bigger range?
Final note: speed is more important than memory costs.
Contrary to the previous poster, I don't think you can get O(log n) complexity by using naive indexing. Let's consider mongodb for instance. You could define two indexes (for start and end properties of the ranges), but mongodb will only use one to solve a given query. So it will not work. Now if you use a single compound index involving both start and end properties of the ranges, the complexity will be logarithmic to find the first range to check, but then it will get linear to find the last range matching the query. The worst case complexity is O(n), and you have it when all the stored ranges overlap your input.
On a side note, using Redis sorted set you can emulate a sorted index (with O(log n) complexity) if you know what you do. Redis is a bit more than a simple key-value store.
Redis sorted sets are implemented using a skip list, and both the score and value are used to compare items.
To solve this kind of problem, a dedicated indexing structure is needed. You may want to have a look at:
http://en.wikipedia.org/wiki/Segment_tree
or
http://en.wikipedia.org/wiki/Interval_tree
If the concern is speed over space, then it may be interesting to flatten the index.
For instance, let's consider the following ranges (using only integers to simplify the discussion):
A 2-8
B 4-6
C 2-9
D 7-10
A sparse structure indexing non overlapping segments can be built.
0 []
2 [A C]
4 [A C B]
7 [A C D]
9 [C D]
10 [D]
11 []
Each entry contains the lower bound of a non overlapping segment as the key, and the list or set of matching ranges as a value. Entries should be indexed using a sorted container (tree, skip list, btree, etc ...)
To find the ranges matching 5, we look for the first entry which is lower or equal to 5 (it will be 4 in this example) and the list of ranges is provided ( [A C B] )
With this data structure, the complexity of the queries is really O(log n). However it is not trivial (and expensive) to build and maintain it. It can be implemented with both mongodb and Redis.
Here is an example with Redis:
> rpush range:2 2-8 2-9
(integer) 2
> rpush range:4 2-8 2-9 4-6
(integer) 3
> rpush range:7 2-8 2-9 7-10
(integer) 3
> rpush range:9 2-9 7-10
(integer) 2
> rpush range:10 7-10
(integer) 1
> zadd range_index 0 range:0 2 range:2 4 range:4 7 range:7 9 range:9 10 range:10
(integer) 6
> zrevrangebyscore range_index 5 0 LIMIT 0 1
1) "range:4"
> lrange range:4 0 -1
1) "2-8"
2) "2-9"
3) "4-6"
If you are dealing with ranges, mongodb will be better than redis for that. If you are dealing with specific IP addresses, redis will be the best choice.
Why?
Mongodb can build indexes on the start and end ranges of ip addresses which you can query in O(log n) time whereas redis is simply a key value store.
You could use a redis hash if every single ip that you wanted to check against was in a hash table and look them up in O(1) time but because you are using ranges I would say mongo is your best bet. I don't think you are going to get better than log n time with any of the data structures in the redis toolkit. O(n) with sorted set or list maybe but not O(log n).