The Problem
I'm working on a problem that involves sharding. As part of the problem I need to find the fastest way to partition a large Ruby hash (> 200,0000 entries) in two or more pieces.
Are there any non O(n) approaches?
Is there a non-Ruby i.e. C/C++ implementation?
Please don't reply with examples using the trivial approach of converting the hash to an array and rebuilding N distinct hashes.
My concern is that Ruby is too slow to do this kind of work.
The initial approach
This was the first solution I tried. What was appealing about it was:
- it didn't need to loop slavishly across the hash
- it didn't need to manage a counter to allocate the members evenly among the shards.
- it's short and neat looking
Ok, it isn't O(n) but it relies on methods in the standard library which I figured would be faster than writing my own Ruby code.
pivot = s.size / 2
slices = s.each_slice(pivot)
s1 = Hash[*slices.entries[0].flatten]
s2 = Hash[*slices.entries[1].flatten]
A better solution
Mark and Mike were kind enough to suggest approaches. I have to admit that Mark's approach felt wrong - it did exactly what I didn't want - it looped over all of the members of the has and evaluated a conditional as it went - but since he'd taken the time to do the evaluation, I figured that I should try a similar approach and benchmark that. This is my adapted version of his approach (My keys aren't numbers so I can't take his approach verbatim)
def split_shard(s)
shard1 = {}
shard2 = {}
t = Benchmark.measure do
n = 0
pivot = s.size / 2
s.each_pair do |k,v|
if n < pivot
shard1[k] = v
else
shard2[k] = v
end
n += 1
end
end
$b += t.real
$e += s.size
return shard1, shard2
end
The results
In both cases, a large number of hashes are split into shards. The total number of elements across all of the hashes in the test data set was 1,680,324.
My initial solution - which had to be faster because it uses methods in the standard library and minimises the amount of Ruby code (no loop, no conditional) - runs in just over 9s
Mark's approach runs in just over 5s
That's a significant win
Take away
Don't be fooled by 'intuition' - measure the performance of competing algorithm
Don't worry about Ruby's performance as a language - my initial concern is that if I'm doing ten million of these operations, it could take a significant amount of time in Ruby but it doesn't really.
Thanks to Mark and Mike who both get points from me for their help.
Thanks!
I don't see how you can achieve this using an unmodified "vanilla" Hash - I'd expect that you'd need to get into the internals in order to make partitioning into some kind of bulk memory-copying operation. How good is your C?
I'd be more inclined to look into partitioning instead of creating a Hash in the first place, especially if the only reason for the 200K-item Hash existing in the first place is to be subdivided.
EDIT: After thinking about it at the gym...
The problem with finding some existing solution is that someone else needs to have (a) experienced the pain, (b) had the technical ability to address it and (c) felt community-friendly enough to have released it into the wild. Oh, and for your OS platform.
What about using a B-Tree instead of a Hash? Hold your data sorted by key and it can be traversed by memcpy(). B-Tree retrieval is O(log N), which isn't much of a hit against Hash most of the time.
I found something here which might help, and I'd expect there'd only be a little duck-typing wrapper needed to make it quack like a Hash.
Still gonna need those C/C++ skills, though. (Mine are hopelessly rusty).
This probably isn't fast enough for your needs (which sound like they'll require an extension in C), but perhaps you could use Hash#select?
I agree with Mike Woodhouse's idea. Is it possible for you to construct your shards at the same place where the original 200k-item hash is being constructed? If the items are coming out of a database, you could split your query into multiple disjoint queries, based either on some aspect of the key or by repeatedly using something like LIMIT 10000 to grab a chunk at a time.
Additional
Hi Chris, I just compared your approach to using Hash#select:
require 'benchmark'
It looks like Hash#select is much faster, even though it goes through the entire large hash for each one of the sub-hashes:
Hope this helps.