My question briefly stated: Is there an algorithm one can use to divide key/value pairs into roughly equal-length lists if one doesn't know a priori the number of values that any key contains, and one can't hold all keys (or counts of their values) in RAM concurrently?
My question with context: I have multiple files that contain key/value pairs, where keys are hashes and values are lists of object ids in which the given hash occurs. The same key appears zero or one times in each of these files, and frequently a given key appears in many of the files.
I am reading those files into several workers running in a compute cluster. Each worker is assigned a subset of the keys. Each worker reads all of the previously mentioned key/value files, accumulates every value that occurs anywhere for each of its assigned keys, and writes a single output file to disk.
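Roughly, each worker does something like the following (a minimal sketch; the tab-separated file format and the `is_assigned_to_me` helper are assumptions for illustration, not my actual code):

```python
from collections import defaultdict

def accumulate_values(kv_file_paths, is_assigned_to_me):
    """Collect every value seen for each key this worker owns."""
    values_by_key = defaultdict(list)  # RAM use grows with this worker's share of values
    for path in kv_file_paths:
        with open(path) as f:
            for line in f:
                key, _, ids = line.rstrip("\n").partition("\t")
                if is_assigned_to_me(key):
                    values_by_key[key].extend(ids.split(","))
    return values_by_key
```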
The trouble I'm facing is that the workers are accumulating wildly different numbers of values among their assigned keys, so their RAM requirements are quite different (from 33GB on the low end to 139GB on the high). Right now, to assign keys to workers, I take the sha1 hash of each key and assign the key to the worker whose id satisfies sha1(key) % total_number_of_workers == worker_id (where worker_id is a given worker's index position among all workers).
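In code, the current assignment looks roughly like this (a sketch of the scheme described above; exact details of my implementation may differ):

```python
import hashlib

def assigned_worker(key, total_number_of_workers):
    # Interpret the sha1 digest of the key as a big integer, then take it
    # modulo the number of workers; the result is the owning worker's id.
    digest = hashlib.sha1(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % total_number_of_workers

# A worker keeps a key iff assigned_worker(key, n_workers) == its own worker_id.
```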
Is there a way to assign keys to workers that will help ensure a more equal distribution of RAM requirements among the nodes? Any advice others can offer on this question would be greatly appreciated!
In case it might be of interest to others, I put together a simple Python implementation of the k-way merge that Jim Mischel describes below [gist]. This implementation doesn't require holding all of the text files in memory concurrently, which may be impossible for large datasets.
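For reference, a minimal sketch of the same idea (not the gist itself) can be built on `heapq.merge`, assuming each input file is already sorted by key; only one line per open file has to be in memory at a time:

```python
import heapq

def merge_sorted_files(input_paths, output_path):
    # heapq.merge consumes its input iterators lazily, so the merged output
    # is produced while holding just one pending line per input file.
    files = [open(p) for p in input_paths]
    try:
        with open(output_path, "w") as out:
            for line in heapq.merge(*files):
                out.write(line)
    finally:
        for f in files:
            f.close()
```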