My question briefly stated: Is there an algorithm one can use to divide key/value pairs into roughly equal-length lists if one doesn't know a priori the number of values that any key contains, and one can't hold all keys (or counts of their values) in RAM concurrently?
My question with context: I have multiple files that contain key/value pairs, where keys are hashes and values are lists of object ids in which the given hash occurs. The same key appears zero or one times in each of these files, and frequently a given key appears in many of the files.
I am reading those files into several workers running in a compute cluster. Each worker is assigned a subset of the keys. Each worker reads all of the previously mentioned key/value files, accumulates every value that occurs anywhere for each of its assigned keys, and writes a single output file to disk.
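Roughly, each worker does something like the following (a minimal sketch; the tab-separated file format and the `is_assigned_to_me` helper are assumptions for illustration, not my actual code):

```python
from collections import defaultdict

def accumulate_values(kv_file_paths, is_assigned_to_me):
    """Collect every value seen for each key this worker owns."""
    values_by_key = defaultdict(list)  # RAM use grows with this worker's share of values
    for path in kv_file_paths:
        with open(path) as f:
            for line in f:
                key, _, ids = line.rstrip("\n").partition("\t")
                if is_assigned_to_me(key):
                    values_by_key[key].extend(ids.split(","))
    return values_by_key
```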
The trouble I'm facing is that the workers are accumulating wildly different numbers of values among their assigned keys, so their RAM requirements are quite different (from 33GB on the low end to 139GB on the high). Right now, to assign keys to workers, I take the sha1 hash of each key and assign the key to the worker whose id satisfies sha1(key) % total_number_of_workers == worker_id (where worker_id is a given worker's index position among all workers).
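In code, the current assignment looks roughly like this (a sketch of the scheme described above; exact details of my implementation may differ):

```python
import hashlib

def assigned_worker(key, total_number_of_workers):
    # Interpret the sha1 digest of the key as a big integer, then take it
    # modulo the number of workers; the result is the owning worker's id.
    digest = hashlib.sha1(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % total_number_of_workers

# A worker keeps a key iff assigned_worker(key, n_workers) == its own worker_id.
```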
Is there a way to assign keys to workers that will help ensure a more equal distribution of RAM requirements among the nodes? Any advice others can offer on this question would be greatly appreciated!
In case it might be of interest to others, I put together a simple Python implementation of the k-way merge that Jim Mischel describes below [gist]. This implementation doesn't require holding all of the text files in memory concurrently, which may be impossible for large datasets.
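For reference, a minimal sketch of the same idea (not the gist itself) can be built on `heapq.merge`, assuming each input file is already sorted by key; only one line per open file has to be in memory at a time:

```python
import heapq

def merge_sorted_files(input_paths, output_path):
    # heapq.merge consumes its input iterators lazily, so the merged output
    # is produced while holding just one pending line per input file.
    files = [open(p) for p in input_paths]
    try:
        with open(output_path, "w") as out:
            for line in heapq.merge(*files):
                out.write(line)
    finally:
        for f in files:
            f.close()
```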