I need to assign N entities (each with possible parents and possible children) to M computation nodes while satisfying the following optimization conditions:
- Children of an entity want to be assigned to the same computation node (to maximize data locality among siblings)
- The distribution of entities should be as even as possible (i.e. no overtaxing of a single node).
I'm looking for some suggestions on heuristic methods to solve this problem.
I've read http://en.wikipedia.org/wiki/Assignment%5Fproblem.
Thanks.
I'm not sure whether 1 is a hard requirement. If so, as a first step, you should group your entities into connected components. If not, you should specify what the tradeoff between 1 and 2 is, e.g. as a cost function.
Placing the components on computation nodes is the binpacking problem, if you limit each node to N/M entities. A good approximation is the following algorithm:
Well, obviously you have to predict how many children (or simply total load) each process will have on average. Than you can use classic assignment algorithms, they're usually quite simple.
The most important issue is of course to decide what you want to minimize. Usually we want to minimize lateness (how much "behind schedule" we get), but not always...
Edit: If you know all children/parents in advance, and children of a process must be in the same machine, you can consider a process and all its children to be the same process to begin with. Then you can use a very simple algorithm to minimize whatever you want to minimize.
Edit #2: Read about Job Scheduling to get a better idea