I was looking at the slave/pool modules and it seems similar to what I
want, but it also seems like I have a single point of failure in my
application (if the master node goes down).
The client has a list of gateways (for the sake of fallback - all do
the same thing) which accept connections, and one is chosen at
random by the client. When the client connects all nodes are
examined to see which has the least load and then the IP of the
least-loaded server is forwarded back to the client. The client then
connects to this server and everything is executed there.
In summary, I want all nodes both to act as gateways and to actually
process client requests. The load balancing is only done when the
client initially connects - all of the actual packets are processed on
the client's "home" node.
How would I do this?
I don't know whether such a module has been implemented yet, but what I can say is that load balancing is overrated. I would argue that random placement of jobs is your best bet unless you know far more about how load will arrive in the future, and in most cases you really don't. You wrote:
When the client connects all nodes are examined to see which has the least load and then the IP of the least-loaded server is forwarded back to the client.
How do you know that the least-loaded node will not be the most heavily loaded one a millisecond later? How do you know that the heavily loaded nodes you left out will not shed their load a millisecond later? You really can't know, except in very rare cases.
Just measure (or compute) each node's performance and make the probability that a node is chosen proportional to it. Choose the node randomly, regardless of current load. Use this as your initial approach. Once it is set up, you can try to come up with a more sophisticated algorithm. I bet it will be very hard work to beat this initial approach. Trust me, very hard.
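A minimal sketch of that capacity-weighted random choice follows. It is written in Python purely for illustration; the same logic ports directly to Erlang. The node names and capacity figures are made up, and `choose_node` is a hypothetical helper, not part of any library.

```python
import random

# Hypothetical relative capacities, e.g. measured throughput per node.
NODE_CAPACITY = {
    "node_a@10.0.0.1": 4.0,
    "node_b@10.0.0.2": 2.0,
    "node_c@10.0.0.3": 1.0,
}

def choose_node(capacities, rng=random):
    """Pick one node at random, with probability proportional to its capacity.

    Current load is deliberately ignored; only the static capacity matters.
    """
    nodes = list(capacities)
    weights = [capacities[n] for n in nodes]
    return rng.choices(nodes, weights=weights, k=1)[0]

if __name__ == "__main__":
    # A gateway would return the chosen node's address to the client.
    print(choose_node(NODE_CAPACITY))
```

With these example weights, node_a should receive roughly 4/7 of new clients, node_b 2/7, and node_c 1/7 over many connections.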
Edit: To be clearer on one subtle detail: I strongly argue that you can't predict future load from current and historical load. What you should use instead is knowledge of the probability distribution of task durations and of how far each running task currently is through its lifetime. That is very hard to achieve.
The purpose of a supervision tree is to manage the processes, not necessarily to forward requests. There is no reason you couldn't use different code to send requests directly to members of the list of available processes. See the pool:get_nodes() or pool:get_node() functions for one way to get those lists.
You can let the pool module handle the management of the processes (restarting, monitoring, and killing processes) and use some other module to transparently redirect requests to the pool of processes. Maybe you were looking for distributed pools though? It'll be hard to get away from the master process in Erlang without going to distributed nodes. The whole running system is pretty much one large supervision tree.
I recently remembered the pg module, which allows you to set up process groups. Messages sent to the group go to every process in the group. It might get you part of the way toward what you want. You would have to write the code to decide which process actually handles the request, but you would get a pool without a master using it.
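One possible way to write that "who handles it" code without a master is to have every group member compute the same deterministic owner for each request. The sketch below is in Python for illustration only and does not use the pg API. `owner_for` and `handle_broadcast` are hypothetical names, and it assumes that every member sees the same membership list and that each request carries a unique ID.

```python
import hashlib

def owner_for(request_id, members):
    """Return the member that should handle request_id.

    Every member runs this on the same sorted member list, so all of them
    agree on the owner without asking a master.
    """
    if not members:
        raise ValueError("process group is empty")
    ordered = sorted(members)
    # Use a stable hash; Python's built-in hash() differs between runs.
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(ordered)
    return ordered[index]

def handle_broadcast(me, request_id, members):
    """Run on each member when a broadcast arrives; True if 'me' should process it."""
    return owner_for(request_id, members) == me

if __name__ == "__main__":
    group = ["worker_1", "worker_2", "worker_3"]
    for member in group:
        print(member, handle_broadcast(member, "req-42", group))
```

Exactly one member claims each request as long as the membership views agree. While members are joining or leaving, the views can differ briefly, so a real system would need to tolerate a request being claimed twice or not at all.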