Poorly-balanced socket accepts with Linux 3.2 kern

I am running a fairly large-scale Node.js 0.8.8 app using Cluster with 16 worker processes on a 16-processor box with hyperthreading (so 32 logical cores). We are finding that since moving to the Linux 3.2.0 kernel (from 2.6.32), the balancing of incoming requests between worker child processes seems be heavily weighted to 5 or so processes, with the other 11 not doing much work at all. This may be more efficient for throughput, but seems to increase request latency and is not optimal for us because many of these are long-lived websocket connections that can start doing work at the same time.

The child processes are all accepting on a socket (using epoll), and while this problem has a fix in Node 0.9 (https://github.com/bnoordhuis/libuv/commit/be2a2176ce25d6a4190b10acd1de9fd53f7a6275), that fix does not seem to help in our tests. Is anyone aware of kernel tuning parameters or build options that could help, or are we best-off moving back to the 2.6 kernel or load balancing across worker processes using a different approach?

We boiled it down to a simple HTTP Siege test, though note that this is running with 12 procs on a 12-core box with hyperthreading (so 24 logical cores), and with 12 worker processes accepting on the socket, as opposed to our 16 procs in production.

HTTP Siege with Node 0.9.3 on Debian Squeeze with 2.6.32 kernel on bare metal:

Same everything except with the 3.2.0 kernel:

Don't depend on the OS's socket multiple accept to balance load across web server processes.

The Linux kernels behavior differs here from version to version, and we saw a particularly imbalanced behavior with the 3.2 kernel, which appeared to be somewhat more balanced in later versions. e.g. 3.6.

We were operating under the assumption that there should be a way to make Linux do something like round-robin with this, but there were a variety of issues with this, including:

Linux kernel 2.6 showed something like round-robin behavior on bare metal (imbalances were about 3-to-1), Linux kernel 3.2 did not (10-to-1 imbalances), and kernel 3.6.10 seemed okay again. We did not attempt to bisect to the actual change.
Regardless of the kernel version or build options used, the behavior we saw on a 32-logical-core HVM instance on Amazon Web services was severely weighted toward a single process; there may be issues with Xen socket accept: https://serverfault.com/questions/272483/why-is-tcp-accept-performance-so-bad-under-xen

You can see our tests in great detail on the github issue we were using to correspond with the excellent Node.js team, starting about here: https://github.com/joyent/node/issues/3241#issuecomment-11145233

That conversation ends with the Node.js team indicating that they are seriously considering implementing explicit round-robin in Cluster, and starting an issue for that: https://github.com/joyent/node/issues/4435, and with the Trello team (that's us) going to our fallback plan, which was to use a local HAProxy process to proxy across 16 ports on each server machine, with a 2-worker-process Cluster instance running on each port (for fast failover at the accept level in case of process crash or hang). That plan is working beautifully, with greatly reduced variation in request latency and a lower average latency as well.

There is a lot more to be said here, and I did NOT take the step of mailing the Linux kernel mailing list, as it was unclear if this was really a Xen or a Linux kernel issue, or really just an incorrect expectation of multiple accept behavior on our part.

I'd love to see an answer from an expert on multiple accept, but we're going back to what we can build using components that we understand better. If anyone posts a better answer, I would be delighted to accept it instead of mine.