Reducing memory footprint with multiprocessing?

One of my applications runs about 100 workers. It started out as a threaded application, but I ran into performance (latency) problems, so I converted the workers to multiprocessing.Process instances. The benchmark below shows that the load dropped, but at the cost of roughly six times the memory usage.

So where exactly does the memory usage come from, given that Linux uses copy-on-write (COW) on fork and the workers do not share any data?

How can I reduce the memory footprint? (Alternative question: How can I reduce the load for threading?)

Benchmarks on Linux 2.6.26, 4 CPUs, 2 GB RAM. (Note that CPU usage is given in % of one CPU, so full load is 400%. The numbers are read off Munin graphs.)

                  | threading | multiprocessing
------------------+-----------+----------------
memory usage      | ~0.25GB   | ~1.5GB
context switches  | ~1.5e4/s  | ~5e2/s
system cpu usage  | ~30%      | ~3%
total cpu usage   | ~100%     | ~50%
load avg          | ~1.5      | ~0.7

Background: The application is processing events from the network and storing some of them in a MySQL database.

My understanding is that with dynamic languages such as Python, copy-on-write is less effective because much more memory gets written to (and therefore copied) after forking. As the Python interpreter progresses through the program, there is a lot more going on than just your code. Take reference counting: nearly every object that gets used will be written to pretty quickly, because the interpreter has to update the reference count stored in the object itself, and that write triggers a copy of the memory page.
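To make the reference-counting point concrete, here is a minimal sketch that assumes CPython. The helper name refcount_delta is made up for illustration. It shows that merely binding another name to an object changes the count stored in that object, which is a memory write even though the code never modifies the object's contents.

```python
import sys


def refcount_delta(obj):
    """Return how much obj's reference count grows when one more name is bound to it.

    Hypothetical helper for illustration only; it relies on CPython's
    sys.getrefcount and on obj being an ordinary (non-immortal) object.
    """
    before = sys.getrefcount(obj)
    alias = obj  # no data is modified, but the count in the object header is
    after = sys.getrefcount(obj)
    del alias
    return after - before


events = ["connect", "data", "close"]  # stand-in for data created before fork()
print("refcount change from one extra reference:", refcount_delta(events))
```

In a forked child, that counter update lands on a page inherited from the parent, so the kernel has to give the child its own copy of the page.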

With that in mind, you probably need a hybrid threading/multiprocessing approach. Run multiple processes to take advantage of multiple cores, but have each one run multiple threads (so you can handle the level of concurrency you need). You'll need to experiment with the ratio of threads to processes.
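A rough sketch of that hybrid layout follows, under some assumptions. The names split_workers, handle_events, run_thread_group and main are hypothetical, and handle_events is an empty placeholder for whatever each worker does in the real application. The defaults of 100 workers and 4 processes only mirror the numbers in the question and are a starting point for experiments, not a recommendation.

```python
import multiprocessing
import threading


def split_workers(total_workers, n_processes):
    """Spread total_workers as evenly as possible over n_processes."""
    base, extra = divmod(total_workers, n_processes)
    return [base + (1 if i < extra else 0) for i in range(n_processes)]


def handle_events(worker_id):
    """Hypothetical placeholder: the real worker would process network
    events and store some of them in the database."""
    pass


def run_thread_group(first_id, n_threads):
    """Runs inside one process: start n_threads worker threads and wait for them."""
    threads = [
        threading.Thread(target=handle_events, args=(first_id + i,))
        for i in range(n_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


def main(total_workers=100, n_processes=4):
    """Start n_processes processes, each hosting its share of the worker threads."""
    processes = []
    first_id = 0
    for n_threads in split_workers(total_workers, n_processes):
        p = multiprocessing.Process(
            target=run_thread_group, args=(first_id, n_threads)
        )
        p.start()
        processes.append(p)
        first_id += n_threads
    for p in processes:
        p.join()


if __name__ == "__main__":
    main()
```

Changing n_processes moves the trade-off: fewer processes means fewer copies of the interpreter state in memory, while more processes means fewer threads competing within each interpreter.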