I recently benchmarked the .NET 4 garbage collector by allocating intensively from several threads. When the allocated values were recorded in an array, I observed no scalability, just as I had expected (because the system contends for synchronized access to a shared old generation). However, when the allocated values were immediately discarded, I was horrified to observe no scalability then either!
I had expected the temporary case to scale almost linearly, because each thread should simply wipe its gen0 nursery clean and start again without contending for any shared resources (nothing survives to older generations, and there are no L2 cache misses because gen0 easily fits in the L1 cache).
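Concretely, the temporary-allocation case has roughly the following shape (a minimal sketch, not my exact harness; the allocation size, iteration count and thread counts are placeholders):

    using System;
    using System.Diagnostics;
    using System.Threading;

    class GcScalingSketch
    {
        // Each thread allocates small objects and immediately drops them,
        // so nothing should survive gen0.
        static void AllocateTemporaries(int iterations)
        {
            for (int i = 0; i < iterations; i++)
            {
                var tmp = new byte[32]; // small, short-lived allocation
                tmp[0] = 1;             // touch the buffer so it is not trivially dead code
            }
        }

        static void Main()
        {
            const int iterations = 10000000; // placeholder
            foreach (int threadCount in new[] { 1, 2, 4, 8 })
            {
                var sw = Stopwatch.StartNew();
                var workers = new Thread[threadCount];
                for (int t = 0; t < threadCount; t++)
                {
                    workers[t] = new Thread(() => AllocateTemporaries(iterations));
                    workers[t].Start();
                }
                foreach (var w in workers) w.Join();
                Console.WriteLine("{0} thread(s): {1} ms", threadCount, sw.ElapsedMilliseconds);
            }
        }
    }

Since every thread does the same fixed amount of work, perfect scaling would mean the elapsed time stays roughly flat as the thread count grows, which is not what I see.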
For example, this MSDN article says:
Synchronization-free Allocations: On a multiprocessor system, generation 0 of the managed heap is split into multiple memory arenas, using one arena per thread. This allows multiple threads to make allocations simultaneously, so that exclusive access to the heap is not required.
Can anyone verify my findings and/or explain this discrepancy between my predictions and observations?
Not a complete answer to the question, but just to clear up some misconceptions: the .NET GC is only concurrent in workstation mode. In server mode, it uses stop-the-world parallel GC. More details here. The separate nurseries in .NET are primarily to avoid synchronisation on allocation; they are nevertheless part of the global heap and cannot be collected separately.
I'm not sure exactly what you measured on your machine. There are, however, two distinct versions of the CLR on your machine: mscorwks.dll and mscorsvr.dll. The former is the one you get when you run your program on a workstation, the latter on one of the server versions of Windows (like Windows Server 2003 or 2008).
The workstation version is kind to your local PC: it doesn't gobble up all the machine's resources. You can still read your email while a GC is going on. The server version is optimized to scale on server-level hardware: lots of RAM (the GC doesn't kick in as quickly) and lots of CPU cores (garbage gets collected on more than one core). Your quoted article is probably talking about the server version.
You can select the server version on your workstation by using the <gcServer> element in your .config file.
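A minimal sketch of what that looks like in an app.config (the element sits under <runtime>; everything else is left at its defaults):

    <configuration>
      <runtime>
        <!-- opt in to the server GC on a workstation OS -->
        <gcServer enabled="true"/>
      </runtime>
    </configuration>

On later versions of the framework, System.Runtime.GCSettings.IsServerGC will tell you at run time which flavour you actually ended up with.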
I can hazard a couple of guesses as to what is happening.
(1) If you have a single thread and there are M bytes free in generation 0, the GC will only run once M bytes have been allocated.
(2) If you have N threads and the GC divides generation 0 into M/N space per thread, the GC will end up running every time a thread allocates M/N bytes. The showstopper here is that the GC needs to "stop the world" (i.e., suspend all running threads) in order to mark references from the threads' root sets, and that is not cheap. So not only will the GC run more often, it will also do more work on each collection (one way to check the collection frequency is sketched below).
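A rough way to test guess (2), assuming an allocation loop like the one in the question: compare GC.CollectionCount(0) before and after the run and see whether the gen0 collection count climbs with the thread count (thread and iteration counts here are placeholders):

    using System;
    using System.Threading;

    class CollectionCountSketch
    {
        // Same kind of churn as the benchmark: short-lived allocations only.
        static void Churn(int iterations)
        {
            for (int i = 0; i < iterations; i++)
            {
                var tmp = new byte[32];
                tmp[0] = 1;
            }
        }

        static void Main()
        {
            const int iterations = 10000000; // placeholder
            foreach (int threadCount in new[] { 1, 2, 4, 8 })
            {
                int before = GC.CollectionCount(0); // gen0 collections so far
                var workers = new Thread[threadCount];
                for (int t = 0; t < threadCount; t++)
                {
                    workers[t] = new Thread(() => Churn(iterations));
                    workers[t].Start();
                }
                foreach (var w in workers) w.Join();
                Console.WriteLine("{0} thread(s): {1} gen0 collections",
                                  threadCount, GC.CollectionCount(0) - before);
            }
        }
    }

If the number of collections per unit of allocation grows with the thread count, the smaller per-thread budget (and the stop-the-world pause that comes with each collection) is a plausible culprit.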
The other problem, of course, is that multi-threaded applications aren't typically very cache friendly, which can also put a significant dent in your performance.
I don't think this is a .NET GC issue; rather, it's an issue with GC in general. A colleague once ran a simple "ping pong" benchmark, sending integer messages between two threads using SOAP. The benchmark ran twice as fast when the two threads were in separate processes, because memory allocation and management were completely decoupled!
Very quick, easy-to-spot releases (objects referenced straight from a root and then assigned null), done en masse, can trick the GC into being eager, and then the whole idea of a cache-local heap is a nice dream :-) Even if you had fully separated thread-local heaps (which you don't), the handle-pointer table would still have to be fully volatile just to be safe in general multi-CPU scenarios. Oh, and remember that there are many threads, the CPU cache is shared, and kernel needs take precedence, so it's not all just for you :-)
Also beware that a "heap" with double pointers has two parts - the block of memory to hand out and the handle-pointer table (so that blocks can be moved while your code always sees one address). That table is a critical but very lean process-level resource, and just about the only way to stress it is to flood it with massive, quick releases - which is exactly what you managed to do :-))
In general the rule of GC is - leak :-) Not forever, of course, but kind of for as long as you can. Remember how people go around telling you "don't force GC collections"? That's part of the story. Also, the "stop the world" collection is actually much more efficient than the "concurrent" one, and used to be known by the nicer name of cycle stealing or scheduler cooperation. Only the mark phase needs to freeze the scheduler, and on a server there's a burst of several threads doing it (N cores are idle anyway :-) The only reason for the concurrent one is that stop-the-world pauses can make real-time operations like playing video jittery, just as a longer thread quantum does.
So again, if you go competing with the infrastructure on short and frequent CPU bursts (small alloc, almost no work, quick release), the only thing you'll see/measure will be the GC and JIT noise.
If this were for something real, i.e. not just experimenting, the best thing you can do is use big value arrays on the stack (structs). They can't be forced onto the heap, are as local as a local can get, and are not subject to any backdoor moving => the cache has to love them :-) That may mean switching to "unsafe" mode, using plain pointers, and maybe doing a bit of allocation on your own (if you need something simple like lists), but that's a small price to pay for kicking the GC out :-) (See the sketch below.) Trying to force data into a cache also depends on keeping your stacks otherwise lean - remember that you are not alone. Also, giving your threads some work that's worth at least several quantums between releases may help. The worst-case scenario would be if you allocate and release within a single quantum.
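A hedged sketch of what that can look like in practice: a stack-allocated scratch buffer via stackalloc, which needs an unsafe context (compile with /unsafe); the buffer size and the work done on it are just placeholders:

    using System;

    class StackBufferSketch
    {
        // Work on a stack-allocated buffer: no heap allocation, nothing for
        // the GC to track or move, and the data stays on this thread's stack.
        static unsafe long SumScratch(int seed)
        {
            const int Size = 1024;          // keep it modest - stacks should stay lean
            int* buffer = stackalloc int[Size];

            for (int i = 0; i < Size; i++)
                buffer[i] = seed + i;

            long sum = 0;
            for (int i = 0; i < Size; i++)
                sum += buffer[i];
            return sum;
        }

        static void Main()
        {
            Console.WriteLine(SumScratch(42));
        }
    }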
"...or explain this discrepancy between my predictions and observations?"
Benchmarking is hard.
Benchmarking a subsystem that is not under your full control is even harder.