I have implemented PARLANSE, a language under MS Windows that uses cactus stacks to implement parallel programs. The stack chunks are allocated on a per-function basis and are just the right size to handle local variables, expression temp pushes/pops, and calls to libraries (including stack space for the library routines to work in). Such stack frames can be as small as 32 bytes in practice and often are.
This all works great unless the code does something stupid and causes a hardware trap... at which point Windows appears to insist on pushing the entire x86 machine context "on the stack". This is some 500+ bytes if you include the FP/MMX/etc. registers, which it does. Naturally, a 500 byte push on a 32 byte stack smashes things it should not. (The hardware pushes a few words on a trap, but not the entire context).
[EDIT 11/27/2012: See this for measured details on the rediculous amount of stack Windows actually pushes]
Can I get Windows to store the exception context block someplace else (e.g., to a location specific to a thread)? Then the software could take the exception hit on the thread and process it without overflowing my small stack frames.
I don't think this is possible, but I thought I'd ask a much larger audience. Is there an OS standard call/interface that can cause this to happen?
It would be trivial to do in the OS, if I could con MS into letting my process optionally define a context storage location, "contextp", which is initialized to enable the current legacy behavior by default. Then replacing the interrrupt/trap vector codee:
hardwareint: push context
mov contextp, esp
... with ...
hardwareint: mov <somereg> contextp
test <somereg>
jnz $2
push context
mov contextp, esp
jmp $1
$2: store context @ somereg
$1: equ *
with the obvious changes required to save somereg, etc.
[What I do now is: check the generated code for each function. If it has a chance of generating a trap (e.g., divide by zero), or we are debugging (possible bad pointer deref, etc.), add enough space to the stack frame for the FP context. Stack frames now end up being ~~ 500-1000 bytes in size, programs can't recurse as far, which is sometimes a real problem for the applicaitons we are writing. So we have a workable solution, but it complicates debugging]
EDIT Aug 25: I've managed to get this story to a Microsoft internal engineer who has the authority apparantly to find out who in MS might actually care. There might be faint hope for a solution.
EDIT Sept 14: MS Kernal Group Architect has heard the story and is sympathetic. He said MS will consider a solution (like the one proposed) but unlikely to be in a service pack. Might have to wait for next version of Windows. (Sigh...I might grow old...)
EDIT: Sept 13, 2010 (1 year later). No action on Microsoft's part. My latest nightmare: does taking a trap running a 32 bit process on Windows X64, push the entire X64 context on the stack before the interrupt handler fakes pushing a 32 bit context? That'd be even larger (twice as many integer registers twice as wide, twice as many SSE registers(?))?
EDIT: February 25, 2012: (1.5 years have gone by...) No reaction on Microsoft's part. I guess they just don't care about my kind of parallelism. I think this is a disservice to the community; the "big stack model" used by MS under normal circumstance limits the amount of parallel computations one can have alive at any one instant by eating vast amounts of VM. The PARLANSE model will let one have an application with a million live "grains" in various states of running/waiting; this really occurs in some of our applications where a 100 million node graph is processed "in parallel". The PARLANSE scheme can do this with about 1Gb of RAM, which is pretty manageable. If you tried that with MS 1Mb "big stacks" you'd need 10^12 bytes of VM just for the stack space and I'm pretty sure Windows won't let you manage a million threads.
EDIT: April 29, 2014: (4 years have gone by). I guess MS just doesn't read SO. I've done enough engineering on PARLANSE so we only pay the price of large stack frames during debugging or when there are FP operations going on, so we've managed to find very practical ways to live with this. MS has continued to disappoint; the amount of stuff pushed on the stack by various versions of Windows seems to vary considerably and egregiously above and beyond the need for just the hardware context. There's some hint that some of this variability is caused by non-MS products sticking (e.g. antivirus) sticking their nose in the exception handling chain; why can't they do that from outside my address space? Any, we handle all this by simply adding a large slop factor for FP/debug traps, and waiting for the inevitable MS system in the field that exceeds that amount.