I have some concurrent code which has an intermittent failure and I've reduced the problem down to two cases which seem identical, but where one fails and the other doesn't.
I've now spent way too much time trying to create a minimal, complete example that fails, but without success, so I'm just posting the lines that fail in case anyone can see an obvious problem.
Object lock = new Object();
struct MyValueType { readonly public int i1, i2; };
class Node { public MyValueType x; public int y; public Node z; };
volatile Node[] m_rg = new Node[300];
unsafe void Foo()
{
Node[] temp;
while (true)
{
temp = m_rg;
/* ... */
Monitor.Enter(lock);
if (temp == m_rg)
break;
Monitor.Exit(lock);
}
#if OK // this works:
Node cur = temp[33];
fixed (MyValueType* pe = &cur.x)
*(long*)pe = *(long*)&e;
#else // this reliably causes random corruption:
fixed (MyValueType* pe = &temp[33].x)
*(long*)pe = *(long*)&e;
#endif
Monitor.Exit(lock);
}
I have studied the IL code and it looks like what's happening is that the Node object at array position 33 is moving (in very rare cases) despite the fact that we are holding a pointer to a value type within it.
It's as if the CLR doesn't notice that we are passing through a heap (movable) object--the array element--in order to access the value type. The 'OK' version has never failed under extended testing on an 8-way machine, but the alternate path fails quickly every time.
- Is this never supposed to work, and 'OK' version is too streamlined to fail under stress?
- Do I need to pin the object myself using GCHandle (I notice in the IL that the fixed statement alone is not doing so)?
- If manual pinning is required here, why is the compiler allowing access through a heap object (without pinning) in this way?
note: This question is not discussing the elegance of reinterpreting the blittable value type in a nasty way, so please, no criticism of this aspect of the code unless it is directly relevant to the problem at hand.. thanks
[edit: jitted asm] Thanks to Hans' reply, I understand better why the jitter is placing things on the stack in what otherwise seem like vacuous asm operations. See [rsp + 50h] for example, and how it gets nulled out after the 'fixed' region. The remaining unresolved question is whether [cur+18h] (lines 207-20C) on the stack is somehow sufficient to protect the access to the value type in a way that is not adequate for [temp+33*IntPtr.Size+18h] (line 24A).
[edit]
summary of conclusions, minimal example
Comparing the two code fragments below, I now believe that #1 is not ok, whereas #2 is acceptable.
(1.) The following fails (on x64 jit at least); GC can still move the MyClass instance if you try to fix it in-situ, via an array reference. There's no place on the stack for the reference of the particular object instance (the array element that needs to be fixed) to be published, for the GC to notice.
struct MyValueType { public int foo; };
class MyClass { public MyValueType mvt; };
MyClass[] rgo = new MyClass[2000];
fixed (MyValueType* pvt = &rgo[1234].mvt)
*(int*)pvt = 1234;
(2.) But you can access a structure inside a (movable) object using fixed (without pinning) if you provide an explicit reference on the stack which can be advertised to the GC:
struct MyValueType { public int foo; };
class MyClass { public MyValueType mvt; };
MyClass[] rgo = new MyClass[2000];
MyClass mc = &rgo[1234]; // <-- only difference -- add this line
fixed (MyValueType* pvt = &mc.mvt) // <-- and adjust accordingly here
*(int*)pvt = 1234;
This is where I'll leave it unless someone can provide corrections or more information...