Unlike barrier() (which I think I understand), mem_fence() does not affect all items in the work group. The OpenCL spec says (section 6.11.10), for mem_fence():
Orders loads and stores of a work-item executing a kernel.
(so it applies to a single work item).
But, at the same time, in section 3.3.1, it says that:
Within a work-item memory has load / store consistency.
so within a work item the memory is consistent.
So what kind of thing is mem_fence() useful for? It doesn't work across items, yet isn't needed within an item...
Note that I haven't used atomic operations (section 9.5 etc). Is the idea that mem_fence() is used in conjunction with those? If so, I'd love to see an example.
Thanks.
The spec, for reference.
Update: I can see how it is useful when used with barrier() (implicitly, since the barrier calls mem_fence()) - but surely there must be more, since it exists separately?
To try to put it more clearly (hopefully):
mem_fence() waits until all reads/writes to local and/or global memory made by the calling work-item prior to mem_fence() are visible to all threads in the work-group.
That comes from: http://developer.download.nvidia.com/presentations/2009/SIGGRAPH/asia/3_OpenCL_Programming.pdf
Memory operations can be reordered to suit the device they are running on. The spec states (basically) that any reordering of memory operations must ensure that memory is in a consistent state within a single work-item. However, what if you (for example) perform a store operation and the value stays in a work-item specific cache until a better time presents itself to write it through to local/global memory? If that same work-item loads from that address, it has the value in its cache, so no problem. But other work-items within the work-group don't, so they may read a stale value. Placing a memory fence ensures that, at the time of the memory fence call, local/global memory (as per the parameters) will be made consistent (any caches will be flushed, and any reordering will take into account that you expect other threads may need to access this data after this point).
I admit it is still confusing, and I won't swear that my understanding is 100% correct, but I think it is at least the general idea.
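To make the idea concrete, here is a minimal host-side sketch in C11 with POSIX threads. It is an analogy, not OpenCL kernel code: I am assuming that `atomic_thread_fence` plays roughly the role `mem_fence()` plays inside a kernel, and the names `payload`, `ready` and `run_fence_demo` are made up for illustration. One thread writes a plain value and then raises a flag; the fence keeps the data store from being reordered after the flag store, so a thread that sees the flag also sees the data.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

static int payload;        /* plain data, standing in for a value in global memory */
static atomic_int ready;   /* flag the reader polls (hypothetical name) */

static void *writer(void *arg) {
    (void)arg;
    payload = 42;                                  /* store that must be visible first */
    atomic_thread_fence(memory_order_release);     /* assumed analogue of mem_fence() */
    atomic_store_explicit(&ready, 1, memory_order_relaxed);
    return NULL;
}

static void *reader(void *arg) {
    int *out = arg;
    /* Spin until the flag is raised; note that nothing here blocks the writer. */
    while (atomic_load_explicit(&ready, memory_order_relaxed) == 0) { }
    atomic_thread_fence(memory_order_acquire);     /* pairs with the writer's fence */
    *out = payload;                                /* sees the value stored before the fence */
    return NULL;
}

/* Runs one writer and one reader and returns the value the reader observed. */
int run_fence_demo(void) {
    pthread_t w, r;
    int seen = 0;
    payload = 0;
    atomic_store(&ready, 0);
    int rc = pthread_create(&r, NULL, reader, &seen);
    assert(rc == 0);
    rc = pthread_create(&w, NULL, writer, NULL);
    assert(rc == 0);
    pthread_join(w, NULL);
    pthread_join(r, NULL);
    return seen;
}
```

Without the two fences, the C11 memory model would allow the reader to observe the flag but still read the old `payload`. The fence itself never makes either thread wait for the other; the waiting here comes only from the spin loop.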
Follow Up:
I found this link which talks about CUDA memory fences, but the same general idea applies to OpenCL:
http://developer.download.nvidia.com/compute/cuda/2_3/toolkit/docs/NVIDIA_CUDA_Programming_Guide_2.3.pdf
Check out section B.5 Memory Fence Functions.
They have a code example that computes the sum of an array of numbers in one call. The code is set up to compute a partial sum in each work-group. Then, if there is more summing to do, the code has the last work-group do the work.
So, basically 2 things are done in each work-group: a partial sum, which is written to a global variable, then an atomic increment of a global counter variable.
After that, if there is any more work left to do, the work-group whose atomic increment returned ("number of work-groups" - 1), i.e. the one that arrived last, is taken to be the last work-group. That work-group goes on to finish up.
Now, the problem (as they explain it) is that, because of memory re-ordering and/or caching, the counter may get incremented and the last work-group may begin to do its work before that partial sum global variable has had its most recent value written to global memory.
A memory fence will ensure that the value of that partial sum variable is consistent for all threads before moving past the fence.
I hope this makes some sense. It is confusing.
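The pattern may be easier to follow in code. Below is a rough host-side sketch in C11 with POSIX threads, where each thread stands in for one work-group. This is not the NVIDIA example and not OpenCL C; `NUM_GROUPS`, `CHUNK`, `partial`, `count` and `run_sum_demo` are names I made up, and `atomic_thread_fence` is assumed to stand in for the memory fence. Each "group" stores its partial sum, issues a fence, then atomically increments the counter. Whoever gets back `NUM_GROUPS - 1` knows it arrived last and adds up the partial sums.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

#define NUM_GROUPS 4   /* hypothetical number of "work-groups", one thread each */
#define CHUNK 256      /* hypothetical number of elements per group */

static int input[NUM_GROUPS * CHUNK];
static long partial[NUM_GROUPS];   /* one partial sum per group, plain memory */
static atomic_uint count;          /* number of groups that have finished */
static long total;                 /* written only by the last group */

static void *group_fn(void *arg) {
    int id = *(int *)arg;
    long sum = 0;
    for (int i = 0; i < CHUNK; i++)
        sum += input[id * CHUNK + i];
    partial[id] = sum;

    /* Make sure partial[id] is published before the counter is bumped. */
    atomic_thread_fence(memory_order_release);
    unsigned prev = atomic_fetch_add_explicit(&count, 1, memory_order_relaxed);

    if (prev == NUM_GROUPS - 1) {                    /* this group arrived last */
        atomic_thread_fence(memory_order_acquire);   /* see everyone's partial sums */
        long t = 0;
        for (int g = 0; g < NUM_GROUPS; g++)
            t += partial[g];
        total = t;
    }
    return NULL;
}

/* Sums 0..(NUM_GROUPS*CHUNK - 1) using the last-group-finishes pattern. */
long run_sum_demo(void) {
    pthread_t threads[NUM_GROUPS];
    int ids[NUM_GROUPS];
    for (int i = 0; i < NUM_GROUPS * CHUNK; i++)
        input[i] = i;
    atomic_store(&count, 0);
    total = 0;
    for (int g = 0; g < NUM_GROUPS; g++) {
        ids[g] = g;
        int rc = pthread_create(&threads[g], NULL, group_fn, &ids[g]);
        assert(rc == 0);
    }
    for (int g = 0; g < NUM_GROUPS; g++)
        pthread_join(threads[g], NULL);
    return total;
}
```

If the release fence were dropped, the counter increment could become visible before `partial[id]`, which is exactly the failure the guide describes: the last group would sum a stale partial result.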
The fence ensures that loads and/or stores issued before the fence will complete before any loads and/or stores issued after the fence. No synchronization is implied by the fences alone. The barrier operation supports a read/write fence in one or both memory spaces as well as blocking until all work-items in a given work-group reach it.
This is how I understand it (I'm still trying to verify it):
mem_fence will only make sure the memory is consistent and visible to all threads in the group; execution does NOT stop until the next memory transaction (local or global). This means that if a move or an add instruction follows a mem_fence, the device will continue to execute these "non-memory transaction" instructions.
barrier, on the other hand, will stop execution, period. It will only proceed after all threads reach that point AND all the memory transactions have been cleared.
In other words, barrier is a superset of mem_fence. barrier can prove more expensive in terms of performance than mem_fence.
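For contrast, here is a small host-side sketch of the barrier behaviour, again in C with POSIX threads rather than OpenCL C. I am assuming `pthread_barrier_wait` as a rough stand-in for `barrier(CLK_LOCAL_MEM_FENCE)`; `NUM_ITEMS`, `scratch` and `run_barrier_demo` are invented names. Every "work-item" writes its own slot, blocks at the barrier until all the others have arrived, and only then reads a neighbour's slot.

```c
#define _POSIX_C_SOURCE 200809L   /* for pthread_barrier_t under strict -std=c11 */
#include <assert.h>
#include <pthread.h>

#define NUM_ITEMS 4   /* hypothetical work-group size */

static pthread_barrier_t group_barrier;
static int scratch[NUM_ITEMS];   /* stands in for __local memory */
static int result[NUM_ITEMS];

static void *work_item(void *arg) {
    int id = *(int *)arg;
    scratch[id] = (id + 1) * 10;             /* each item writes only its own slot */
    /* Block until every item has arrived; earlier writes are visible afterwards. */
    pthread_barrier_wait(&group_barrier);
    result[id] = scratch[(id + 1) % NUM_ITEMS];   /* now safe to read a neighbour */
    return NULL;
}

/* Runs NUM_ITEMS threads and returns the sum of the values they read. */
int run_barrier_demo(void) {
    pthread_t threads[NUM_ITEMS];
    int ids[NUM_ITEMS];
    int rc = pthread_barrier_init(&group_barrier, NULL, NUM_ITEMS);
    assert(rc == 0);
    for (int i = 0; i < NUM_ITEMS; i++) {
        ids[i] = i;
        rc = pthread_create(&threads[i], NULL, work_item, &ids[i]);
        assert(rc == 0);
    }
    for (int i = 0; i < NUM_ITEMS; i++)
        pthread_join(threads[i], NULL);
    pthread_barrier_destroy(&group_barrier);

    int sum = 0;
    for (int i = 0; i < NUM_ITEMS; i++)
        sum += result[i];
    return sum;
}
```

The difference from a bare fence is the blocking: a thread that reaches the barrier early sits there until the rest arrive, which is where the extra cost comes from.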