D concurrent writing to buffer

Say you have a buffer of size N which must be set to definite values (say to zero, or something else). This value setting in the buffer is divided over M threads, each handling N / M elements of the buffer.

The buffer cannot be immutable, since we change the values. Message passing won't work either, since it is forbidden to pass ref or array (= pointer) types between threads. So it must happen through shared? No, since in my case the buffer elements are of type creal, so arithmetic on them is not atomic.

At the end, the main program must wait until all threads are finished. It is given that each thread only writes to its own subset of the array: no two threads overlap in the array or depend on each other in any way.

How would I go about writing to (or modifying) a buffer in a concurrent manner?

PS: sometimes I can simply divide the array into M consecutive pieces, but sometimes I go over the array column-by-column (the array is 1D but represents 2D data). In that case the sub-arrays the threads work on are actually interleaved in the parent array. Argh.

EDIT: I figured out that the type shared(creal)[] would work, since now the elements are shared and not the array itself. You could parallelize interleaved arrays I bet. There is some disadvantage though:

The shared storage class is so strict that the allocation itself must be written with the keyword. That makes it hard to encapsulate: since the caller supplies the array, it is obliged to pass a shared array and can't just pass a regular array and let the processing function worry about parallelism. Instead, the calling function must worry about parallelism too, so that the processing function receives a shared array and needn't reallocate the array into shared space.

There is also a very strange bug, that when I dynamically allocate shared(creal)[] at certain spots, it simply hangs at allocation. Seems very random and can't find the culprit... In the test example this works, but not in my project... This turned out to be a bug in DMD / OptLink.

EDIT2: I never mentioned it, but this is for implementing the FFT (Fast Fourier Transform). So I have no power over selecting precise cache-aligned slices. All I know is that the elements are of type creal and the number of elements is a power of 2 (per row / column).

You can use the std.parallelism module:

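import std.parallelism : parallel;

T[] buff; // T is the element type
foreach (ref elem; parallel(buff)) elem = 0; // iterations are spread over the task pool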
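The same idea should extend to the interleaved column-by-column case from the question. Here is a minimal sketch, assuming the 1D buffer holds the 2D data in row-major order, so the elements of one column sit `cols` apart. The name `fillColumns` is hypothetical, and `double` stands in for creal to keep the example short.

```d
import std.parallelism : parallel;
import std.range : iota;

// Hypothetical helper: sets every element of a row-major rows x cols
// buffer, treating each column as one unit of parallel work.
void fillColumns(double[] buf, size_t rows, size_t cols, double value)
{
    assert(buf.length == rows * cols);
    foreach (col; parallel(iota(cols)))
    {
        // The elements of one column are `cols` apart in the 1D buffer,
        // so no two columns ever touch the same element.
        for (size_t i = col; i < buf.length; i += cols)
            buf[i] = value;
    }
}

void main()
{
    auto buf = new double[](4 * 8); // 4 rows, 8 columns
    fillColumns(buf, 4, 8, 0.0);
}
```

The parallel foreach only returns once every column has been processed, so the caller does not need a separate wait step.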

But if you want to reinvent this, you can just use shared. It is thread-safe as long as only one thread accesses a given element at a time, and if you enforce this with the appropriate join() or Task.*force() call (yieldForce, spinForce or workForce), so much the better.
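As a rough sketch of the hand-rolled route, the following splits the buffer into contiguous slices, gives each slice to its own thread, and joins them all before returning. It assumes plain core.thread threads and `double` elements instead of creal. `startFill` and `fillChunked` are made-up names for illustration, and the sketch uses an ordinary (non-shared) array, relying only on the guarantee that the slices do not overlap.

```d
import core.thread : Thread;

// Hypothetical helper: a separate function gives every thread its own
// closure, so each delegate captures its own slice.
Thread startFill(double[] part, double value)
{
    auto t = new Thread({ part[] = value; });
    t.start();
    return t;
}

// Hypothetical helper: fills buf using up to m threads, one contiguous
// slice per thread, and returns only after all of them have finished.
void fillChunked(double[] buf, size_t m, double value)
{
    assert(m > 0);
    immutable chunk = (buf.length + m - 1) / m; // ceiling division
    Thread[] threads;
    for (size_t lo = 0; lo < buf.length; lo += chunk)
    {
        immutable hi = lo + chunk < buf.length ? lo + chunk : buf.length;
        threads ~= startFill(buf[lo .. hi], value);
    }
    foreach (t; threads)
        t.join(); // the calling thread waits here
}

void main()
{
    auto buf = new double[](1000);
    fillChunked(buf, 4, 0.0);
}
```

Because every join() has returned before fillChunked does, the caller can read the whole buffer afterwards without further synchronization.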