On a 64 bit machine, can I safely operate on indiv

Background

I am doing parallel operations on rows and columns in images. My images are 8 bit or 16 bit pixels and I'm on a 64 bit machine. When I do operations on columns in parallel, two adjacent columns may share the same 32 bit int or 64 bit long. Basically, I want to know whether I can safely operate on individual bytes of the same quadword in parallel.

Minimal Test

I wrote a minimal test function that I have not been able to make fail. For each byte in a 64 bit long, I concurrently perform successive multiplications in a finite field of order p. I know that by Fermat's little theorem a^(p-1) = 1 mod p when p is prime. I vary the values a and p for each of my 8 threads, and I perform k*(p-1) multiplications of a. When the threads finish each byte should be 1. And in fact, my test cases pass. Each time I run, I get the following output:

8
101010101010101
101010101010101

My system is Linux 4.13.0-041300-generic x86_64 with an 8 core Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz. I compiled with g++ 7.2.0 -O2 and examined the assembly. I added the assembly for the "INNER LOOP" and commented it. It seems to me that the code generated is safe because the stores are only writing the lower 8 bits to the destination instead of doing some bitwise arithmetic and storing to the entire word or quadword. g++ -O3 generated similar code.

Question:

I want to know if this code is always thread-safe, and if not, in what conditions would it not be. Maybe I am being very paranoid, but I feel that I would need to operate on quadwords at a time in order to be safe.

#include <iostream>
#include <pthread.h>

class FermatLTParams
{
public:
    FermatLTParams(unsigned char *_dst, unsigned int _p, unsigned int _a, unsigned int _k)
        : dst(_dst), p(_p), a(_a), k(_k) {}

    unsigned char *dst;
    unsigned int p, a, k;
};

void *PerformFermatLT(void *_p)
{  
    unsigned int j, i;
    FermatLTParams *p = reinterpret_cast<FermatLTParams *>(_p);
    for(j=0; j < p->k; ++j)
    {    
        //a^(p-1) == 1 mod p

        //...BEGIN INNER LOOP
        for(i=1; i < p->p; ++i)
        {
            p->dst[0] = (unsigned char)(p->dst[0]*p->a % p->p);
        }
        //...END INNER LOOP

        /* gcc 7.2.0 -O2  (INNER LOOP)

        .L4:
            movq    (%rdi), %r8             # r8 = dst
            xorl    %edx, %edx              # edx = 0
            addl    $1, %esi                # ++i
            movzbl  (%r8), %eax             # eax (lower 8 bits) = dst[0]
            imull   12(%rdi), %eax          # eax =  a * eax
            divl    %ecx                    # eax = eax / ecx;   edx = eax % ecx    
            movb    %dl, (%r8)              # dst[0] = edx (lower 8 bits)
            movl    8(%rdi), %ecx           # ecx = p
            cmpl    %esi, %ecx              # if (i < p)
            ja      .L4                     #   goto L4
        */

    }
    return NULL;
}

int main(int argc, const char **argv)
{
    int i;
    unsigned long val = 0x0101010101010101; //a^0 = 1
    unsigned int k = 10000000;
    std::cout << sizeof(val) << std::endl;
    std::cout << std::hex << val << std::endl;
    unsigned char *dst = reinterpret_cast<unsigned char *>(&val);
    pthread_t threads[8];
    FermatLTParams params[8] = 
    { 
        FermatLTParams(dst+0, 11, 5, k),
        FermatLTParams(dst+1, 17, 8, k),
        FermatLTParams(dst+2, 43, 3, k),
        FermatLTParams(dst+3, 31, 4, k),
        FermatLTParams(dst+4, 13, 3, k),
        FermatLTParams(dst+5, 7, 2, k),
        FermatLTParams(dst+6, 11, 10, k),
        FermatLTParams(dst+7, 13, 11, k)
    };

    for(i=0; i < 8; ++i)
    {
        pthread_create(threads+i, NULL, PerformFermatLT, params+i);
    }
    for(i=0; i < 8; ++i)
    {
        pthread_join(threads[i], NULL);
    }

    std::cout << std::hex << val << std::endl;
    return 0;
}

The answer is YES, you can safely operate on individual bytes of a 64-bit quadword in parallel, by different threads.

It is amazing that it works, but it would be a disaster if it did not. All hardware acts as if a core writing a byte in its own core marks not just that the cache line is dirty, but which bytes within it. When that cache line (64 or 128 or even 256 bytes) eventually gets written to main memory, only the dirty bytes actually modify the main memory. This is essential, because otherwise when two threads were working on independent data that happened to occupy the same cache line, they would trash each other's results.

This can be bad for performance, because the way it works is partly through the magic of "cache coherency," where when one thread writes a byte all the caches in the system that have that same line of data are affected. If they're dirty, they need to write to main memory, and then either drop the cache line, or capture the changes from the other thread. There are all kinds of different implementations, but it is generally expensive.