I have some misunderstanding about cache lines. I'm using Haswell and Ubuntu. Now let's say we have a 2-threaded application in which the following happens:
    mov [addr], dword 0xAC763F
    ; starting Thread 1 and Thread 2
Now let's say the threads perform the following actions in parallel:
    Thread 1                   Thread 2
    mov rax, [addr]            mov rax, [addr]
    mov [addr], dword 1        mov [addr], dword 2
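In C11 terms, the whole scenario looks roughly like the sketch below (a hypothetical translation with made-up names; the relaxed atomics are only there so the compiler emits the plain mov loads and stores shown above without reordering or removing them):

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdint.h>

    static _Atomic uint32_t addr_val;   /* stands in for [addr] */

    static void *thread_body(void *arg) {
        uint32_t v = (uint32_t)(uintptr_t)arg;
        (void)atomic_load_explicit(&addr_val, memory_order_relaxed); /* mov rax, [addr]     */
        atomic_store_explicit(&addr_val, v, memory_order_relaxed);   /* mov [addr], dword v */
        return NULL;
    }

    int main(void) {
        /* the initial store from the main thread */
        atomic_store_explicit(&addr_val, 0xAC763F, memory_order_relaxed);
        pthread_t t1, t2;
        pthread_create(&t1, NULL, thread_body, (void *)(uintptr_t)1);
        pthread_create(&t2, NULL, thread_body, (void *)(uintptr_t)2);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;
    }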
Now my understanding of what's going on is this:
- Before starting the threads, the main thread writes to the corresponding cache line (addr) and marks it as Exclusive.
- If both of the threads, Thread 1 and Thread 2, finished reading before writing started, then the cache line has the state Shared in all caches.
Now I don't understand in which cache the line gets marked as Invalid if both mov [addr], dword 1 in Thread 1 and mov [addr], dword 2 in Thread 2 happen "at the same time".
First of all "at the same time" feels a bit blurred. I think of this as "during the same CPU clock cycle". How does MESI
protocol implemetation solves this "the same time writing from different threads problem".
Regarding "I think of this as 'during the same CPU clock cycle'":
Different cores can use different clocks; e.g. one core can be at 4GHz while another is at 800MHz. (Only for Haswell Xeon, though; the dual / quad-core parts have all cores in a single clock domain. I've read that, and it matches what you see from looking at CPU frequency on idle cores while one core is busy.)
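If you want to check that yourself on Linux, here's a quick sketch (my addition) that reads each core's current frequency from the cpufreq sysfs interface; it assumes that interface is present on your kernel. Run it while one core is busy and compare the values.

    #include <stdio.h>

    int main(void) {
        char path[128];
        for (int cpu = 0; ; cpu++) {
            snprintf(path, sizeof path,
                     "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_cur_freq", cpu);
            FILE *f = fopen(path, "r");
            if (!f) break;                      /* no more CPUs */
            long khz;                           /* sysfs reports kHz */
            if (fscanf(f, "%ld", &khz) == 1)
                printf("cpu%d: %ld MHz\n", cpu, khz / 1000);
            fclose(f);
        }
        return 0;
    }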
Related: "What happens when different CPU cores write to the same RAM address without synchronization?" is a very similar question. (But the OP of that question doesn't know what MESI is.) Still, I went into more detail there about sending out RFO requests, so maybe read that answer if this one is too terse.
Regarding "Before starting the threads, the main thread writes to the corresponding cache line (addr) and marks it as Exclusive":
A core has to get the cache line in Exclusive state before it can modify it. Actually committing a write to L1D flips it from Exclusive to Modified with no communication with other cores. (L1D and L2 are write-back caches).
But yes, if both cores read the cache line before either of them writes, they will both have the line in Shared state. They can only flip the line to Exclusive after receiving a successful reply to an RFO request for that line. Wikipedia's MESI article has a diagram of state transitions, and a definition of RFO.
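To make those transitions concrete, here's a toy transcription of part of that diagram: the state of one line in one core's cache, reacting to local accesses and snooped remote events. This is purely illustrative (my sketch), not how the hardware is organized.

    #include <stdio.h>

    typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_t;

    typedef enum {
        LOCAL_READ,   /* this core loads from the line                        */
        LOCAL_WRITE,  /* this core stores (assumes its RFO already succeeded) */
        REMOTE_RFO,   /* another core's RFO arrives: invalidate               */
        REMOTE_READ   /* another core's read is snooped                       */
    } event_t;

    static mesi_t step(mesi_t s, event_t e, int others_have_copy) {
        switch (e) {
        case LOCAL_READ:   /* a miss fills as S or E; a hit changes nothing  */
            return (s == INVALID) ? (others_have_copy ? SHARED : EXCLUSIVE) : s;
        case LOCAL_WRITE:  /* I/S -> (RFO) -> E, then committing flips E -> M */
            return MODIFIED;
        case REMOTE_RFO:   /* lose the line (writing back first if Modified)  */
            return INVALID;
        case REMOTE_READ:  /* M/E/S all degrade to Shared                     */
            return (s == INVALID) ? INVALID : SHARED;
        }
        return s;
    }

    int main(void) {
        mesi_t s = INVALID;
        s = step(s, LOCAL_READ, 1);   /* both threads read first: Shared       */
        s = step(s, LOCAL_WRITE, 0);  /* our core's RFO won: commit, Modified  */
        s = step(s, REMOTE_RFO, 0);   /* the other core's store: we go Invalid */
        printf("final state: %d\n", s);  /* prints 0 (INVALID) */
        return 0;
    }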
It's certainly possible for conflicting RFO requests to be in flight at once. They take many cycles to arrive at another core, so there's plenty of time for stores on different cores to each initiate an RFO before either receives the other's. (Not that receiving one would stop a core from sending its own RFO; writing to either an Invalid or Shared line needs an RFO to get it to Exclusive state so the store can commit.)
I'm not 100% sure that the decision of which request wins is made in the L3 cache, but Haswell's L3 is inclusive, and used as a backstop / filter for coherency traffic. Instead of actually broadcasting all requests to all cores, L3 is tag-inclusive with extra info to track which cores (might) have copies of which line. L1 and L2 are private per-core, so L3 is the first shared level of cache.
I'm guessing that L3 handles arbitration for which core's RFO completes first, because it's already keeping track of which cores (might) need to see which RFOs. Presumably this is done in the slice of L3 that holds the relevant physical address.
As @fuz points out, MESI is designed around a bus topology, not a more complex network where messages are routed. Intel's design has the same states, but doesn't literally have to work as simply as the usual CPU-architecture descriptions say.
So, what we can say for sure is: via some unknown internal mechanism, the CPU decides that one RFO was first. A later one arriving while the first one is still doing a round-trip might be cancelled (so that core has to retry later), or it might be buffered.
We do know that Intel CPUs have a hardware arbitration mechanism for contending atomic RMW operations, like lock add [mem], eax. Presumably it's exactly the same mechanism that arbitrates multiple write-only accesses to the same line, because the only difference is that a locked operation holds onto the line for the duration of the operation, not responding to invalidate requests during that time.
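For reference, you can get exactly that kind of locked RMW from C11: on x86-64, GCC and Clang compile a fetch_add whose result is unused to lock add. A minimal sketch (compile with -pthread):

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    static _Atomic int counter;

    static void *bump(void *arg) {
        (void)arg;
        /* result unused => compilers emit: lock add dword ptr [counter], 1 */
        atomic_fetch_add_explicit(&counter, 1, memory_order_relaxed);
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, bump, NULL);
        pthread_create(&t2, NULL, bump, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("%d\n", atomic_load(&counter));  /* always 2: both RMWs happen */
        return 0;
    }

However the hardware orders the two RFOs, both increments always happen; arbitration only decides which goes first.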
You can talk about multiple RFO requests arriving at the same slice of L3 in the same clock cycle of the "uncore" clock that the L3 uses.
That is probably possible in practice on Haswell; cores are connected with a bi-directional ring bus (32-bytes wide each way), so two messages per (uncore) clock cycle can arrive at any given slice of L3 cache. Also, each slice of L3 is connected to a core, so a request from that core could also arrive at the same time.
In that case, it's probably simple: if a slice can even receive multiple messages destined for it (rather than just passing through on the ring) in the same cycle, it's probably hard-wired so that one of the three sources for that slice always wins.
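To see the question's race from software, here's an experiment extending the C sketch from the question (again my addition, not something from Intel's docs): whichever core's RFO wins, each aligned 4-byte store commits atomically once its core owns the line, so the final value is always exactly 1 or exactly 2, never a mix of bytes from both.

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdint.h>
    #include <stdio.h>

    static _Atomic uint32_t addr_val;

    static void *thread_body(void *arg) {
        uint32_t v = (uint32_t)(uintptr_t)arg;
        (void)atomic_load_explicit(&addr_val, memory_order_relaxed); /* mov rax, [addr] */
        atomic_store_explicit(&addr_val, v, memory_order_relaxed);   /* mov [addr], v   */
        return NULL;
    }

    int main(void) {
        long last_was_1 = 0, last_was_2 = 0;
        for (int i = 0; i < 100000; i++) {
            atomic_store_explicit(&addr_val, 0xAC763F, memory_order_relaxed);
            pthread_t t1, t2;
            pthread_create(&t1, NULL, thread_body, (void *)(uintptr_t)1);
            pthread_create(&t2, NULL, thread_body, (void *)(uintptr_t)2);
            pthread_join(t1, NULL);
            pthread_join(t2, NULL);
            /* the line's last owner determines the final value: 1 or 2 */
            if (atomic_load_explicit(&addr_val, memory_order_relaxed) == 1)
                last_was_1++;
            else
                last_was_2++;
        }
        printf("store 1 last: %ld, store 2 last: %ld\n", last_was_1, last_was_2);
        return 0;
    }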