Atomicity of 32bit read on multicore CPU

2019-02-22 08:08发布

问题:

(Note: I've added tags to this question based on where I feel will people will be who are likely to be able to help, so please don't shout:))

In my VS 2017 64bit project, I have a 32bit long value m_lClosed. When I want to update this, I use one of the Interlocked family of functions.

Consider this code, executing on thread #1

LONG lRet = InterlockedCompareExchange(&m_lClosed, 1, 0);   // Set m_lClosed to 1 provided it's currently 0

Now consider this code, executing on thread #2:

if (m_lClosed) // Do something

I understand that on a single CPU, this will not be a problem because the update is atomic and the read is atomic too (see MSDN), so thread pre-emption cannot leave the variable in a partially updated state. But on a multicore CPU, we really could have both these pieces of code executing in parallel if each thread is on a different CPU. In this example, I don't think that would be a problem, but it still feels wrong to be testing something that is in the process of possibly being updated.

This webpage tells me that atomicity on multiple CPUs is achieved via the LOCK assembly instruction, preventing other CPUs from accessing that memory. That sounds like what I need, but the assembly language generated for the if test above is merely

cmp   dword ptr [l],0  

... no LOCK instruction in sight.

How in an event like this are we supposed to ensure atomicity of the read?

EDIT 24/4/18

Firstly thanks for all the interest this question has generated. I show below the actual code; I purposely kept it simple to focus on the atomicity of it all, but clearly it would have been better if I had showed it all from minute one.

Secondly, the project in which the actual code lives is a VS2005 project; hence no access to C++11 atomics. That's why I didn't add the C++11 tag to the question. I am using VS2017 with a "scratch" project to save having to build the huge VS2005 one every time I make a change whilst I am learning. Plus, its a better IDE.

Right, so the actual code lives in an IOCP driven server, and this whole atomicity is about handling a closed socket:

class CConnection
{
    //...

    DWORD PostWSARecv()
    {
        if (!m_lClosed)
            return ::WSARecv(...);
        else
            return WSAESHUTDOWN;
    }

    bool SetClosed()
    {
        LONG lRet = InterlockedCompareExchange(&m_lClosed, 1, 0);   // Set m_lClosed to 1 provided it's currently 0
        // If the swap was carried out, the return value is the old value of m_lClosed, which should be 0.
        return lRet == 0;
    }

    SOCKET m_sock;
    LONG m_lClosed;
};

The caller will call SetClosed(); if it returns true, it will then call ::closesocket() etc. Please don't question why it is that way, it just is :)

Consider what happens if one thread closes the socket whilst another tries to post a WSARecv(). You might think that the WSARecv() will fail (the socket is closed after all!); however what if a new connection is established with the same socket handle as that which we just closed - we would then be posting the WSARecv() which will succeed, but this would be fatal for my program logic since we are now associating a completely different connection with this CConnection object. Hence, I have the if (!m_lClosed) test. You could argue that I shouldn't be handling the same connection in multiple threads, but that is not the point of this question :)

That is why I need to test m_lClosed before I make the WSARecv() call.

Now, clearly, I am only setting m_lClosed to a 1, so a torn read/write is not really a concern, but it is the principle I am concerned about. What if I set m_lClosed to 2147483647 and then test for 2147483647? In this case, a torn read/write would be more problematic.

回答1:

It really depends on your compiler and the CPU you are running on.

x86 CPUs will atomically read 32-bit values without the LOCK prefix if the memory address is properly aligned. However, you most likely will need some sort of memory barrier to control the CPUs out-of-order execution if the variable is used as a lock/count of some other related data. Data that is not aligned might not be read atomically, especially if the value straddles a page boundary.

If you are not hand coding assembly you also need to worry about the compilers reordering optimizations.

Any variable marked as volatile will have ordering constraints in the compiler (and possibly the generated machine code) when compiling with Visual C++:

The _ReadBarrier, _WriteBarrier, and _ReadWriteBarrier compiler intrinsics prevent compiler re-ordering only. With Visual Studio 2003, volatile to volatile references are ordered; the compiler will not re-order volatile variable access. With Visual Studio 2005, the compiler also uses acquire semantics for read operations on volatile variables and release semantics for write operations on volatile variables (when supported by the CPU).

Microsoft specific volatile keyword enhancements:

When the /volatile:ms compiler option is used—by default when architectures other than ARM are targeted—the compiler generates extra code to maintain ordering among references to volatile objects in addition to maintaining ordering to references to other global objects. In particular:

  • A write to a volatile object (also known as volatile write) has Release semantics; that is, a reference to a global or static object that occurs before a write to a volatile object in the instruction sequence will occur before that volatile write in the compiled binary.

  • A read of a volatile object (also known as volatile read) has Acquire semantics; that is, a reference to a global or static object that occurs after a read of volatile memory in the instruction sequence will occur after that volatile read in the compiled binary.

This enables volatile objects to be used for memory locks and releases in multithreaded applications.


For architectures other than ARM, if no /volatile compiler option is specified, the compiler performs as if /volatile:ms were specified; therefore, for architectures other than ARM we strongly recommend that you specify /volatile:iso, and use explicit synchronization primitives and compiler intrinsics when you are dealing with memory that is shared across threads.

Microsoft provides compiler intrinsics for most of the Interlocked* functions and they will compile to something like LOCK XADD ... instead of a function call.

Until "recently", C/C++ had no support for atomic operations or threads in general but this changed in C11/C++11 where atomic support was added. Using the <atomic> header and its types/functions/classes moves the alignment and reordering responsibility to the compiler so you don't have to worry about that. You still have to make a choice regarding memory barriers and this determines the machine code generated by the compiler. With relaxed memory order, the load atomic operation will most likely end up as a simple MOV instruction on x86. A stricter memory order can add a fence and possibly the LOCK prefix if the compiler determines that the target platform requires it.



回答2:

In C++11, an unsynchronized access to a non-atomic object (such as m_lClosed) is undefined behavior.

The standard provides all the facilities you need to write this correctly; you do not need non-portable functions such as InterlockedCompareExchange. Instead, simply define your variable as atomic:

std::atomic<bool> m_lClosed{false};

// Writer thread...
bool expected = false;
m_lClosed.compare_exhange_strong(expected, true);

// Reader...
if (m_lClosed.load()) { /* ... */ }

This is more than sufficient (it forces sequential consistency, which might be expensive). In some cases it might be possible to generate slightly faster code by relaxing the memory order on the atomic operations, but I would not worry about that.



回答3:

As I posted here, this question was never about protecting a critical section of code, it was purely about avoiding torn read/writes. user3386109 posted a comment here which I ended up using, but declined posting it as an answer here. Thus I am providing the solution I ended up using for completeness of this question; maybe it will help someone in the future.

The following shows the atomic setting and testing of m_lClosed:

long m_lClosed = 0;

Thread 1

// Set flag to closed
if (InterlockedCompareExchange(&m_lClosed, 1, 0) == 0)
    cout << "Closed OK!\n";

Thread 2

This code replaces if (!m_lClosed)

if (InterlockedCompareExchange(&m_lClosed, 0, 0) == 0)
    cout << "Not closed!";


回答4:

OK so as it turns out this really isn't necessary; this answer explains in detail why we don't need to use any interlocked operations for a simple read/write (but we do for a read-modify-write).