Implement `memcpy()`: Is `unsigned char *` needed?

Posted 2019-07-13 05:30

Question:

I was implementing a version of memcpy() to be able to use it with volatile. Is it safe to use char * or do I need unsigned char *?

volatile void *memcpy_v(volatile void *dest, const volatile void *src, size_t n)
{
    const volatile char *src_c  = (const volatile char *)src;
    volatile char *dest_c       = (volatile char *)dest;

    for (size_t i = 0; i < n; i++) {
        dest_c[i]   = src_c[i];
    }

    return  dest;
}

I suspect unsigned is necessary to avoid overflow problems when any byte in the buffer holds a value greater than INT8_MAX, which I think might be undefined behaviour.

Answer 1:

In theory, your code might run on a machine which forbids one bit pattern in a signed char. It might use ones' complement or sign-magnitude representations of negative integers, in which one bit pattern would be interpreted as a 0 with a negative sign. Even on two's-complement architectures, the standard allows the implementation to restrict the range of negative integers so that INT_MIN == -INT_MAX, although I don't know of any actual machine which does that.

So, according to §6.2.6.2p2, there may be one signed character value which an implementation might treat as a trap representation:

Which of these [representations of negative integers] applies is implementation-defined, as is whether the value with sign bit 1 and all value bits zero (for the first two [sign-magnitude and two's complement]), or with sign bit and all value bits 1 (for ones' complement), is a trap representation or a normal value. In the case of sign and magnitude and ones’ complement, if this representation is a normal value it is called a negative zero.

(There cannot be any other trap values for character types, because §6.2.6.2 requires that signed char not have any padding bits, which is the only other way that a trap representation can be formed. For the same reason, no bit pattern is a trap representation for unsigned char.)

So, if this hypothetical machine has a C implementation in which char is signed, then it is possible that copying an arbitrary byte through a char will involve copying a trap representation.

For signed integer types other than char (if it happens to be signed) and signed char, reading a value which is a trap representation is undefined behaviour. But §6.2.6.1/5 allows reading and writing these values for character types only:

Certain object representations need not represent a value of the object type. If the stored value of an object has such a representation and is read by an lvalue expression that does not have character type, the behavior is undefined. If such a representation is produced by a side effect that modifies all or any part of the object by an lvalue expression that does not have character type, the behavior is undefined. Such a representation is called a trap representation. (Emphasis added)

(The third sentence is a bit clunky, but to simplify: storing a value into memory is a "side effect that modifies all of the object", so it's permitted as well.)

In short, thanks to that exception, you can use char in an implementation of memcpy without worrying about undefined behaviour.

However, the same is not true of strcpy. strcpy must check for the trailing NUL byte which terminates a string, which means it needs to compare the value it reads from memory with 0. And the comparison operators (indeed, all arithmetic operators) first perform integer promotion on their operands, which will convert the char to an int. Integer promotion of a trap representation is undefined behaviour, as far as I know, so on the hypothetical C implementation running on the hypothetical machine, you would need to use unsigned char in order to implement strcpy.
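As a rough illustration of that last point, here is a minimal sketch of a strcpy-style loop that reads and compares every byte as unsigned char. The name strcpy_u is made up for this example and is not a standard function, and the assert is only a precondition check added here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical name: a strcpy-style copy that never reads the source
   through a (possibly signed) char lvalue. */
char *strcpy_u(char *dest, const char *src)
{
    assert(dest != NULL && src != NULL);

    unsigned char *d = (unsigned char *)dest;
    const unsigned char *s = (const unsigned char *)src;

    /* The value compared with 0 has type unsigned char, which has no
       trap representations, so the integer promotion is always defined. */
    while ((*d = *s) != 0) {
        d++;
        s++;
    }
    return dest;
}
```

On ordinary two's complement machines this behaves exactly like a char-based loop; the difference only matters on the hypothetical implementation described above.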



Answer 2:

Is it safe to use char * or do I need unsigned char *?

Perhaps


"String handling" functions such as memcpy() have the specification:

For all functions in this subclause, each character shall be interpreted as if it had the type unsigned char (and therefore every possible object representation is valid and has a different value). C11dr §7.24.1 ¶3

unsigned char is the specified "as if" type. There is little to be gained by trying other types, which may or may not work.


Using char with memcpy() may work, but extending that paradigm to other like functions leads to problems.

One big reason to avoid char for str...() and mem...()-like functions is that it sometimes makes an unexpected functional difference.

memcmp() and strcmp() certainly behave differently with (signed) char vs. unsigned char.
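To make that difference concrete, here is a small sketch with two hand-written byte comparisons, one reading through signed char and one through unsigned char. The names cmp_signed, cmp_unsigned and same_sign_as_memcmp are hypothetical helpers invented for this example. It assumes a typical two's complement implementation with 8-bit bytes, where the byte 0x80 reads as -128 through signed char.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: compares bytes as signed char. */
int cmp_signed(const void *a, const void *b, size_t n)
{
    assert(a != NULL && b != NULL);
    const signed char *pa = a, *pb = b;

    for (size_t i = 0; i < n; i++) {
        if (pa[i] != pb[i])
            return (pa[i] > pb[i]) - (pa[i] < pb[i]);
    }
    return 0;
}

/* Hypothetical helper: compares bytes as unsigned char, as memcmp() is
   specified to do. */
int cmp_unsigned(const void *a, const void *b, size_t n)
{
    assert(a != NULL && b != NULL);
    const unsigned char *pa = a, *pb = b;

    for (size_t i = 0; i < n; i++) {
        if (pa[i] != pb[i])
            return (pa[i] > pb[i]) - (pa[i] < pb[i]);
    }
    return 0;
}

/* Returns 1 when `result` has the same sign as memcmp(a, b, n). */
int same_sign_as_memcmp(int result, const void *a, const void *b, size_t n)
{
    int ref = memcmp(a, b, n);
    return ((result > 0) - (result < 0)) == ((ref > 0) - (ref < 0));
}
```

With a = {0x80} and b = {0x01}, cmp_unsigned() agrees with memcmp() that a sorts after b, while cmp_signed() reports the opposite order on such an implementation.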

Pedantic: On relic non-two's-complement implementations with signed char, only '\0' should end a string. Yet negative_zero == 0 also holds, and a char holding negative_zero should not indicate the end of a string.



Answer 3:

You do not need unsigned.

Like so:

volatile void *memcpy_v(volatile void *dest, const volatile void *src, size_t n)
{
    const volatile char *src_c  = (const volatile char *)src;
    volatile char *dest_c       = (volatile char *)dest;

    for (size_t i = 0; i < n; i++) {
        dest_c[i]   = src_c[i];
    }

    return  dest;
}

Attempting to make a conforming implementation where char has a trap value will eventually lead to a contradiction:

  • fopen("", "rb") does not require use of only fread() and fwrite().
  • fgets() takes a char * as its first argument and can be used on binary files.
  • strlen() finds the distance to the next null from a given char *. Since fgets() is guaranteed to have written one, it will not read past the end of the array and therefore will not trap.


Answer 4:

unsigned is not needed, but there is no reason to use plain char for this function. Plain char should only be used for actual character strings. For other uses, the types unsigned char, uint8_t and int8_t are more precise because the signedness is explicitly specified.

If you want to simplify the function code, you can remove the casts:

volatile void *memcpy_v(volatile void *dest, const volatile void *src, size_t n) {
    const volatile unsigned char *src_c = src;
    volatile unsigned char *dest_c = dest;

    for (size_t i = 0; i < n; i++) {
        dest_c[i] = src_c[i];
    }
    return dest;
}
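
One quick way to exercise a function like this is to push every possible byte value through it and check that nothing changes. The sketch below repeats the function so that it compiles on its own; memcpy_v_roundtrip_ok is a hypothetical self-check invented for this example, not part of the original answer.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Same function as above, repeated so this block is self-contained. */
volatile void *memcpy_v(volatile void *dest, const volatile void *src, size_t n)
{
    const volatile unsigned char *src_c = src;
    volatile unsigned char *dest_c = dest;

    for (size_t i = 0; i < n; i++) {
        dest_c[i] = src_c[i];
    }
    return dest;
}

/* Hypothetical self-check: returns 1 if every byte value 0..UCHAR_MAX
   survives a copy between two volatile buffers, 0 otherwise. */
int memcpy_v_roundtrip_ok(void)
{
    static volatile unsigned char src[UCHAR_MAX + 1];
    static volatile unsigned char dst[UCHAR_MAX + 1];

    for (size_t i = 0; i < sizeof src; i++) {
        src[i] = (unsigned char)i;
    }

    volatile void *ret = memcpy_v(dst, src, sizeof src);
    assert(ret == dst);   /* the function returns its dest argument */

    for (size_t i = 0; i < sizeof dst; i++) {
        if (dst[i] != src[i])
            return 0;
    }
    return 1;
}
```

Because both buffers are accessed through volatile-qualified lvalues, the compiler is not allowed to drop or merge the individual byte accesses, which is the reason for writing memcpy_v in the first place.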