Do the strict aliasing rules in C++20 allow `reint

Posted 2019-07-11 04:25

Do C++20's strict aliasing rules [basic.lval]/11 arbitrarily allow the following...

  1. cast between char* and char8_t*
string str = "string";
u8string u8str { (char8_t*) &*str.data() }; // c++20 u8string

u8string u8str2 = u8"zß水                

3 answers
戒情不戒烟
#2 · 2019-07-11 04:55

A C-style cast is not the same thing as reinterpret_cast.

The standard sections I think are relevant to your question:

6.7.1.9: Type char8_t denotes a distinct type whose underlying type is unsigned char. Types char16_t and char32_t denote distinct types whose underlying types are uint_least16_t and uint_least32_t, respectively, in <cstdint>.

7.2.1.11: If a program attempts to access the stored value of an object through a glvalue whose type is not similar ([conv.qual]) to one of the following types the behavior is undefined:

1. the dynamic type of the object,

2. a type that is the signed or unsigned type corresponding to the dynamic type of the object, or

3. a char, unsigned char, or std::byte type.

  1. char8_t*-->char* Yes.
    Because char is one of the types through which any object may be accessed. But the standard does not guarantee that the values read through the converted pointer are equal for distinct types: char may be signed or unsigned, while char8_t is unsigned. char8_t*-->unsigned char* is valid too, but arguably carries no such guarantee either, because the types are still distinct. Given that unsigned char is char8_t's underlying type, the values should be equal, I guess?
  2. char*-->char8_t* No.
    As per 6.7.1.9, those types are distinct. One could argue that the "whose underlying type is unsigned char" wording brings char8_t under 7.2.1.11.3, where unsigned char is explicitly allowed, but I don't think that is the correct interpretation; being distinct types should be the deciding factor. This is supported by the following comment in proposal P0482R6, "char8_t: A type for UTF-8 characters and strings" (Revision 6, 2018-11-09; I did not find a more recent revision):

    Finally, processing of UTF-8 strings is currently subject to an optimization pessimization due to glvalue expressions of type char potentially aliasing objects of other types. Use of a distinct type that does not share this aliasing behavior may allow for further compiler optimizations.

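  3. uint32_t*<-->char32_t*, uint16_t*<-->char16_t*, uint16_t*<-->uint_least16_t*, uint32_t*<-->uint_least32_t*, uint_least32_t*<-->char32_t*, uint_least16_t*<-->char16_t*: No.
    The types in each pair are distinct, so 7.2.1.11.1 does not apply, and neither type is listed in 7.2.1.11.3, so not even the underlying-type argument considered in 2. can be relevant. (The exception is a uintN_t/uint_leastN_t pair whose typedefs happen to name the same type on a given implementation; then there is nothing to alias.)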

  4. unsigned char*-->char8_t* No.
    By the same argument as in 2. It is not a T*->T* cast, which would obviously be allowed.

  5. char8_t*-->unsigned char* Yes.
    Because unsigned char is also one of the allowed types per 7.2.1.11.3. I would still argue that the standard does not guarantee that the values read through the converted pointer will be equal. But given that it is char8_t's underlying type, they have no option other than to be equal, I guess?

Root(大扎)
#3 · 2019-07-11 05:01

Just so we are on the same page, a C-style cast of the form (T*)expression is equivalent here to reinterpret_cast<T*>(expression) ([expr.cast]/4.4), which is equivalent to static_cast<T*>(static_cast<void*>(expression)) ([expr.reinterpret.cast]/7). This does nothing to the value of the pointer, as the objects are not pointer-interconvertible (see [expr.static.cast]/13 and [basic.compound]/4).

So yes, we would have to look at [basic.lval]/11 to see if it can be aliased. The reference must have a type which is similar to:

  • the dynamic type of the object,
  • a type that is the signed or unsigned type corresponding to the dynamic type of the object, or
  • a char, unsigned char, or std::byte type.

Which is not the case. Even though char8_t has the underlying type of unsigned char, it is not a similar type.

So, for example:

unsigned char uc = 'a';

// Represents address of uc
unsigned char* uc_ptr = &uc;

// Still holds the address of uc, not a char8_t
char8_t* c8_ptr = reinterpret_cast<char8_t*>(uc_ptr);

char8_t c8 = *c8_ptr;  // UB, as `char8_t` is not `cv unsigned char`.

Though because of [basic.fundamentals]/6, which says:

A fundamental type specified to have a signed or unsigned integer type as its underlying type has the same object representation [...]

You can do reinterpret_cast<unsigned char*>(pointer-to-char8_t) and have all the values be equal, but that is the only case (char* works too iff char is unsigned; otherwise the values may compare unequal, even for values < 128). For all other types, you can use this rule to memcpy:

// Assuming std::is_same_v<uint32_t, uint_least32_t>
vector<uint32_t> ui32vec = { 0x007a, 0x00df, 0x6c34, 0x0001f34c };
u32string u32str(ui32vec.size(), U'\x00');
std::memcpy(u32str.data(), ui32vec.data(), ui32vec.size() * sizeof(uint32_t));

u32string u32str2 = U"zß水                                                                    
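
If the typedef assumption in the comment above does not hold, or you would rather avoid raw byte copies, an element-wise conversion also sidesteps the aliasing question. A minimal sketch (to_u32string is a hypothetical name, not a standard function):

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: builds a u32string by converting each
// uint32_t value to char32_t. Nothing is accessed through a
// pointer of a different type, so [basic.lval]/11 never comes up.
std::u32string to_u32string(const std::vector<std::uint32_t>& code_points) {
    return std::u32string(code_points.begin(), code_points.end());
}
```

This relies only on the iterator-range constructor of std::basic_string and the implicit integral conversion from uint32_t to char32_t, so it does not depend on the two types having the same representation.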
#4 · 2019-07-11 05:07

The char*_t family of types does not have any special aliasing rules. Therefore, the standard rules apply, and those rules have no exceptions for conversions between a type and its underlying type.

So most of what you did is UB. The one case that isn't UB is char due to its special nature. You can in fact read the bytes of a char8_t as an array of char. But you can't do the opposite, reading the bytes of a char array as char8_t.

Now, these types are completely convertible to each other, so you can convert the values in those arrays to the other type any time you want.

All that being said, on real implementations those things will almost certainly work. Well, until they don't, because you tried to change one thing through a thing that it's not supposed to be changed by, and the compiler doesn't reload the changed value because it assumed that it couldn't have been changed. So really, just use the correct, meaningful type.
