ASCII strings and endianness

Published 2019-01-21 06:23


神经病院院长


Question:

An intern who works with me showed me an exam he had taken in computer science about endianness issues. There was a question that showed an ASCII string "My-Pizza", and the student had to show how that string would be represented in memory on a little endian computer. Of course, this sounds like a trick question because ASCII strings are not affected by endian issues.

But shockingly, the intern claims his professor insists that the string would be represented as:

P-yM azzi

I know this can't be right. There is no way an ASCII string would be represented like that on any machine. But apparently, the professor is insisting on this. So, I wrote up a small C program and told the intern to give it to his professor.

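#include <string.h>
#include <stdio.h>

int main(void)
{
    const char* s = "My-Pizza";
    size_t length = strlen(s);
    /* Print each character next to the address it is stored at. */
    for (const char* it = s; it < s + length; ++it) {
        printf("%p : %c\n", (const void*)it, *it);  /* %p expects a void pointer */
    }
    return 0;
}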

This clearly demonstrates that the string is stored as "My-Pizza" in memory. A day later, the intern gets back to me and tells me the professor is now claiming that C is automagically converting the addresses to display the string in proper order.

I told him his professor is insane, and this is clearly wrong. But just to check my own sanity here, I decided to post this on stackoverflow so I could get others to confirm what I'm saying.

So I ask: who is right here?

Answer 1:

Without a doubt, you are correct.

ANSI C standard 6.1.4 specifies that string literals are stored in memory by "concatenating" the characters in the literal.

ANSI standard 6.3.6 also specifies the effect of addition on a pointer value:

When an expression that has integral type is added to or subtracted from a pointer, the result has the type of the pointer operand. If the pointer operand points to an element of an array object, and the array is large enough, the result points to an element offset from the original element such that the difference of the subscripts of the resulting and original array elements equals the integral expression.

If the idea attributed to this person were correct, then the compiler would also have to monkey around with integer math when the integers are used as array indices. Many other fallacies would also result which are left to the imagination.

The person may be confused because, unlike a string literal, a multi-character constant such as 'ABCD' is a single int (with an implementation-defined value), so its bytes are stored in the machine's endian order.

There are many reasons a person might be confused about this. As others have suggested here, he may be misreading what he sees in a debugger window, where the contents have been byte-swapped for readability of int values.

Answer 2:

The professor is confused. To see something like 'P-yM azzi', you need a memory inspection tool that displays memory in '4-byte integer' mode and at the same time gives you a "character interpretation" of each integer, from its highest-order byte to its lowest-order byte.

This, of course, has nothing to do with the string itself. And to say that the string itself is represented that way on a little-endian machine is utter nonsense.

Answer 3:

The professor is wrong if we're talking about a system that uses 8 bits per character.

I often work with embedded systems that actually use 16-bit characters, each word being little-endian. On such a system, the string "My-Pizza" would indeed be stored as "yMP-ziaz".

But as long as it's an 8-bit-per-character system, the string will always be stored as "My-Pizza" independent of the endian-ness of the higher-level architecture.

Answer 4:

Endianness defines the order of bytes within multi-byte values. Character strings are arrays of single-byte values. So each value (each character in the string) is the same on both little-endian and big-endian architectures, and endianness does not affect the order of the values within the array.

Answer 5:

You can quite easily prove that the compiler is doing no such "magic" transformations, by doing the printing in a function that doesn't know it's been passed a string:

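#include <stdio.h>
#include <string.h>

/* Sees only raw memory: nothing tells it that mem points at a string. */
void foo(const void *mem, size_t n)
{
    const char *cptr, *end;
    for (cptr = mem, end = cptr + n; cptr < end; cptr++)
        printf("%p : %c\n", (const void *)cptr, *cptr);
}

int main(void)
{
    const char* s = "My-Pizza";

    foo(s, strlen(s));
    foo(s + 1, strlen(s) - 1);
    return 0;
}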

Hell, you can even compile to assembly with gcc -S and conclusively determine the absence of magic.

Answer 6:

But shockingly, the intern claims his professor insists that the string would be represented as:

P-yM azzi

Represented as what? Represented to the user as a 32-bit integer dump? Or laid out in the computer's memory as P-yM azzi?

If the professor said "My-Pizza" would be laid out as "P-yM azzi" in the computer's memory because the computer has a little-endian architecture, then somebody, please, has got to teach that professor how to use a debugger! I think that's where all the professor's confusion stems from. I have an inkling that the professor is not a coder (not that I'm looking down on the professor); I think he doesn't have a way to prove in code what he learned about endianness.

Maybe the professor learned about endianness just a week ago, used a debugger incorrectly, was delighted by his new insight into computers, and then preached it to his students immediately.

If the professor said that the endianness of the machine has a bearing on how ASCII strings are represented in memory, he needs to clean up his act; somebody should correct him.

If the professor had instead given an example of how integers are laid out differently in memory depending on the machine's endianness, his students could appreciate what he is teaching.

Answer 7:

I assume the professor was trying to make a point by analogy about the endian/NUXI problem, but you're right when you apply it to actual strings. Don't let that distract from the fact that he was trying to teach students a point and a certain way of thinking about a problem.

Answer 8:

You may be interested to know that it is possible to emulate a little-endian architecture on a big-endian machine, or vice versa. The compiler has to emit code which auto-magically messes with the least significant bits of char* pointers whenever it dereferences them: on a 32-bit machine you'd map 00 <-> 11 and 01 <-> 10.

So, if you write the number 0x01020304 on a big-endian machine, and read back the "first" byte of that with this address-munging, then you get the least significant byte, 0x04. The C implementation is little-endian even though the hardware is big-endian.

You need a similar trick for short accesses. Unaligned accesses (if supported) may not refer to adjacent bytes. You also can't use native stores for types bigger than a word because they'd appear word-swapped when read back one byte at a time.

Obviously however, little-endian machines do not do this all the time, it's a very specialist requirement and it prevents you using the native ABI. Sounds to me as though the professor thinks of actual numbers as being "in fact" big-endian, and is deeply confused what a little-endian architecture really is and/or how its memory is being represented.

It's true that the string is "represented as" P-yM azzi on 32bit l-e machines, but only if by "represented" you mean "reading the words of the representation in order of increasing address, but printing the bytes of each word big-endian". As others have said, this is what some debugger memory views might do, so it is indeed a representation of the contents of the memory. But if you're going to represent the individual bytes, then it is more usual to list them in order of increasing address, no matter whether words are stored b-e or l-e, rather than represent each word as a multi-char literal. Certainly there is no pointer-fiddling going on, and if the professor's chosen representation has led him to think that there is some, then it has misled him.

Answer 9:

Also (and I haven't played with this in a long time, so I might be wrong), he might be thinking of Pascal, where strings are represented as "packed arrays" which, IIRC, are characters packed into 4-byte integers?

Answer 10:

AFAIK, endianness only makes sense when you want to break a large value into smaller ones. Therefore I don't think that C-style strings are affected by it, because they are, after all, just arrays of characters. When you are reading only one byte, how could it matter whether you read it from the left or the right?

Answer 11:

It's hard to read the prof's mind and certainly the compiler is not doing anything other than storing bytes to adjacent increasing addresses on both BE and LE systems, but it is normal to display memory in word-sized numbers, for whatever the word size is, and we write one thousand as 1,000. Not 000,1.

$ cat > /tmp/pizza
My-Pizza^D
$ od -X /tmp/pizza
0000000 502d794d 617a7a69
0000010
$

For the record, y == 0x79 and M == 0x4d.

Answer 12:

I came across this and felt the need to clear it up. No one here seems to have addressed the concept of bytes and words or how to address them. A byte is 8 bits. A word is a collection of bytes.

If the computer is:

byte addressable
with 4-byte (32-bit) words
word aligned
the memory is viewed "physically" (not dumped and byte-swapped)

then indeed, the professor would be correct. His failure to indicate this proves he doesn't exactly know what he is talking about, but he did understand the basic concept.

[Figure: Byte Order Within Words: (a) Big Endian, (b) Little Endian]

[Figure: Character and Integer Data in Words: (a) Big Endian, (b) Little Endian]

References

Intel® Fortran Compiler XE 13.0 User and Reference Guides

Answer 13:

Does the professor's "C" code look anything like this? If so, he needs to update his compiler.

main() {
    extrn putchar;
    putchar('Hell');
    putchar('o, W');
    putchar('orld');
    putchar('!*n');
}

Tags: c ascii endianness
