-->

How do computers differentiate between letters and

2019-07-19 02:25发布

问题:

I was just curious because 65 is the same as the letter A

If this is the wrong stack sorry.

回答1:

Short answer. They don't. Longer answer, every binary combination between 00000000 and 11111111 has a character representation in the ASCII character set. 01000001 just happens to be the first capital letter in the Latin alphabet that was designated over 30 years ago. There are other character sets, and code pages that represent different letter, numbers, non-printable and accented letters. It's entirely possible that the binary 01000001 could be a lower case z with a tilde over the top in a different character set. 'computers' don't know (or care) what a particular binary representation means to humans.



回答2:

"65 is the same as the letter A": It is true if you say it is. But not saying more than that isn't very useful.

There is no text but encoded text. There are no numbers but encoded numbers. To the CPU, some number encodings are native, everything else is just undifferentiated data.

(Some data is just data for programs, other data is the CPU instructions of programs. It's a security problem if a CPU executes data as instructions inappropriately. Some architectures keep program data and instructions separate.)

Common native number encodings are signed and unsigned integers of 1, 2, 4, and 8 bytes and IEEE-754 single and double precision floating point numbers. Signed integers are usually two's-complement. Multi-byte integers have a byte ordering (or endianness) because on typical machines each byte is individually addressable. If a number encoding is not native, a program library is needed to process such data.

Text is a sequence of encoded characters from a character set. There are hundreds of character sets. A character set is an assignment of a conceptual character to a number called a codepoint. Sometimes the conceptual characters are categorized as lowercase letter, digit, symbol, etc. A codepoint value is mapped to bytes using a character encoding. Most character sets have one encoding, but Unicode has several. Some character sets are subsets of other character sets—such relationships are not generally useful because exactly one character set is used in any one context.

A program is a set of instructions that operate on data. It must apply the correct operations to the right data. So, it is the program that differentiates between text and number, usually by its location or flow path.

Stored data must be in a known layout of encoded text and numbers. Sometimes the layout is stored also. The layout is called metadata. Without the metadata accompanying the data, or being agreed upon, the data cannot be used.

It's all quite simple with appropriate bookkeeping. But there are several methods of bookkeeping so there is no general solution to how to handle data without metadata. Methods include: Well-known and/or registered file extensions, HTTP headers, MIME types, HTML meta charset tag, XML encoding declaration. Some methods only work in a certain context, such as audio/video codecs having a four-character code (FourCC), and unix shell scripts with a shebang. Some methods only help narrow guessing, such as file signatures. Needless to say, guessing should be avoided; it leads to security issues and data loss.

Unfortunately, text files are often without metadata. It is particularly important to agree upon or separately communicate the metadata.

Data without metadata is "binary". So the writer of text must agree with the reader on which character encoding is to be used. Similarly, for all types of data. Here reader and writer are both humans and programs.