Why are DWORD values commonly represented in Hexadecimal?

Published 2019-04-06 16:55

Question:

I am trying to understand why a DWORD value is often described in Hexadecimal on MSDN.

The reason why I am analyzing this is because I am trying to understand fundamentally why all these different number data types exist. A local mentor suggested to me that the creation of DWORD and other Microsoft types had something to do with the evolution of processors. This gives meaning and context to my understanding of these data types. I would like more context and background.

Either way, I could use some explanation or some resources on how to remember the difference between DWORD, unsigned integers, bytes, bits, WORD, etc.

In summary, my questions are: 1) Why are DWORDs represented in Hex? 2) Can you provide resources on the differences between numerical data types and why they were created?

Answer 1:

Everything within a computer is a bunch of 0s and 1s. But writing an entire DWORD in binary is quite tedious:

00000000 11111111 00000000 11111111

To save space and improve readability, we like to write it in a shorter form. Decimal is what we're most familiar with, but it doesn't map well to binary. Octal and hexadecimal map quite conveniently, lining up exactly with the binary bits:

// each octal digit is exactly 3 binary digits
01 010 100 binary  =  124 octal

// each hexadecimal digit is exactly 4 binary digits
0101 0100 binary   =  54 hexadecimal

Since hex lines up very nicely with 8-bit Bytes (2 hex digits make a Byte), the notation stuck, and that's what gets used most. It's easier to read, easier to understand, easier to line up when messing around with bitmasks.

The normal shorthand for identifying which base is being used:

  1234543 = decimal
 01234543 = octal (leading zero)
0x1234543 = hexadecimal (starts with 0x)
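
As a rough illustration (a small C++ sketch of my own, not part of the original answer), the same value can be written with any of these prefixes, and one value can be printed back in each base:

#include <iostream>

int main() {
    // The same number written with the three common literal prefixes.
    unsigned int dec = 84;      // decimal: no prefix
    unsigned int oct = 0124;    // octal: leading zero
    unsigned int hex = 0x54;    // hexadecimal: leading 0x
    std::cout << std::boolalpha << (dec == oct && oct == hex) << '\n';  // true

    // Printing one DWORD-sized value in each base.
    unsigned int v = 0x00FF00FF;
    std::cout << std::dec << v << '\n'    // 16711935
              << std::oct << v << '\n'    // 77600377
              << std::hex << v << '\n';   // ff00ff
    return 0;
}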

As for your question about BYTE, WORD, DWORD, etc...

Computers started with the bit: just a 1 or a 0. (Bit even had a cameo in the original Tron.)

Bytes are 8 bits long (well, once upon a time there were 7-bit bytes, but we can ignore those). This lets you store a number from 0 to 255, or a signed number from -128 to 127. Better than just 1/0, but still limited. You may have heard references to "8-bit gaming"; that's what the 8 bits refer to. The system was built around Bytes.

Then computers grew to have 16-bit registers. This is 2 Bytes, and became known as a WORD (no, I don't know why). Now, numbers could be 0-65535 or -32768 to 32767.

We continued to want more power, and computers were expanded to 32-bit registers. 4 Bytes, 2 Words, also known as a DWORD (double-word). To this day, you can look in "C:\Windows" and see a directory for "system" (old 16-bit pieces) and "system32" (new 32-bit components).

Then came the QWORD (quad-word): 4 WORDs, 8 Bytes, 64 bits. Ever hear of the Nintendo 64? That's where the name came from. Modern architecture is 64-bit: the internals of the CPU contain 64-bit registers, and you can generally run either a 32-bit or a 64-bit operating system on such CPUs.

That covers Bit, Byte, Word, Dword. Those are raw types, and are used often for flags, bitmasks, etc. If you want to hold an actual number, it's best to use signed/unsigned integer, long, etc.
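
For reference, here is a small sketch (mine, not the answer's) of how those names line up with fixed-width C++ types; the real typedefs live in the Windows SDK headers, but the sizes below match the usual Windows meaning:

#include <cstdint>

// Sketch of the Windows-style names expressed with <cstdint> types,
// so the sizes hold on any conforming compiler.
typedef std::uint8_t  BYTE;    //  8 bits: 0 to 255
typedef std::uint16_t WORD;    // 16 bits: 0 to 65535
typedef std::uint32_t DWORD;   // 32 bits: 0 to 4294967295
typedef std::uint64_t QWORD;   // 64 bits

static_assert(sizeof(BYTE)  == 1, "BYTE is 1 byte");
static_assert(sizeof(WORD)  == 2, "WORD is 2 bytes");
static_assert(sizeof(DWORD) == 4, "DWORD is 4 bytes");
static_assert(sizeof(QWORD) == 8, "QWORD is 8 bytes");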

I didn't cover floating point numbers, but hopefully this helps with the general idea.



Answer 2:

DWORD constants are typically written in hex when they are used as flags that can be OR'd together in bitwise fashion. Writing them in hex makes it easier to see that this is the case. That's why you see 0x01, 0x02, 0x04, 0x08, 0x10, 0x20, etc.: programmers just recognise those values as having binary representations with a single bit set.

When it's an enumeration you'd see 0x01, 0x02, 0x03 etc. They'd often still be written in hex because programmers tend to get into these habits!
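
A quick hypothetical sketch of that pattern in C++ (the flag names are made up for illustration, not a real API):

#include <cstdint>
#include <cstdio>

// Hypothetical flags: each constant has exactly one bit set,
// which is obvious in hex and much less obvious in decimal.
const std::uint32_t FLAG_READ   = 0x01;  // 0000 0001
const std::uint32_t FLAG_WRITE  = 0x02;  // 0000 0010
const std::uint32_t FLAG_APPEND = 0x04;  // 0000 0100
const std::uint32_t FLAG_CREATE = 0x08;  // 0000 1000

int main() {
    // Combine flags with bitwise OR...
    std::uint32_t mode = FLAG_READ | FLAG_WRITE;   // 0x03

    // ...and test individual flags with bitwise AND.
    if (mode & FLAG_WRITE)
        std::printf("write is enabled (mode = 0x%02X)\n", (unsigned)mode);
    if (!(mode & FLAG_APPEND))
        std::printf("append is not enabled\n");
    return 0;
}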



Answer 3:

Just for the record, 16-bit unsigned data was named WORD because, at the time, computers had 16-bit registers.

Earlier in computer history, 8 bits was the biggest piece of data you could store in a register. Since it could hold an ASCII character, it was commonly called a CHAR.

But then 16-bit computers came out, and CHAR was not an appropriate name for 16-bit data. So 16-bit data was commonly called a WORD, because it was the biggest unit of data you could store in one register, and the name was a natural continuation of the analogy made for CHAR.

So on computers with a different CPU, WORD commonly refers to the size of the register: on the Saturn CPU, which uses 64-bit registers, a WORD is 64 bits.

When 32-bit x86 processors came out, WORD stayed 16 bits for compatibility reasons, and DWORD was created to extend it to 32 bits. The same is true for QWORD and 64 bits.

As for why hexadecimal is commonly used to describe a WORD, it has to do with the definition of a WORD being tied to its register origin. In assembler programming you use hexadecimal to describe data, because processors only know binary integers (0 and 1), and hexadecimal is a more compact way to write binary while still keeping some of its properties.



Answer 4:

To elaborate on Tim's answer, it's because converting Hex to binary and back is very easy - each Hex digit is 4 binary digits:

0x1 = 0001
0x2 = 0010
...
0xD = 1101
0xE = 1110
0xF = 1111

So, 0x2D = 0010 1101
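
If you want the machine to do the conversion for you, a short C++ sketch with std::bitset shows the same expansion:

#include <bitset>
#include <iostream>

int main() {
    // Each hex digit expands to exactly four binary digits: 0x2D -> 0010 1101.
    std::bitset<8> bits(0x2D);
    std::cout << bits << '\n';                          // 00101101

    // And back again: parse a binary string, print it in hex.
    std::bitset<8> parsed("00101101");
    std::cout << std::hex << parsed.to_ulong() << '\n'; // 2d
    return 0;
}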



Answer 5:

You ask a very interesting and tricky question.

In short, two drivers led to the existence of the two competing type families, the DWORD-based one and the int-based one:

1) The desire for cross-platform portability on the one hand and for strictly sized types on the other.

2) People's conservatism.

In any case, to give a fully detailed answer to your question and enough background on this field, we have to dig into the history of computing and start the story from its early days.

First, there is the notion of a machine word. A machine word is a strictly sized chunk of binary data that is natural for a particular processor to process. Its size is highly processor dependent and in general equal to the size of the processor's general-purpose internal registers. Usually it can be subdivided into two equal parts that the processor can also access as independent chunks of data. For example, on 32-bit x86 processors the machine word size is 32 bits. That means all the general registers (eax, ebx, ecx, edx, esi, edi, ebp, esp and eip) have the same size, 32 bits, but many of them can also be accessed as parts of the register: you can access eax as a 32-bit data chunk, ax as a 16-bit data chunk, or even al as an 8-bit data chunk, yet physically it is all one 32-bit register. You can find very good background on this on Wikipedia (http://en.wikipedia.org/wiki/Word_(computer_architecture)). In short, the machine word is the size of the data chunk that can be used as an integer operand of a single instruction. Even today, different processor architectures have different machine word sizes.
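
To make the eax/ax/al idea concrete, here is a small C++ sketch (mine, using shifts and masks rather than real registers) that views one 32-bit value as a 16-bit half and as individual bytes:

#include <cstdint>
#include <cstdio>

int main() {
    std::uint32_t eax = 0x12345678;   // a 32-bit "machine word", in the spirit of eax

    // The low 16-bit half, in the spirit of ax.
    std::uint16_t ax = static_cast<std::uint16_t>(eax & 0xFFFF);     // 0x5678

    // The high and low bytes of that half, in the spirit of ah and al.
    std::uint8_t ah = static_cast<std::uint8_t>((ax >> 8) & 0xFF);   // 0x56
    std::uint8_t al = static_cast<std::uint8_t>(ax & 0xFF);          // 0x78

    std::printf("eax=0x%08X ax=0x%04X ah=0x%02X al=0x%02X\n",
                (unsigned)eax, (unsigned)ax, (unsigned)ah, (unsigned)al);
    return 0;
}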

OK, now we have some understanding of the machine word. Time to get back to the history of computing. The first Intel x86 processor to become popular, the 8086, had a 16-bit word size. It came to market in 1978. At that time assembler was highly popular, if not the primary programming language. As you know, assembler is just a very thin wrapper over the native processor language, so it is entirely hardware dependent. When Intel pushed their new 8086 processor onto the market, the first thing they needed in order to succeed was to push an assembler for the new processor onto the market too: nobody wants a processor that nobody knows how to program. When Intel chose the names for the different data types in the 8086 assembler, they made the obvious choice and named the 16-bit data chunk a word, because the machine word of the 8086 was 16 bits. Half of the machine word was called a byte (8 bits), and two words used as one operand were called a double word (32 bits). Intel used these terms in the processor manuals and in the assembler mnemonics (db, dw and dd for static allocation of a byte, word and double word).

Years passed, and in 1985 Intel moved from the 16-bit architecture to the 32-bit architecture with the introduction of the 80386 processor. But by that time there was a huge number of developers accustomed to a word being a 16-bit value, and a huge amount of software had been written in the firm belief that a word is 16 bits; much of the existing code relied on that fact. So even though the machine word size actually changed, the notation stayed the same, except that a new data type arrived in the assembler: the quad word (64 bits), needed because the instructions that operated on two machine words stayed the same while the machine word itself had been extended. In the same way, the double quad word (128 bits) has now arrived with the 64-bit AMD64 architecture. As a result we have:

byte    =   8 bit
word    =  16 bit
dword   =  32 bit
qword   =  64 bit
dqword  = 128 bit

Note the main thing about this type family: it is a family of strictly sized types, because it comes from, and is heavily used in, assembler, which requires data types of constant size. The years pass one by one, but the data types in this family keep the same constant sizes, even though their names no longer carry their original meaning.

On the other hand, over those same years, high-level languages became more and more popular. Because those languages were developed with cross-platform applications in mind, they looked at the sizes of their internal data types from an entirely different point of view. If I understand correctly, no high-level language clearly claims that some of its internal data types have a fixed, constant size that will never change in the future. Let's look at C++ as an example. The C++ standard says:

"The fundamental storage unit in the C++ memory model is the byte. A byte is at 
least large enough to contain any member of the basic execution character set and 
is composed of a contiguous sequence of bits, the number of which is implementa-
tion-defined. The least significant bit is called the low-order bit; the most 
significant bit is called the high-order bit. The memory available to a C++ program
consists of one or more sequences of contiguous bytes. Every byte has a unique 
address."

So we see something surprising: in C++ even the byte doesn't have a fixed size. Even though we are accustomed to thinking that a byte is 8 bits, according to C++ it can be not only 8 but also 9, 10, 11, 12, or more bits in size. (It cannot, however, be fewer than 8: a byte must hold the basic execution character set, and CHAR_BIT is required to be at least 8.)

"There are five signed integer types: “signed char”, “short int”, “int”, and 
“long int”., and “long long int”. In this list, each type provides at least as 
much storage as those preceding it in the list. Plain ints have the natural size
suggested by the architecture of the execution environment; the other signed 
integer types are provided to meet special needs."

That quote makes two main claims:

1) sizeof(char) <= sizeof(short) <= sizeof(int) <= sizeof(long) <= sizeof(long long)

2) Plain ints have the natural size suggested by the architecture of the execution environment. That means int should have the machine word size of the target processor architecture.

You can go through the whole text of the C++ standard, but you will fail to find anything like "the size of int is 4 bytes" or "the length of long is 64 bits". The sizes of particular C++ integer types can change when you move from one processor architecture to another, and from one compiler to another. But even when you write a program in C++ you will periodically face the requirement to use data types with a well-known constant size.
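
When you do need a guaranteed size in portable C++, the fixed-width typedefs from <cstdint> exist for exactly that reason; a minimal sketch:

#include <cstdint>
#include <climits>

int main() {
    // int's width is implementation-defined, but these widths are not.
    static_assert(CHAR_BIT >= 8, "a byte is at least 8 bits");

    std::int16_t  exactly_16 = -1234;        // exactly 16 bits where the type exists
    std::uint32_t exactly_32 = 0xDEADBEEF;   // exactly 32 bits where the type exists
    std::int_least32_t at_least_32 = 100000; // always available, at least 32 bits

    // sizeof(int) may legally differ between compilers and targets,
    // which is exactly why these fixed-width names exist.
    (void)exactly_16; (void)exactly_32; (void)at_least_32;
    return 0;
}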

At least the earlier compiler developers followed those claims of the standard. But now we can see people's conservatism coming into play one more time. People got used to thinking that int is 32 bits and can store values in the range from -2,147,483,648 to 2,147,483,647. Earlier, when the industry crossed the border between the 16-bit and 32-bit architectures, the second claim was strictly enforced: when you used a C++ compiler to create a 16-bit program, the compiler used an int with a 16-bit size, the "natural size" for 16-bit processors, and, in contrast, when you used another C++ compiler to create a 32-bit program from the same source code, the compiler used an int with a 32-bit size, the "natural size" for 32-bit processors. Nowadays, if you look at the Microsoft C++ compiler, for example, you will find that it uses a 32-bit int regardless of the target processor architecture (32-bit or 64-bit), simply because people are used to thinking that int is 32 bits!

In summary, we can see that there are two families of data types, DWORD-based and int-based. The motivation for the second is obvious: cross-platform application development. The motivation for the first covers all the cases where taking the sizes of variables into account makes sense. Among others, we can mention the following cases:

1) You need to hold a value in a predefined, well-known range, and you need to use it in a class or another data structure that will be instantiated a huge number of times at run time. If you use int-based types to store that value, you get the drawback of a large memory overhead on some architectures, and the logic can potentially break on others. For example, you need to manipulate values in the range from 0 to 1,000,000. If you store them in an int, your program will behave correctly if int is 32 bits, will carry a 4-byte memory overhead per value instance if int is 64 bits, and won't work correctly if int is 16 bits.

2) Data involved in networking. To handle your networking protocol correctly on different PCs, you need to specify it in a plain, size-based format that describes every packet and header bit by bit. Your network communication will be completely broken if on one PC your protocol header is 20 bytes long with a 32-bit int, while on another PC it is 28 bytes long with a 64-bit int.

3) Your program needs to store values used by special processor instructions, or it communicates with modules or chunks of code written in assembler.

4) You need to store values used to communicate with devices. Each device has its own specification that describes what sort of input it requires and in what form it provides output. If a device requires a 16-bit value as input, it must receive exactly a 16-bit value, regardless of the int size and even regardless of the machine word size used by the processor in the system where the device is installed.

5) Your algorithm relies on integer overflow logic. For example, you have an array of 2^16 entries, and you want to walk through it infinitely and sequentially, refreshing the entry values. If you use a 16-bit unsigned index, the wrap-around happens for free and your program works perfectly; the moment you move to a 32-bit int index with the same logic, you get out-of-range array accesses (see the sketch just below).
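
Here is a sketch of case 5 in C++ (an assumed example of mine, using an unsigned 16-bit index so the wrap-around is well defined):

#include <cstdint>

int main() {
    static int table[65536];   // 2^16 entries

    // A 16-bit unsigned index wraps from 65535 back to 0,
    // so this loop cycles through the array forever without going out of range.
    std::uint16_t i = 0;
    for (long pass = 0; pass < 200000; ++pass) {
        table[i] += 1;   // always a valid index: 0 .. 65535
        ++i;             // 65535 + 1 wraps to 0
    }

    // With a 32-bit index and the same "just keep incrementing" logic,
    // i would reach 65536 and the access would run past the end of the array.
    return 0;
}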

Because of this, Microsoft uses both families of data types: int-based types in the cases where the actual data size is not very important, and DWORD-based types in the cases where it is. Even then, Microsoft defines both as macros/typedefs, so that the virtual type system Microsoft uses can be adapted quickly and easily to a particular processor architecture and/or compiler by assigning the correct C++ equivalent to each name.

I hope that I have covered the question about the origin of the data types and their differences quite well.

So, we can switch to the second question: why are hexadecimal digits used to denote the values of DWORD-based data types? There are actually a few reasons:

1) If we use strictly sized binary data types, it is natural enough that we may want to look at them in binary form.

2) It is much easier to understand bitmask values when they are encoded in binary form. Agree that it is much easier to see which bits are set and which are cleared when the value is in this form

1100010001011001

than when it is encoded like this

50265

3) Data encoded in a binary-aligned form such as hexadecimal and describing one DWORD-based value has a constant length, whereas the same data encoded in decimal form has a variable length. Note that even when a small number is encoded this way, the full-width value is usually written out:

0x00000100

instead of

0x100

This property is very attractive when analysis of a huge amount of binary data is required, for example in a hex editor, or when inspecting your program's raw memory in a debugger after a breakpoint is hit. Agree that it is much more comfortable to look at neat columns of values than at a heap of weakly aligned, variable-size values.

So we have decided that we want a binary-aligned encoding. We have three choices: plain binary, octal, and hexadecimal. People prefer hexadecimal because it is the shortest of the available encodings. Just compare

10010001101000101011001111000

and

0x12345678

Can you quickly find the number of the bit that is set in this value?

00000000100000000000000000000

And in this one?

0x00100000

In the second case you can quickly split the number into four separate bytes

0x00 0x10 0x00 0x00
   3    2    1    0

In each byte, the first hex digit denotes the 4 most significant bits and the second the 4 least significant bits. After you spend some time working with hex values you will remember the binary equivalent of each hexadecimal digit and will translate between the two in your head without any problems:

0 - 0000  4 - 0100  8 - 1000  C - 1100
1 - 0001  5 - 0101  9 - 1001  D - 1101
2 - 0010  6 - 0110  A - 1010  E - 1110
3 - 0011  7 - 0111  B - 1011  F - 1111

So we need only a second or two to find that bit number 20 is set!
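
And if you want to double-check that kind of reading, a tiny C++ sketch can walk the bits for you:

#include <cstdint>
#include <cstdio>

int main() {
    std::uint32_t value = 0x00100000;

    // Walk all 32 bit positions and report the ones that are set.
    for (int bit = 31; bit >= 0; --bit) {
        if (value & (UINT32_C(1) << bit))
            std::printf("bit %d is set\n", bit);   // prints: bit 20 is set
    }
    return 0;
}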

People use hex because it is the shortest form of binary data encoding, and the most comfortable to understand and use.