How can a file contain null bytes?

Posted 2019-03-17 08:22

Question:

How is it possible that files can contain null bytes in operating systems written in a language with null-terminating strings (namely, C)?

For example, if I run this shell code:

$ printf "Hello\00, World!" > test.txt
$ xxd test.txt
0000000: 4865 6c6c 6f00 2c20 576f 726c 6421       Hello., World!

I see a null byte in test.txt (at least in OS X). If C uses null-terminating strings, and OS X is written in C, then how come the file isn't terminated at the null byte, resulting in the file containing Hello instead of Hello\00, World!? Is there a fundamental difference between files and strings?

Answer 1:

Null-terminated strings are a C construct used to determine the end of a sequence of characters intended to be used as a string. String manipulation functions such as strcmp, strcpy, strchr, and others use this construct to perform their duties.

But you can still read and write binary data that contains null bytes within your program as well as to and from files. You just can't treat them as strings.

Here's an example of how this works:

#include <stdio.h>
#include <stdlib.h>

int main()
{
    FILE *fp = fopen("out1","w");
    if (fp == NULL) {
        perror("fopen failed");
        exit(1);
    }

    int a1[] = { 0x12345678, 0x33220011, 0x0, 0x445566 };
    char a2[] =  { 0x22, 0x33, 0x0, 0x66 };
    char a3[] = "Hello\x0World";

    // this writes the whole array
    fwrite(a1, sizeof(a1[0]), 4, fp);
    // so does this
    fwrite(a2, sizeof(a2[0]), 4, fp);
    // this does not write the whole array -- only "Hello" is written
    fprintf(fp, "%s\n", a3);
    // but this does
    fwrite(a3, sizeof(a3[0]), 12, fp);
    fclose(fp);
    return 0;
}

Contents of out1:

[dbush@db-centos tmp]$ xxd out1
0000000: 7856 3412 1100 2233 0000 0000 6655 4400  xV4..."3....fUD.
0000010: 2233 0066 4865 6c6c 6f0a 4865 6c6c 6f00  "3.fHello.Hello.
0000020: 576f 726c 6400                           World.

For the first array, because we use the fwrite function and tell it to write 4 elements the size of an int, all the values in the array appear in the file. You can see from the output that all values are written, the values are 32-bit, and each value is written in little-endian byte order. We can also see that the second and fourth elements of the array each contain one null byte, while the third value being 0 has 4 null bytes, and all appear in the file.

We also use fwrite on the second array, which contains elements of type char, and we again see that all array elements appear in the file. In particular, the third value in the array is 0, which consists of a single null byte that also appears in the file.

The third array is first written with the fprintf function using a %s format specifier which expects a string. It writes the first 5 bytes of this array to the file before encountering the null byte, after which it stops reading the array. It then prints a newline character (0x0a) as per the format.

The third array is written to the file again, this time using fwrite. The string constant "Hello\x0World" contains 12 bytes: 5 for "Hello", one for the explicit null byte, 5 for "World", and one for the null byte that implicitly ends the string constant. Since fwrite is given the full size of the array (12), it writes all of those bytes. Indeed, looking at the file contents, we see each of those bytes.

As a side note, in each of the fwrite calls, I've hardcoded the element count for the third parameter instead of using a more dynamic expression such as sizeof(a1)/sizeof(a1[0]), to make it clearer exactly how many elements are being written in each case.



Answer 2:

Null-terminated strings are certainly not the only thing that you can put into a file. Operating system code does not consider a file to be a vehicle for storing null-terminated strings: an operating system presents a file as a collection of arbitrary bytes.

As far as C is concerned, I/O APIs exist for writing files in binary mode. Here is an example:

char buffer[] = {0, 1, 0, 2, 0, 3, 0, 4, 0, 5};
FILE *f = fopen("data.bin","wb");  // "w" is for write, "b" is for binary
if (f != NULL) {
    fwrite(buffer, 1, sizeof(buffer), f);  // writes all ten bytes, zeros included
    fclose(f);
}

This C code creates a file called "data.bin", and writes ten bytes into it. Note that although buffer is a character array, it is not a null-terminated string.



Answer 3:

Because a file is just a stream of bytes, and any byte value may appear in it, including the null byte. Some files are called text files when they contain only a subset of all the possible bytes: the printable ones (roughly alphanumerics, spaces, punctuation).

C strings are sequences of bytes terminated by a null byte; this is purely a matter of convention. They are too often a source of confusion: since a C string is just a sequence terminated by null, any run of non-null bytes followed by a null byte is a valid C string, even one that contains a non-printable byte or a control character. Be careful, because your example is not C code! In C, printf("dummy\000foo"); will never print foo, as printf treats the C string as starting at d and ending at the null byte in the middle. Some compilers complain about such a string literal.

There is no direct link between C strings (which generally also contain only printable characters) and text files. Printing a C string into a file generally just stores its sequence of non-null bytes.



Answer 4:

While null bytes are used to terminate strings and are needed by string manipulation functions (so they know where the string ends), in binary files \0 bytes can appear anywhere.

Consider, for example, a binary file of 32-bit numbers: every value smaller than 2^24 contains at least one null byte (for example 0x001a00c7, or the 64-bit value 0x0000000a00001a4d).

The same goes for UTF-16, where every ASCII character has a leading or trailing \0 depending on the endianness, and strings need to end with \0\0.

A lot of files even have blocks padded (to 4kB or even 64kB) with \0 bytes, to have quick access to the desired blocks.

For even more null-bytes in a file, take a look at sparse files, where all bytes are \0 by default, and blocks full of null-bytes aren't even stored on disk to save space.



Answer 5:

Consider the usual C function calls for writing data to files — write(2):

ssize_t
write(int fildes, const void *buf, size_t nbyte);

… and fwrite(3):

size_t
fwrite(const void *restrict ptr, size_t size, size_t nitems, FILE *restrict stream);

Neither of these functions accepts a const char * NUL-terminated string. Rather, they take an array of bytes (a const void *) with an explicit size. These functions treat NUL bytes just like any other byte value.



Answer 6:

Before answering anything, please note that

(note: according to n.m. (see the comments on the OP), "a Byte is the smallest quantity available to write out to disk with the C standard library, non-standard libraries may well deal with bits or anything else." So what I say below about WORD sizes being the smallest quantity is probably not accurate, but it may still provide some insight.)

NULL is always 0_decimal (practically)

dec: 0
hex: 0x00000000
bin: 00000000 00000000 00000000 00000000

although its actual value is defined by a programming language's specification, so use the defined constant NULL instead of hardcoding 0 everywhere (in case it changes, when hell freezes over).

ASCII encoding for character '0' is 48_decimal

dec: 48
hex: 0x00000030
bin: 00000000 00000000 00000000 00110000

The concept of NULL doesn't exist in a file, but within the generating app's programming language. Just the numeric encoding/value of NULL exists in a file.

How is it possible that files can contain null bytes in operating systems written in a language with null-terminating strings (namely, C)?

With the above stated this question becomes, how can a file contain 0? The answer is now trivial.

For example, if I run this shell code:

$ printf "Hello\00, World!" > test.txt
$ xxd test.txt
0000000: 4865 6c6c 6f00 2c20 576f 726c 6421       Hello., World!

I see a null byte in test.txt (at least in OS X). If C uses null-terminating strings, and OS X is written in C, then how come the file isn't terminated at the null byte, resulting in the file containing Hello instead of Hello\00, World!?

Is there a fundamental difference between files and strings?

Assuming an ASCII character encoding (1-byte/8-bit characters in the decimal range of 0 and 127):

  • Strings are buffers/char-arrays of 1-byte characters (where NULL = 0_decimal and '0' = 48_decimal).
  • Files are sequences of either 32-bit or 64-bit "WORDS" (depends on OS and hardware, ie x86 or x64 respectively).

Therefore, a 32-bit OS file that contains only ASCII strings will be a sequence of 32-bit (4-byte) words that range between the decimal values 0 and 127, essentially using only the first byte of the 4-byte word (b2: base-2, decimal is base-10 and hex base-16, fyi)

  0_b2: 00000000 00000000 00000000 00000000
 32_b2: 00000000 00000000 00000000 00100000
 64_b2: 00000000 00000000 00000000 01000000
 96_b2: 00000000 00000000 00000000 01100000
127_b2: 00000000 00000000 00000000 01111111
128_b2: 00000000 00000000 00000000 10000000

Whether this byte is left-most or right-most depends on the system's endianness.

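But to answer your question about a "missing" NULL after Hello\00, World!: nothing was dropped or substituted. The shell's printf writes only the bytes of its argument, without a terminating null, and a file does not store an EOL/EOF marker byte at its end; the end of the file is determined by its length, which is why you're not seeing anything after the ! in the output.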

Note: I'm sure modern OSes (and classic Unix-based systems) optimize the storage of ASCII characters, so that 1 word (4 bytes) can pack in 4 characters. Things change with the UTF encodings, however, since they have much larger character sets to represent (such as tens of thousands of Kanji/Japanese characters) and may need more bits per character. UTF-8 is backward compatible with ASCII: ASCII characters keep their single-byte encoding, while other characters take two to four bytes.

Note: C/C++ does in fact "pack" 4 characters into a single 4-byte word using character arrays (ie, strings). Since each char is 1-byte, the compiler will allocate and treat it as 1-byte, arithmetically, on the stack or heap. So if you declare an array in a function (ie, an auto-variable), like so

char str1[7] = {'H','e','l','l','o','!','\0'};

where the function stack begins at address 1000_b10 (base-10/decimal), then you have:

072 101 108 108 111 033

addr  char         binary   decimal
----  ------------ -------- -------
1000: str1[0] 'H'  01001000 (072)
1001: str1[1] 'e'  01100101 (101)
1002: str1[2] 'l'  01101100 (108)
1003: str1[3] 'l'  01101100 (108)
1004: str1[4] 'o'  01101111 (111)
1005: str1[5] '!'  00100001 (033)
1006: str1[6] '\0' 00000000 (000)

Since RAM is byte-addressable, every address references a single byte.