-->

Understanding `read, write` system calls in Unix

Posted 2020-08-03 04:20

Question:

My Systems Programming project has us implementing a compression/decompression program that crunches down ASCII text files by removing the zero top bit of each character and writing the output to a separate file, depending on whether the compression or decompression routine is running. To do this, the professor has required us to use binary files and Unix system calls, which include open, close, read, write, etc.

From my understanding of read and write, they transfer binary data in chunks of a defined number of bytes. However, since this data is binary, I'm not sure how to parse it.

This is a stripped down version of my code, minus the error checking:

#include <fcntl.h>   // open, O_RDONLY

#define BUFFER 4096  // buffer size, tunable to system preference

void compress(char readFile[]){
  char buffer[BUFFER];
  int openReadFile;
  openReadFile = open(readFile, O_RDONLY);  // error checking omitted
}

If I use read to read the data into buffer, will the data in buffer be in binary or character format? Nothing I've come across addresses that detail, and it's very relevant to how I parse the contents.

Answer 1:

read() will read the bytes in without any interpretation (so "binary" mode).

Since the data is binary and you want to access the individual bytes, you should use a buffer of unsigned char: unsigned char buffer[BUFFER]. You can treat char/unsigned char as bytes; they'll be 8 bits on Linux.
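As a rough sketch of what that looks like in practice, the helper below fills an unsigned char buffer with raw bytes from a file descriptor. The name read_full is made up for this example; the loop is there because a single read() is allowed to return fewer bytes than requested.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: keep calling read() until `len` bytes have arrived,
 * end of file is reached, or an error occurs. Returns the number of bytes
 * stored in `buf`, or -1 on error. The bytes are stored exactly as they
 * appear in the file, with no translation. */
ssize_t read_full(int fd, unsigned char *buf, size_t len)
{
    size_t total = 0;

    assert(buf != NULL);
    while (total < len) {
        ssize_t n = read(fd, buf + total, len - total);
        if (n == 0)
            break;              /* end of file */
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted by a signal: retry */
            return -1;
        }
        total += (size_t)n;
    }
    return (ssize_t)total;
}
```

Each element of buf can then be inspected with ordinary bit operations, for example buf[i] & 0x7F to look at the low 7 bits.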

Now, since what you're dealing with is 8 bit ascii compressed down to 7 bit, you'll have to convert those 7 bits into 8 bits again so you can make sense of the data.

To explain what's been done, consider the text Hey. That's 3 bytes. The bytes have 8 bits each, and in ASCII these are the bit patterns:

01001000 01100101 01111001

Now remove the most significant bit from each byte and shift the remaining bits to the left to close the gaps.

X1001000 X1100101 X1111001

Above, X is the bit to be removed. Removing those and shifting the others, you end up with bytes with this pattern:

10010001 10010111 11001000

The rightmost 3 bits are just filled in with 0. So far no space is saved, though: there are still 3 bytes. With a string of 8 bytes, we'd save 1 byte, as that would compress down to 7 bytes.
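The packing described above could be sketched roughly like this. The function name pack7 and the accumulator approach are just one way to do it, not necessarily what your assignment expects; it assumes the output buffer has room for (n * 7 + 7) / 8 bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: pack the low 7 bits of each of the `n` input bytes
 * into `out`, most significant bit first. `out` must have room for
 * (n * 7 + 7) / 8 bytes. Returns the number of bytes written. */
size_t pack7(const unsigned char *in, size_t n, unsigned char *out)
{
    unsigned int acc = 0;   /* bit accumulator */
    int nbits = 0;          /* number of valid bits currently held in acc */
    size_t o = 0;

    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        acc = (acc << 7) | (in[i] & 0x7Fu);  /* drop the top bit, append 7 bits */
        nbits += 7;
        if (nbits >= 8) {
            nbits -= 8;
            out[o++] = (unsigned char)(acc >> nbits);  /* emit the top 8 bits */
            acc &= (1u << nbits) - 1u;                 /* keep only the leftover bits */
        }
    }
    if (nbits > 0)                                     /* zero-pad the final byte */
        out[o++] = (unsigned char)(acc << (8 - nbits));
    return o;
}
```

Calling pack7 on the 3 bytes of Hey gives the 3 bytes 10010001 10010111 11001000 shown above, and 8 input bytes come out as 7.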

Now you have to do the reverse on the bytes you've read back in.
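A minimal sketch of that reverse step follows. The name unpack7 is made up, and the sketch assumes the caller already knows how many characters to recover, because the zero padding in the last byte makes the count ambiguous from the packed data alone (7 packed bytes could hold 7 or 8 characters). How you store that count is up to your file format.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: rebuild `nchars` 8-bit characters from a packed
 * 7-bit stream. `in` must hold at least (nchars * 7 + 7) / 8 bytes and
 * `out` must have room for `nchars` bytes. Returns nchars. */
size_t unpack7(const unsigned char *in, size_t nchars, unsigned char *out)
{
    unsigned int acc = 0;   /* bit accumulator */
    int nbits = 0;          /* number of valid bits currently held in acc */
    size_t i = 0;           /* next packed byte to consume */

    assert(in != NULL && out != NULL);
    for (size_t o = 0; o < nchars; o++) {
        if (nbits < 7) {                     /* not enough bits: pull in a byte */
            acc = (acc << 8) | in[i++];
            nbits += 8;
        }
        nbits -= 7;
        out[o] = (unsigned char)((acc >> nbits) & 0x7Fu);  /* top bit is 0 again */
        acc &= (1u << nbits) - 1u;           /* keep only the leftover bits */
    }
    return nchars;
}
```

Feeding it the bytes 10010001 10010111 11001000 with nchars set to 3 gives back the bytes for Hey.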



Answer 2:

I'll quote the manual page of the fopen function (which is built on top of the open system call) from http://www.kernel.org/doc/man-pages/online/pages/man3/fopen.3.html

The mode string can also include the letter 'b' either as a last character or as a character between the characters in any of the two-character strings described above. This is strictly for compatibility with C89 and has no effect; the 'b' is ignored on all POSIX conforming systems, including Linux

So even the high level function ignores the mode :-)
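If you want to convince yourself, a small check along these lines should do it. The helper names slurp and same_in_both_modes are invented for this sketch, and it only looks at the first 256 bytes of the file; on a POSIX system both modes should hand back identical bytes.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: read up to `cap` bytes of `path` using the given
 * fopen mode. Returns the byte count, or -1 if the file cannot be opened. */
static long slurp(const char *path, const char *mode,
                  unsigned char *buf, size_t cap)
{
    FILE *f = fopen(path, mode);
    long n;

    if (f == NULL)
        return -1;
    n = (long)fread(buf, 1, cap, f);
    fclose(f);
    return n;
}

/* Returns 1 when "r" and "rb" yield the same leading bytes for `path`,
 * and 0 when they differ or the file cannot be opened. */
int same_in_both_modes(const char *path)
{
    unsigned char a[256], b[256];
    long na, nb;

    assert(path != NULL);
    na = slurp(path, "r", a, sizeof a);
    nb = slurp(path, "rb", b, sizeof b);
    if (na < 0 || nb < 0)
        return 0;
    return na == nb && memcmp(a, b, (size_t)na) == 0;
}
```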



Answer 3:

It will read the binary content of the file and load it into the memory that buffer points to. On Linux a byte is 8 bits and so is a char, so if the file is a regular plain-text document you'll end up with printable characters in the buffer. Be careful with how it ends: read does not append a terminating '\0'; it returns the number of bytes read (which equals the number of characters in an ASCII-encoded plain-text file).
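To illustrate the point about the ending, here is a rough sketch that uses the return value of read to terminate the buffer. The name read_text is made up, and it does a single read only, so it may return less than the whole file.

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: perform one read() and turn the result into a C
 * string. Reads at most cap - 1 bytes so there is always room for the
 * terminator. Returns the byte count from read(), or -1 on error. */
ssize_t read_text(int fd, char *buf, size_t cap)
{
    ssize_t n;

    assert(buf != NULL && cap > 0);
    n = read(fd, buf, cap - 1);
    if (n < 0)
        return -1;
    buf[n] = '\0';   /* read() never adds this itself */
    return n;
}
```

This only makes sense for text. For the compressed output, where a 0 byte can be legitimate data, keep using the returned count instead of a terminator.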

Edit: in case the file you're reading isn't a text file but a collection of binary representations, you can give the buffer the same type as the data stored in the file, even if it's a struct.
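For example, something like the following, where struct record and read_record are invented purely for illustration. It assumes the file was written by the same program on the same kind of machine, since struct padding and byte order are not portable.

```c
#include <assert.h>
#include <stdint.h>
#include <unistd.h>

/* Hypothetical record layout, for illustration only. */
struct record {
    uint32_t id;
    uint32_t length;
};

/* Read exactly one record from fd straight into the struct. Returns 1 on
 * success, 0 on end of file, and -1 on error. A short read is treated as
 * a truncated record (also -1) in this sketch. */
int read_record(int fd, struct record *rec)
{
    ssize_t n;

    assert(rec != NULL);
    n = read(fd, rec, sizeof *rec);
    if (n == 0)
        return 0;
    if (n != (ssize_t)sizeof *rec)
        return -1;
    return 1;
}
```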