Prepending to a multi-gigabyte file

Posted 2020-08-15 12:12

Question:

What would be the most performant way to prepend a single character to a multi-gigabyte file (in my practical case, a 40GB file)?

There is no restriction on how this is implemented: it can be done with a tool, a shell script, a program in any programming language, ...

Answer 1:

There is no really simple solution. There is no portable system call to prepend data to a file; you can only append or overwrite in place.

But depending on what you're doing with the file, you may get away with tricks. If the file is read sequentially, you could create a named pipe, run cat onecharfile.txt bigfile > namedpipe, and then hand "namedpipe" to the consumer as the file. The same can be achieved with cat onecharfile.txt bigfile | program if your program reads its input from stdin.
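
The same "never materialise the prepended file" idea can also be sketched inside a program. The following is a minimal Python illustration, not part of the original answer: the function name iter_with_prefix and the 1 MiB default chunk size are assumptions chosen for the example. It yields the prefix first and then streams the untouched file to a sequential consumer.

```python
import os
import tempfile
from typing import Iterator


def iter_with_prefix(prefix: bytes, path: str, chunk_size: int = 1 << 20) -> Iterator[bytes]:
    """Yield `prefix`, then the contents of `path` in chunks; the file is never modified."""
    if prefix:
        yield prefix
    with open(path, "rb") as src:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            yield chunk


if __name__ == "__main__":
    # Tiny throwaway file standing in for the 40GB one.
    with tempfile.TemporaryDirectory() as tmp:
        big = os.path.join(tmp, "bigfile")
        with open(big, "wb") as f:
            f.write(b"hello world")
        for piece in iter_with_prefix(b"C", big, chunk_size=4):
            print(piece)
```

Like the pipe, this only helps consumers that read front to back; it offers no random access.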

For random access, a FUSE filesystem could present the prepended view, but that is probably way too complicated for this.

If you want to get your hands really dirty, figure out how to:

  • allocate a data block (this requires understanding the filesystem's inode and data block structure)
  • insert it into the file's block chain as the second block (or as the first, and then you're practically done)
  • write the beginning of the file into that block
  • write the single character as the first byte of the file
  • mark the first block as using only one byte of its available payload (this is possible for the last block; I don't know whether it's possible for blocks in the middle of a file's chain).

This has a real chance of badly wrecking your filesystem, though, so it is not recommended; good fun all the same.



Answer 2:

Let the file have an initial block of null characters. When you prepend a character, read the block, insert the character right-to-left (that is, into the last null position that is still free), and write the block back. When the block is full, do the more expensive full rewrite in order to prepend another null block. That way, you reduce by a large factor the number of times you have to do a full rewrite.
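
A minimal Python sketch of this padding-block scheme follows. The function names, the 4096-byte block size and the rule that neither prepended bytes nor the payload's first byte may be null (so the leading run of nulls marks the free space) are all assumptions made for the illustration.

```python
import os

PAD = b"\0"


def create_padded(path: str, payload: bytes, block_size: int = 4096) -> None:
    """Write `payload` preceded by one block of null padding (the expensive step)."""
    with open(path, "wb") as f:
        f.write(PAD * block_size)
        f.write(payload)


def prepend_byte(path: str, byte: bytes, block_size: int = 4096) -> bool:
    """Store one non-null byte in the last free padding slot.

    Returns False when the padding is used up, meaning a full rewrite is needed.
    """
    if len(byte) != 1 or byte == PAD:
        raise ValueError("expected a single non-null byte")
    with open(path, "r+b") as f:
        block = f.read(block_size)
        free = len(block) - len(block.lstrip(PAD))  # length of the leading null run
        if free == 0:
            return False
        f.seek(free - 1)
        f.write(byte)  # overwrites one padding byte in place; file size is unchanged
    return True


def read_without_padding(path: str) -> bytes:
    """Return the content with the leading padding skipped (fine for small test files)."""
    with open(path, "rb") as f:
        return f.read().lstrip(PAD)
```

Every reader of the file has to know to skip the leading nulls, so this only works when you control the consumers of the file.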

Added: Keep the file in two subfiles: A (a short one) and B (a long one). Prepend to A any way you like. When A gets "big enough", prepend A to B (by re-writing), and clear A.

Another way: Keep the file as a directory of small files ..., A000003, A000002, A000001.
Just prepend to the largest-numbered file. When it's big enough, make the next file in sequence.
When you need to read the file, just read them all in descending order.
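
The directory-of-segments variant could look roughly like the Python sketch below. The A-prefixed six-digit names follow the answer; the function names and the max_segment_size threshold are assumptions for the example. The existing large file would simply become the lowest-numbered segment.

```python
import os
from typing import List


def _segments_descending(directory: str) -> List[str]:
    """Segment names such as A000003, A000002, A000001, newest first."""
    names = [n for n in os.listdir(directory) if n.startswith("A") and n[1:].isdigit()]
    return sorted(names, key=lambda n: int(n[1:]), reverse=True)


def prepend(directory: str, data: bytes, max_segment_size: int = 1 << 20) -> None:
    """Prepend `data` by rewriting only the newest small segment, or by starting a new one."""
    segments = _segments_descending(directory)
    if segments:
        newest = os.path.join(directory, segments[0])
        if os.path.getsize(newest) + len(data) <= max_segment_size:
            with open(newest, "rb") as f:
                old = f.read()
            with open(newest, "wb") as f:  # not atomic; acceptable for a sketch
                f.write(data + old)
            return
        next_number = int(segments[0][1:]) + 1
    else:
        next_number = 1
    with open(os.path.join(directory, f"A{next_number:06d}"), "wb") as f:
        f.write(data)


def read_all(directory: str) -> bytes:
    """Concatenate the segments in descending order to obtain the logical file."""
    parts = []
    for name in _segments_descending(directory):
        with open(os.path.join(directory, name), "rb") as f:
            parts.append(f.read())
    return b"".join(parts)
```

For a file of the size in question, read_all would need to stream the segments rather than join them in memory; it is kept simple here to show the ordering.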



Answer 3:

You might be able to invert your implementation depending on your problem: append single characters to the end of your file. When it comes time to read the file, read it in reverse.

Hide this behind enough of an abstraction layer and it may not make a difference to your code how the bytes are physically stored.
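
As a rough Python sketch of such an abstraction layer (the function names and chunk size are assumptions, and it presumes the file is stored byte-reversed from the start, which for an existing file means one initial rewrite): a logical prepend becomes a physical append, and readers walk the file from the end backwards.

```python
import os
from typing import Iterator


def prepend_logical(path: str, data: bytes) -> None:
    """Logically prepend `data` by appending its bytes, reversed, to the physical file."""
    with open(path, "ab") as f:
        f.write(data[::-1])


def read_logical(path: str, chunk_size: int = 1 << 20) -> Iterator[bytes]:
    """Yield the logical content by reading the physical file from the end backwards."""
    with open(path, "rb") as f:
        pos = f.seek(0, os.SEEK_END)  # seek returns the new position, i.e. the file size
        while pos > 0:
            step = min(chunk_size, pos)
            pos -= step
            f.seek(pos)
            yield f.read(step)[::-1]  # un-reverse each chunk before handing it out
```

Appending is cheap on any filesystem, so each logical prepend costs only the bytes written.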



Answer 4:

If you use Linux, you could try a custom version of read(2), loaded with LD_PRELOAD, that prepends your data on the first read.

See https://zlibc.linux.lu/zlibc.html for implementation inspiration.



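Answer 5:

If you mean prepending that character to the start of the entire file, one way is the following (printf rather than echo, so that no trailing newline is written after the character):

$ printf 'C' > tmp
$ cat my40gbfile >> tmp
$ mv tmp my40gbfile

or using GNU sed, substituting at the start of line 1 (a plain '1i C' would insert a whole new line instead; note that -i still rewrites the file through a temporary copy):

$ sed -i '1s/^/C/' my40gbfile

If you mean prepending the character to every line of the file:

$ awk '{print "C"$0}' my40gbfile > temp && mv temp my40gbfile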


Answer 6:

As I understand it, this is a limitation at the file system level: if you prepend data to a file, the file is effectively rewritten. This is the same reason why ID3 tags in MP3 files are zero-padded, so that future updates only touch the reserved bytes instead of rewriting the entire file.

So whichever way you use will give roughly similar results. What you can try is to test a custom copy function that reads and writes in bigger chunks than the default system copy, say 2MB or 5MB, which might improve performance. Ultimately, your disk I/O is the bottleneck here.
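
A minimal Python sketch of such a copy function is shown below. The function name, the ".tmp" suffix and the 4 MiB default are assumptions for the example, and whether a larger chunk size helps at all is something to measure on your own disk.

```python
import os
import shutil


def prepend_by_rewrite(path: str, prefix: bytes, chunk_size: int = 4 * 1024 * 1024) -> None:
    """Write `prefix` plus the old content to a temp file, then swap it into place."""
    tmp_path = path + ".tmp"  # hypothetical temp name; must live on the same filesystem
    with open(path, "rb") as src, open(tmp_path, "wb") as dst:
        dst.write(prefix)
        shutil.copyfileobj(src, dst, chunk_size)  # copies using chunk_size-byte reads
    os.replace(tmp_path, path)  # rename the new file over the original
```

This needs free space for a second full copy of the file while it runs, just like the cat-based approaches.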



Answer 7:

The absolute highest-performance way would seem to be to get down to the level of sectors and how the file is actually stored. I'm not sure whether the OS then becomes a factor, but the target platform might, so it would be useful for us to know what you run on.

I think this is a case where C is the obvious choice; this kind of low-level work is exactly what a systems programming language is for.

Can you tell us what you end up doing? It would be interesting to know.



Answer 8:

Here's the Windows command line ("DOS") way:

Put your single character into prepend.txt, then run:

copy /b prepend.txt + myHugeFile fileNameOfCombinedFile