How to delete parts from a binary file in C++

Posted 2019-03-02 13:31

Question:

I would like to delete parts from a binary file, using C++. The binary file is about 5-10 MB.

What I would like to do:

  1. Search for an ANSI string "something".
  2. Once I have found this string, I would like to delete the following n bytes, for example the following 1 MB of data. I would like to actually remove those bytes, not fill them with NULL, so that the file becomes smaller.
  3. I would like to save the modified data into a new binary file, which is the same as the original file except for the n bytes I have deleted.

Can you give me some advice / best practices on how to do this most efficiently? Should I load the file into memory first?

How can I search efficiently for an ANSI string? I mean, I may have to skip a few megabytes of data before I find that string. I have been told I should ask this in a separate question, so it is here: How to look for an ANSI string in a binary file?

How can I delete n bytes and write the result out to a new file efficiently?

OK, I don't need it to be super efficient; the file will not be bigger than 10 MB and it's OK if it runs for a few seconds.

Answer 1:

There are a number of fast string search routines (the Boyer-Moore family, for example) that perform much better than testing each and every character. For example, when trying to find "something", which is 9 characters long, in the best case only every 9th character needs to be tested.
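As a rough illustration of such a routine, here is a minimal sketch that uses the Boyer-Moore-Horspool searcher from the C++17 standard library. It assumes the file has already been read into a std::vector<char>; the function name find_pattern is made up for this example and is not taken from the linked answer.

```cpp
#include <algorithm>    // std::search
#include <cstddef>
#include <functional>   // std::boyer_moore_horspool_searcher (C++17)
#include <string_view>
#include <vector>

// Hypothetical helper: returns the offset of the first occurrence of
// `pattern` in `data`, or data.size() if it does not occur.
// It works on raw bytes, so NUL bytes inside `data` do not stop the search.
std::size_t find_pattern(const std::vector<char>& data, std::string_view pattern)
{
    auto it = std::search(
        data.begin(), data.end(),
        std::boyer_moore_horspool_searcher(pattern.begin(), pattern.end()));
    return static_cast<std::size_t>(it - data.begin());
}
```

The searcher precomputes a skip table from the pattern, which is what lets it jump ahead instead of looking at every byte. For a 5-10 MB file the difference from a naive scan is unlikely to matter much.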

Here's an example I wrote for an earlier question: code review: finding </body> tag reverse search on a non-null terminated char str



Answer 2:

For a 5-10 MB file I would have a look at writev() if your system supports it. Read the entire file into memory, since it is small enough. Scan for the bytes you want to drop. Pass writev() the list of iovecs (which will just be pointers into your read buffer and lengths), and then you can write the entire modified contents in a single system call.



Answer 3:

First, if I understand your meaning in your "How can I search efficiently" subsection, you cannot just skip a few megabytes of data in the search if the target string might be in those first few megabytes.

As for loading the file into memory, if you do that, don't forget to make sure you have enough space in memory for the entire file. You will be frustrated if you go to use your utility and find that the 2GB file you want to use it on can't fit in the 1.5GB of memory you have left.

For the following, I am going to assume you will load the file into memory or memory-map it.

You did specifically say this was a binary file, so this means that you cannot use the normal C-style string searching/matching functions (strstr and friends), as the null characters in the file's data will confuse them (the search ends prematurely without a match). You might instead be able to use memchr to find the first occurrence of the first byte in your target, and memcmp to compare the bytes starting there with the bytes in the target; keep using memchr/memcmp pairs to scan through the entire thing until found. This is not the most efficient way, as there are better pattern-matching algorithms, but it is reasonably efficient, I suppose.
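A minimal sketch of the memchr/memcmp loop described above. The function name find_bytes and its parameters are made up for illustration, and it assumes the file contents are already in a contiguous buffer.

```cpp
#include <cstddef>
#include <cstring>  // std::memchr, std::memcmp

// Hypothetical helper: returns a pointer to the first occurrence of the
// `target_len` bytes at `target` inside [data, data + data_len),
// or nullptr if they do not occur.
const char* find_bytes(const char* data, std::size_t data_len,
                       const char* target, std::size_t target_len)
{
    if (target_len == 0 || target_len > data_len) return nullptr;

    const char* pos = data;
    // Last position at which a full match could still begin.
    const char* last_start = data + (data_len - target_len);

    while (pos <= last_start) {
        // Jump to the next byte equal to the first byte of the target.
        pos = static_cast<const char*>(std::memchr(
            pos, target[0], static_cast<std::size_t>(last_start - pos) + 1));
        if (pos == nullptr) return nullptr;

        // Candidate found: compare the whole target, NUL bytes included.
        if (std::memcmp(pos, target, target_len) == 0) return pos;
        ++pos;
    }
    return nullptr;
}
```

Because lengths are passed explicitly, null bytes in the data are treated like any other byte. The offset of the match is the returned pointer minus data.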

To "delete" n bytes you have to actually move the data after those n bytes, copying the entire thing up to the new location.
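If the data sits in a std::vector<char>, the move can be left to the container. This is only a sketch under that assumption; erase_bytes is a made-up name.

```cpp
#include <algorithm>  // std::min
#include <cstddef>
#include <vector>

// Hypothetical helper: removes up to `n` bytes starting at offset `pos`.
// vector::erase shifts the bytes after the gap down and shrinks the vector.
void erase_bytes(std::vector<char>& data, std::size_t pos, std::size_t n)
{
    if (pos >= data.size()) return;       // nothing to delete
    n = std::min(n, data.size() - pos);   // do not run past the end
    data.erase(data.begin() + static_cast<std::ptrdiff_t>(pos),
               data.begin() + static_cast<std::ptrdiff_t>(pos + n));
}
```

After the call, the vector holds exactly the bytes that should go into the new file.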

If you actually copy the data from disk to memory, then it'd be faster to manipulate it there and write it to the new file. Otherwise, once you find the spot on the disk you want to start deleting from, you can open a new file for writing. Read X bytes from the first file, where X is the file position at which the deletion starts, and write them straight into the second file. Then seek to X+n in the first file and copy from there to the end of file1, appending that to what you've already put into file2.
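The file-to-file variant could look roughly like this. It is a sketch only: the function name copy_without_range, the 64 KB chunk size, and the parameter names x and n are assumptions made for the example, and error handling is kept to a minimum.

```cpp
#include <algorithm>  // std::min
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper: copies `src_path` to `dst_path`, leaving out `n` bytes
// starting at offset `x`. Returns false if a file cannot be opened, the
// source is shorter than `x`, or writing fails.
bool copy_without_range(const std::string& src_path, const std::string& dst_path,
                        std::streamoff x, std::streamoff n)
{
    std::ifstream in(src_path, std::ios::binary);
    std::ofstream out(dst_path, std::ios::binary | std::ios::trunc);
    if (!in || !out) return false;

    std::vector<char> buf(64 * 1024);  // chunk size is an arbitrary choice

    // 1. Copy the first x bytes unchanged.
    std::streamoff remaining = x;
    while (remaining > 0) {
        auto want = static_cast<std::streamsize>(
            std::min<std::streamoff>(remaining, static_cast<std::streamoff>(buf.size())));
        in.read(buf.data(), want);
        std::streamsize got = in.gcount();
        if (got <= 0) return false;    // source ended before offset x
        out.write(buf.data(), got);
        remaining -= got;
    }

    // 2. Skip the n bytes that are being deleted.
    in.seekg(x + n, std::ios::beg);

    // 3. Copy everything from x + n to the end of the source file.
    while (in) {
        in.read(buf.data(), static_cast<std::streamsize>(buf.size()));
        out.write(buf.data(), in.gcount());
    }
    out.flush();
    return static_cast<bool>(out);
}
```

Both streams are opened in binary mode so that no newline translation changes the byte offsets.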