How to search pattern in big binary files efficien

2020-05-06 19:23发布

问题:

I have several binary files, which are mostly bigger than 10GB. In this files, I want to find patterns with Python, i.e. data between the pattern 0x01 0x02 0x03 and 0xF1 0xF2 0xF3.

My problem: I know how to handle binary data or how I use search algorithms, but due to the size of the files it is very inefficient to read the file completely first. That's why I thought it would be smart to read the file blockwise and search for the pattern inside a block.

My goal: I would like to have Python determine the positions (start and stop) of a found pattern. Is there a special algorithm or maybe even a Python library that I could use to solve the problem?

回答1:

The common way when searching a pattern in a large file is to read the file by chunks into a buffer that has the size of the read buffer + the size of the pattern - 1.

On first read, you only search the pattern in the read buffer, then you repeatedly copy size_of_pattern-1 chars from the end of the buffer to the beginning, read a new chunk after that and search in the whole buffer. That way, you are sure to find any occurence of the pattern, even if it starts in one chunk and ends in next.