Fast Search to see if a String Exists in Large Fil

2019-01-22 16:55发布

问题:

I have a FindFile routine in my program which will list files, but if the "Containing Text" field is filled in, then it should only list files containing that text.

If the "Containing Text" field is entered, then I search each file found for the text. My current method of doing that is:

  var
    FileContents: TStringlist;

  begin
    FileContents.LoadFromFile(Filepath);
    if Pos(TextToFind, FileContents.Text) = 0 then
      Found := false
    else 
      Found := true;

The above code is simple, and it generally works okay. But it has two problems:

  1. It fails for very large files (e.g. 300 MB)

  2. I feel it could be faster. It isn't bad, but why wait 10 minutes searching through 1000 files, if there might be a simple way to speed it up a bit?

I need this to work for Delphi 2009 and to search text files that may or may not be Unicode. It only needs to work for text files.

So how can I speed this search up and also make it work for very large files?


Bonus: I would also want to allow an "ignore case" option. That's a tougher one to make efficient. Any ideas?


Solution:

Well, mghie pointed out my earlier question How Can I Efficiently Read The First Few Lines of Many Files in Delphi, and as I answered, it was different and didn't provide the solution.

But he got me thinking that I had done this before and I had. I built a block reading routine for large files that breaks it into 32 MB blocks. I use that to read the input file of my program which can be huge. The routine works fine and fast. So step one is to do the same for these files I am looking through.

So now the question was how to efficiently search within those blocks. Well I did have a previous question on that topic: Is There An Efficient Whole Word Search Function in Delphi? and RRUZ pointed out the SearchBuf routine to me.

That solves the "bonus" as well, because SearchBuf has options which include Whole Word Search (the answer to that question) and MatchCase/noMatchCase (the answer to the bonus).

So I'm off and running. Thanks once again SO community.

回答1:

This is a problem connected with your previous question How Can I Efficiently Read The First Few Lines of Many Files in Delphi, and the same answers apply. If you don't read the files completely but in blocks then large files won't pose a problem. There's also a big speed-up to be had for files containing the text, in that you should cancel the search upon the first match. Currently you read the whole files even when the text to be found is in the first few lines.



回答2:

The best approach here is probably to use memory mapped files.

First you need a file handle, use the CreateFile windows API function for that.

Then pass that to CreateFileMapping to get a file mapping handle. Finally use MapViewOfFile to map the file into memory.

To handle large files, MapViewOfFile is able to map only a certain range into memory, so you can e.g. map the first 32MB, then use UnmapViewOfFile to unmap it followed by a MapViewOfFile for the next 32MB and so on. (EDIT: as was pointed out below, make sure that the blocks you map this way overlap by a multiple of 4kb, and at least as much as the length of the text you are searching for, so that you are not overlooking any text which might be split at the block boundary)

To do the actual searching once the (part of) the file is mapped into memory, you can make a copy of the source for StrPosLen from SysUtils.pas (it's unfortunately defined in the implementation section only and not exposed in the interface). Leave one copy as is and make another copy, replacing Wide with Ansi every time. Also, if you want to be able to search in binary files which might contain embedded #0's, you can remove the (Str1[I] <> #0) and part.

Either find a way to identify if a file is ANSI or Unicode, or simply call both the Ansi and Unicode version on each mapped part of the file.

Once you are done with each file, make sure to call CloseHandle first on the file mapping handle and then on the file handling. (And don't forget to call UnmapViewOfFile first).

EDIT:

A big advantage of using memory mapped files instead of using e.g. a TFileStream to read the file into memory in blocks is that the bytes will only end up in memory once.

Normally, on file access, first Windows reads the bytes into the OS file cache. Then copies them from there into the application memory.

If you use memory mapped files, the OS can directly map the physical pages from the OS file cache into the address space of the application without making another copy (reducing the time needed for making the copy and halfing memory usage).

Bonus Answer: By calling StrLIComp instead of StrLComp you can do a case insensitive search.



回答3:

If you are looking for text string searches, look for the Boyer-Moore search algorithm. It uses memory mapped files and a really fast search engine. The is some delphi units around that contain implementations of this algorithm.

To give you an idea of the speed - i currently search through 10-20MB files and it takes in the order of milliseconds.

Oh just read that it might be unicode - not sure if it supports that - but definately look down this path.



回答4:

May I suggest a component ? If yes I would recommend ATStreamSearch. It handles ANSI and UNICODE (and even EBCDIC and Korean and more).

Or the class TUTBMSearch from the JclUnicode (Jedi-jcl). It was mainly written by Mike Lischke (VirtualTreeview). It uses a tuned Boyer-Moore algo that ensure speed. The bad point in your case, is that is fully works in unicode (widestrings) so the trans-typing from String to Widestring risk to be penalizing.



回答5:

It depends on what kind of data yre you going to search with it, in order for you to achieve a real efficient results you will need to let your programm parse the interesting directories including all files in there, and keep the data in a database which you can access each time for a specific word in a specific list of files which can be generated up to the searching path. A Database statement can provide you results in milliseconds.

The Issue is that you will have to let it run and parse all files after the installation, which may take even more than 1 hour up to the amount of data you wish to parse.

This Database should be updated eachtime your programm starts, this can be done by comparing the MD5-Value of each file if it was changed, so you dont have to parse all your files each time.

If this way of working can be interesting if you have all your data in a constant place and you analyse data in the same files more than each time totally new files, some code analyser work like this and they are real efficient. So you invest some time on parsing and saving intresting data and you can jump to the exact place where a searching word appears and provide a list of all places it appears on in a very short time.



回答6:

If the files are to be searched multiple times, it could be a good idea to use a word index.

This is called "Full Text Search".

It will be slower the first time (text must be parsed and indexes must be created), but any future search will be immediate: in short, it will use only the indexes, and not read all text again.

You have the exact parser you need in The Delphi Magazine Issue 78, February 2002: "Algorithms Alfresco: Ask A Thousand Times Julian Bucknall discusses word indexing and document searches: if you want to know how Google works its magic this is the page to turn to."

There are several FTS implementation for Delphi:

  • Rubicon
  • Mutis
  • ColiGet
  • Google is your friend..

I'd like to add that most DB have an embedded FTS engine. SQLite3 even has a very small but efficient implementation, with page ranking and such. We provide direct access from Delphi, with ORM classes, to this Full Text Search engine, named FTS3/FTS4.