c# Fastest string search in all files

Posted 2020-04-07 19:42

Problem (see the Edit below for clarifications)

I have a list of about 1500 strings and for each of these strings I have to check if any of the files in a directory (and subdirectories) contains that string (there are about 4000 files).

Code

What I have now are these two variants:

Original

foreach(var str in stringList)
{
    allFiles.Any(f => File.ReadAllText(f).Contains(str));
}

Second variant (using ReadLines instead of ReadAllText, as suggested by VladL in this question)

foreach(var str in stringList)
{
    allFiles.SelectMany(File.ReadLines).Any(line => line.Contains(str));
}

I only tested the complete program execution with the original variant, and it took 21 minutes to finish. I then timed a single statement (checking whether one string is contained in any file), searching for a string that I knew was not contained anywhere, to measure the worst-case scenario. These are my timings (each run 3 times):

Original: 1285, 1369, 1336 ms

Second variant: 2718, 2804, 2831 ms

I also tried to replace ReadAllText with ReadAllLines in the original statement (without changing anything else), but there was no performance change.

Question

Is there any faster way for checking if a string is contained in any file (large amount of large files)?

Edit

I admit I didn't express myself as clearly as I wanted by saying I have a list of strings. What I actually have is a list of CSV files; I iterate through those and then through each line of these files (ignoring the header line). From each line I compose a string out of some of its fields, and then check whether any file contains that string.

foreach(var csvFile in csvFiles)
{
    var lines = File.ReadAllLines(csvFile);
    foreach(var line in lines)
    {
        if (IsHeader(line)) continue;
        var str = ComposeString(line);
        var found = allFiles.Any(f => File.ReadAllText(f).Contains(str));
        // do stuff with the line and the result
    }
}
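
For clarity, IsHeader and ComposeString behave roughly like the sketch below (the header constant is only an assumption based on the column comments in Edit 2; the real composition is shown there):

// Sketch only: HeaderLine's value is assumed from the column comments in Edit 2
private const string HeaderLine = "Group;Type;Key;Global;De;Fr;It;En";

private static bool IsHeader(string line) => line == HeaderLine;

private static string ComposeString(string line)
{
    // Compose the lookup string from the first CSV fields, e.g. "Group.Type.Key" (see Edit 2)
    var res = line.Split(';');
    return $"{res[0]}.{res[1]}.{res[2]}".Trim('.');
}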

Edit 2

public void ExecuteAhoCorasick()
{
    var table = CreateDataTable();
    var allFiles = GetAllFiles();
    var csvFiles = GetCsvFiles();
    var resList = new List<string>();

    foreach (var file in csvFiles)
    {
        if (file.Contains("ValueList_")) continue;
        var lines = File.ReadAllLines(file);
        foreach (var line in lines)
        {
            if (line == HeaderLine) continue;
            var res = line.Split(';');
            if (res.Length <= 7) continue;
            var resPath = $"{res[0]}.{res[1]}.{res[2]}".Trim('.');
            resList.Add(resPath);

            var row = table.NewRow();
            row[0] = res[0]; // Group
            row[1] = res[1]; // Type
            row[2] = res[2]; // Key
            row[3] = res[3]; // Global
            row[4] = res[4]; // De
            row[5] = res[5]; // Fr
            row[6] = res[6]; // It
            row[7] = res[7]; // En
            row[8] = resPath; // Resource Path
            row[9] = false;
            row[10] = ""; // Comment
            row[11] = file; // File Path

            table.Rows.Add(row);
        }
    }

    var foundRes = new List<string>();

    // Build the trie once from all resource paths, then reuse it for every file.
    var trie = new Trie();
    trie.Add(resList);
    trie.Build();

    foreach (var file in allFiles)
    {
        // var chars = File.ReadLines(file).SelectMany(line => line);
        var text = File.ReadAllText(file);

        foundRes.AddRange(trie.Find(text));
        // foundRes.AddRange(trie.Find(chars));
    }

    // update row[9] to true foreach res in foundRes
}

5 Answers
Summer. ? 凉城 · 2020-04-07 20:21

I think the fastest way to do this will be to:

  1. Read each file completely into memory. This simplifies the code.
  2. Use the Aho-Corasick algorithm to search for the keywords in the text for each file.

An implementation of Aho-Corasick is available here.

I've written a simple program using that implementation from GitHub that tests the worst-case performance (that is, when none of the keywords are present in the text) to compare Aho-Corasick with Contains():

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Text;
using ConsoleApp1;

namespace Demo
{
    class Program
    {
        static void Main()
        {
            string[] needles = createNeedles(1500).ToArray();
            string haystack = createHaystack(100000);

            var sw = Stopwatch.StartNew();
            anyViaContains(needles, haystack);
            Console.WriteLine("anyViaContains() took " + sw.Elapsed);

            sw.Restart();
            anyViaAhoCorasick(needles, haystack);
            Console.WriteLine("anyViaAhoCorasick() took " + sw.Elapsed);
        }

        static bool anyViaContains(string[] needles, string haystack)
        {
            return needles.Any(haystack.Contains);
        }

        static bool anyViaAhoCorasick(string[] needles, string haystack)
        {
            var trie = new Trie();
            trie.Add(needles);
            trie.Build();
            return trie.Find(haystack).Any();
        }

        static IEnumerable<string> createNeedles(int n)
        {
            for (int i = 0; i < n; ++i)
                yield return i + "." + i + "." + i;
        }

        static string createHaystack(int n)
        {
            var sb = new StringBuilder();

            for (int i = 0; i < n; ++i)
                sb.AppendLine("Sample Text Sample Text Sample Text Sample Text Sample Text Sample Text Sample Text Sample Text");

            return sb.ToString();
        }
    }
}

The results I get for a 64-bit RELEASE build (run outside the debugger) are as follows:

anyViaContains() took 00:00:09.8216836

anyViaAhoCorasick() took 00:00:00.4002765

For this test case, it appears that Aho-Corasick is around 25 times faster than using Contains(). However, this is a somewhat artificial test case and your actual results may vary; you should measure with your actual data to see if it really does help.

Note that you can actually avoid loading the entire file into memory when using the Aho-Corasick implementation, because its Find() method accepts an IEnumerable<char>.

You can turn the contents of a file into an IEnumerable<char> as follows:

var chars = File.ReadLines(filename).SelectMany(line => line);

That actually removes all the newline characters, which is probably OK for your application. If you wanted to keep the newline characters, you'd have to put them back like so:

IEnumerable<char> newline = Enumerable.Repeat('\n', 1);
var chars = File.ReadLines(filename).SelectMany(line => Enumerable.Concat(line, newline));

It would be worth comparing loading each file completely into memory with enumerating the chars in each file (as above) to see if there is any difference. For very large files, it may be important to avoid loading their entire contents into memory.
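
A rough way to compare the two could look like the sketch below (it assumes the same Trie type as above; the file name and keyword array are placeholders for your own data):

// Comparison sketch (assumes the same Trie implementation as above).
// Requires: using System; using System.Collections.Generic;
//           using System.Diagnostics; using System.IO; using System.Linq;
static void CompareFileSearch(string filename, string[] keywords)
{
    var trie = new Trie();
    trie.Add(keywords);
    trie.Build();

    var sw = Stopwatch.StartNew();
    bool foundInMemory = trie.Find(File.ReadAllText(filename)).Any();
    Console.WriteLine("ReadAllText:         " + sw.Elapsed + " (found: " + foundInMemory + ")");

    sw.Restart();
    IEnumerable<char> chars = File.ReadLines(filename).SelectMany(line => line);
    bool foundStreaming = trie.Find(chars).Any();
    Console.WriteLine("ReadLines streaming: " + sw.Elapsed + " (found: " + foundStreaming + ")");
}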

The star\"
3楼-- · 2020-04-07 20:23

I recently came across a problem similar to yours. Represent each searchable file as follows:

public class SearchableFile {
    private readonly HashSet<string> _uniqueLines;
    //private readonly HashSet<string> _uniqueString;

    public string FilePath { get; }

    public SearchableFile(string filePath) {
        _uniqueLines = new HashSet<string>(File.ReadAllLines(filePath));
        //↑ You can also split each line into words if you have many repeating words per line:
        //_uniqueString = new HashSet<string>(File.ReadAllLines(filePath).SelectMany(singleLine => singleLine.Split(' ')));
        FilePath = filePath;
    }

    public bool ContainsCompositeString(string compositeString) {
        return _uniqueLines.Any(singleLine => singleLine.Contains(compositeString));
        //return _uniqueString.Contains(compositeString);
    }
}

Then you can use it like this:

    private static readonly List<SearchableFile> FilesOnHdd = new List<SearchableFile>();

    private static void Main(string[] args) {
        var filePaths = new List<string> { "c://temp.txt" };

        foreach (var filePath in filePaths) {
            FilesOnHdd.Add(new SearchableFile(filePath));
        }
        var csvFiles = new List<string> { "c://temp.csv" };
        foreach (var csvFile in csvFiles) {
            var lines = File.ReadAllLines(csvFile);
            foreach (var line in lines) {
                if (IsHeader(line)) {
                    continue;
                }
                var str = ComposeString(line);

                foreach (var singleFileOnHdd in FilesOnHdd) {
                    var result = singleFileOnHdd.ContainsCompositeString(str);
                    if (result) {
                        // do stuff with the line and bool
                    }
                }
            }
        }
    }
甜甜的少女心 · 2020-04-07 20:32

Does a file contain any of the strings?

private static bool ContainsLine(string file, List<string> wordsToFind) {
  return File
    .ReadLines(file)
    .Any(line => wordsToFind.Any(word => line.Contains(word))); 
}

Is there any file which contains any of the strings?

bool result = allFiles
  .AsParallel() // worth trying: we have a lot of files to process
  .Any(file => ContainsLine(file, stringList));

Edit: .AsParallel() is often worth trying for problems like this (many files to test); however, in case AsParallel() doesn't bring any gain, just comment it out:

bool result = allFiles
  //.AsParallel() // comment out in case of no gain
  .Any(file => ContainsLine(file, stringList));
家丑人穷心不美 · 2020-04-07 20:33

You are reading all the files for each string.

How about doing it the other way around? Loop through all the files just once:

bool exists =
    allFiles.SelectMany(File.ReadLines).Any(l => stringList.Any(str => l.Contains(str)));

After the OP's edit:

As you've mentioned in the comment, you should first accumulate all the strings from the CSV files and then continue as suggested:

var stringList =
    csvFiles.SelectMany(f => File.ReadAllLines(f).Where(l => !IsHeader(l)).Select(ComposeString))
            .ToList();

You might also want to add a .Distinct() in case some of these strings are not unique, to make it faster. Whether that helps depends on the size of the list and how many strings are actually repeated.

var stringList =
    csvFiles.SelectMany(f => File.ReadAllLines(f).Where(l => !IsHeader(l)).Select(ComposeString))
            .Distinct()
            .ToList();
仙女界的扛把子 · 2020-04-07 20:37

This strongly depends on your exact use case. If you are trying to match whole words that can be discriminated easily, you could build some sort of hashed index (e.g. a Dictionary<string, WhatEver>) that you can search quickly. Depending on the size, though, this can be very RAM intensive.

The following code gives an idea of how to structure this:

class FileReference
{
    // elided (e.g. constructor setting File)

    private readonly List<int> _indices = new List<int>();

    public string File { get; } // may be set in constructor
    public IEnumerable<int> Indices => _indices; // exposes the contents of _indices

    public void Add(int index)
    {
        _indices.Add(index);
    }
}

class ReferenceIndex
{
    Dictionary<string, FileReference> _fileReferences = new Dictionary<string, FileReference>();

    public void Add(string fileName, int index)
    {
        if(!_fileReferences.ContainsKey(fileName))
        {
            _fileReferences.Add(fileName, new FileReference(fileName));
        }
        _fileReferences[fileName].Add(index);
    }

    // elided
}

FileReference tracks the indices of a string within a single file; ReferenceIndex holds the FileReferences for a single string. Since Dictionary<TKey, TValue> is hash-based, access to it is blazing fast. You can use these classes to build up a Dictionary<string, ReferenceIndex> that keeps track of all strings in the files, together with the file references to these strings:

Dictionary<string, ReferenceIndex> stringIndex = BuildIndex(fileName);
foreach(var s in searchStrings)
{
    if(stringIndex.ContainsKey(s))
    {
        // do something
    }
}
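
BuildIndex is not shown above; a minimal sketch, assuming whole-word tokenization on a few simple delimiters (both the split characters and the word-position counter are assumptions to adapt to your data), could look like this:

// Sketch of BuildIndex: tokenizes one file into words and records, per word,
// the file it came from and the word positions at which it occurs.
// Requires: using System; using System.Collections.Generic; using System.IO;
static Dictionary<string, ReferenceIndex> BuildIndex(string fileName)
{
    var stringIndex = new Dictionary<string, ReferenceIndex>();
    var position = 0;

    foreach (var line in File.ReadLines(fileName))
    {
        var words = line.Split(new[] { ' ', '\t', ';', ',' }, StringSplitOptions.RemoveEmptyEntries);
        foreach (var word in words)
        {
            if (!stringIndex.TryGetValue(word, out var referenceIndex))
            {
                referenceIndex = new ReferenceIndex();
                stringIndex.Add(word, referenceIndex);
            }
            referenceIndex.Add(fileName, position);
            position++;
        }
    }

    return stringIndex;
}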