Problem (Check edit for clarifications)
I have a list of about 1500 strings and for each of these strings I have to check if any of the files in a directory (and subdirectories) contains that string (there are about 4000 files).
Code
What I have now are these two variants:
Original
foreach (var str in stringList)
{
    allFiles.Any(f => File.ReadAllText(f).Contains(str));
}
Second variant (using ReadLines instead of ReadAllText, as suggested by VladL in this question)
foreach (var str in stringList)
{
    allFiles.SelectMany(File.ReadLines).Any(line => line.Contains(str));
}
I only tested the complete program execution with the original variant, and it took 21 minutes to finish. I then tested a single statement (checking whether 1 string is contained in any file), searching for a string that I knew wasn't contained in any of them, to check the worst-case scenario. These are my timings (each executed 3 times):
Original: 1285, 1369, 1336 ms
Second variant: 2718, 2804, 2831 ms
I also tried replacing ReadAllText with ReadAllLines in the original statement (without changing anything else), but there was no change in performance.
Question
Is there any faster way to check whether a string is contained in any file (a large number of large files)?
Edit
I admit I didn't express myself as clearly as I wanted by saying I have a list of strings. What I actually have is a list of CSV files; I iterate through those and then through each line of these files (ignoring the header line). From each line I compose a string out of some of its fields, and then check whether any file contains that string.
foreach (var csvFile in csvFiles)
{
    var lines = File.ReadAllLines(csvFile);
    foreach (var line in lines)
    {
        if (IsHeader(line)) continue;
        var str = ComposeString(line);
        var found = allFiles.Any(f => File.ReadAllText(f).Contains(str));
        // do stuff with the line and found
    }
}
Edit 2
This is how I implemented the suggested Aho-Corasick approach:
public void ExecuteAhoCorasick()
{
    var table = CreateDataTable();
    var allFiles = GetAllFiles();
    var csvFiles = GetCsvFiles();
    var resList = new List<string>();
    foreach (var csvFile in csvFiles)
    {
        if (csvFile.Contains("ValueList_")) continue;
        var lines = File.ReadAllLines(csvFile);
        foreach (var line in lines)
        {
            if (line == HeaderLine) continue;
            var res = line.Split(';');
            if (res.Length <= 7) continue;
            var resPath = $"{res[0]}.{res[1]}.{res[2]}".Trim('.');
            resList.Add(resPath);
            var row = table.NewRow();
            row[0] = res[0]; // Group
            row[1] = res[1]; // Type
            row[2] = res[2]; // Key
            row[3] = res[3]; // Global
            row[4] = res[4]; // De
            row[5] = res[5]; // Fr
            row[6] = res[6]; // It
            row[7] = res[7]; // En
            row[8] = resPath; // Resource Path
            row[9] = false;
            row[10] = ""; // Comment
            row[11] = csvFile; // File Path
            table.Rows.Add(row);
        }
    }
    var foundRes = new List<string>();
    foreach (var file in allFiles)
    {
        // var chars = File.ReadLines(file).SelectMany(line => line);
        var text = File.ReadAllText(file);
        var trie = new Trie();
        trie.Add(resList);
        foundRes.AddRange(trie.Find(text));
        // foundRes.AddRange(trie.Find(chars));
    }
    // update row[9] to true foreach res in foundRes
}
Answer 1 (Aho-Corasick)
I think the fastest way to do this will be to read each file only once and search its text for all of the strings at the same time, using the Aho-Corasick string matching algorithm, rather than reading every file again for each string.
An implementation of Aho-Corasick is available on GitHub.
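As a rough sketch of that idea (not code from the answer), using the Trie type with the Add and Find methods that appear in Edit 2 above, where stringList and allFiles are the collections from the question:

var trie = new Trie();
trie.Add(stringList);                  // build the trie once from all ~1500 strings

var foundStrings = new HashSet<string>();
foreach (var file in allFiles)
{
    var text = File.ReadAllText(file); // each file is read exactly once
    foreach (var match in trie.Find(text))
        foundStrings.Add(match);       // collects every string that occurs in at least one file
}

Building the trie once and scanning every file a single time is what makes this approach fast.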
I've written a simple program using that implementation to test the worst-case performance (that is, when none of the keywords are present in the text) and compare Aho-Corasick with Contains(). For a 64-bit RELEASE build (run outside the debugger), the results I get suggest that, for this test case, Aho-Corasick is around 25 times faster than using Contains(). However, this is a somewhat artificial test case and your actual results may vary; you should measure with your actual data to see if it really does help.
Note that you can actually avoid loading the entire file into memory when using the Aho-Corasick implementation, because its Find() method accepts an IEnumerable<char>. You can turn the contents of a file into an IEnumerable<char> as shown in the sketch below. That approach removes all the newline characters, which is probably OK for your application; if you want to keep the newline characters, you would have to put them back, as the second helper in the sketch does.
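The helper code itself isn't reproduced above; a minimal sketch, in line with the commented-out File.ReadLines call in Edit 2, could be:

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

static class FileCharSource
{
    // Streams the characters of a file lazily; newline characters are dropped.
    public static IEnumerable<char> FileChars(string path) =>
        File.ReadLines(path).SelectMany(line => line);

    // Variant that re-inserts a newline after each line, in case line breaks
    // matter for the strings being searched.
    public static IEnumerable<char> FileCharsWithNewlines(string path) =>
        File.ReadLines(path).SelectMany(line => line.Concat(Environment.NewLine));
}

You would then pass FileCharSource.FileChars(file) to trie.Find() instead of the string returned by File.ReadAllText(file).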
It would be worth comparing loading each file completely into memory with enumerating the chars in each file (as above) to see if there is any difference. For very large files, it may be important to avoid loading their entire contents into memory.
Answer 2 (searchable file objects and AsParallel)
I just recently came across a problem similar to yours. Represent each searchable file as an object that exposes a method to check whether it contains any of the strings (a sketch of one possible shape follows below). Then you can use it as it is to answer both questions: does a given file contain any of the strings, and is there any file that contains any of the strings?
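Since the answer's code isn't shown above, here is one possible shape; the SearchableFile class, its Lazy<string> field, and the ContainsAny method are my assumptions:

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public sealed class SearchableFile
{
    private readonly Lazy<string> text;

    public SearchableFile(string path)
    {
        Path = path;
        text = new Lazy<string>(() => File.ReadAllText(path));
    }

    public string Path { get; }

    // True if the file's text contains any of the given strings.
    public bool ContainsAny(IEnumerable<string> strings) =>
        strings.Any(s => text.Value.Contains(s));
}

public static class Demo
{
    public static void Run(IReadOnlyList<string> allFiles, IReadOnlyList<string> stringList)
    {
        var searchableFiles = allFiles.Select(f => new SearchableFile(f)).ToList();

        // Does a given file contain any of the strings?
        bool firstFileHasAny = searchableFiles[0].ContainsAny(stringList);

        // Is there any file that contains any of the strings?
        // .AsParallel() spreads the work over several cores; comment it out if it brings no gain.
        bool anyFileHasAny = searchableFiles.AsParallel().Any(f => f.ContainsAny(stringList));

        Console.WriteLine($"{firstFileHasAny}, {anyFileHasAny}");
    }
}

The Lazy<string> ensures each file is read from disk at most once, no matter how many strings it is tested against.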
Edit: Often .AsParallel() is worth trying for problems like this (many files to test); however, in case AsParallel() doesn't bring any gain, just comment it out (see the usage line in the sketch above).
Answer 3 (loop over the files instead of the strings)
You are reading all of the files for each string.
How about doing it the other way around and looping once through all the files?
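The answer's original code isn't shown above; a minimal sketch of the inverted loop (the variable names are illustrative) could look like this:

var foundStrings = new HashSet<string>();
foreach (var file in allFiles)
{
    var text = File.ReadAllText(file); // each file is read only once
    foreach (var str in stringList)
    {
        if (foundStrings.Contains(str)) continue; // already found in an earlier file
        if (text.Contains(str)) foundStrings.Add(str);
    }
}
// foundStrings now contains every string that occurs in at least one file.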
After the OP edit:
As you've mentioned in the comment, you should first accumulate all the strings from the CSV files and then continue as suggested.
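The answer's original code isn't reproduced above; a sketch of that two-step version, reusing the IsHeader and ComposeString helpers from the question's first edit, might look like this:

// Step 1: accumulate all the strings from the CSV files.
var stringList = new List<string>();
foreach (var csvFile in csvFiles)
{
    foreach (var line in File.ReadLines(csvFile))
    {
        if (IsHeader(line)) continue;
        stringList.Add(ComposeString(line));
    }
}
var searchStrings = stringList.Distinct().ToList(); // optional de-duplication, see below

// Step 2: read each file once and test it against the accumulated strings.
var foundStrings = new HashSet<string>();
foreach (var file in allFiles)
{
    var text = File.ReadAllText(file);
    foundStrings.UnionWith(searchStrings.Where(s => text.Contains(s)));
}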
You might also want to use .Distinct() in case some of these strings are not unique, to make it faster (as in step 1 of the sketch above); but that depends on the size of the list and how many strings are indeed repeated.
Answer 4 (hashed index)
This strongly depends on your exact use case. If you are trying to match whole words that can be discriminated easily, you could build some sort of hashed index (e.g. a Dictionary<string, WhatEver>) that you can search easily. Anyway, depending on the size of the data, this can be very RAM intensive. The following sketch gives an idea of how to structure this: FileReference tracks the indices of a string within a single file, and ReferenceIndex holds the FileReferences for a single string. Because Dictionary<TKey, TValue> is hashed, accessing it is blazing fast. You can use these classes to build up a Dictionary<string, ReferenceIndex> that keeps track of all strings in the files and the file references to those strings.
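The answer's classes aren't reproduced above; one possible shape, with everything beyond the names FileReference, ReferenceIndex and Dictionary<string, ReferenceIndex> being an assumption, is:

using System.Collections.Generic;

// Tracks the indices at which one string occurs within a single file.
public sealed class FileReference
{
    public FileReference(string filePath) => FilePath = filePath;

    public string FilePath { get; }
    public List<int> Indices { get; } = new List<int>();
}

// Holds the FileReferences for a single string, keyed by file path.
public sealed class ReferenceIndex
{
    private readonly Dictionary<string, FileReference> files =
        new Dictionary<string, FileReference>();

    public void Add(string filePath, int index)
    {
        if (!files.TryGetValue(filePath, out var reference))
        {
            reference = new FileReference(filePath);
            files.Add(filePath, reference);
        }
        reference.Indices.Add(index);
    }

    public IEnumerable<FileReference> Files => files.Values;
}

// The overall index would then be a Dictionary<string, ReferenceIndex>,
// mapping each searched string to the files (and positions) where it occurs.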