I've got a large text file (~20K lines, ~80 characters per line). I've also got a largish array (~1500 items) of objects containing patterns I wish to remove from the large text file. Note, if the pattern from the array appears on a line in the input file, I wish to remove the entire line, not just the pattern.
The input file is CSVish with lines similar to:
A;AAA-BBB;XXX;XX000029;WORD;WORD-WORD-1;00001;STRING;2015-07-01;;010;
The patterns in the array, which I search for in each line of the input file, resemble the
XX000029
part of the line above.
My somewhat naïve function to achieve this goal looks like this currently:
function Remove-IdsFromFile {
    param(
        [Parameter(Mandatory=$true,Position=0)]
        [string]$BigFile,
        [Parameter(Mandatory=$true,Position=1)]
        [Object[]]$IgnorePatterns
    )
    try{
        $FileContent = Get-Content $BigFile
    }catch{
        Write-Error $_
    }
    $IgnorePatterns | ForEach-Object {
        $IgnoreId = $_.IgnoreId
        $FileContent = $FileContent | Where-Object { $_ -notmatch $IgnoreId }
        Write-Host $FileContent.count
    }
    $FileContent | Set-Content "CleansedBigFile.txt"
}
This works, but is slow.
How can I make it quicker?
StreamReader is one of the preferred ways to read large text files. We also use a regex: the ignore patterns are joined into a single pattern string to match against. Each pattern is passed through [regex]::Escape() as a precaution in case regex control characters are present. I have to guess here, since we only see one pattern string.

If $IgnorePatterns can easily be cast as strings, this should work in place just fine. A small sample of what $regex looks like would be an alternation such as XX000027|XX000028|XX000029.

If $IgnorePatterns is populated from a database you might have less control over this, but since we are using regex you might be able to reduce that pattern set by actually using regex features (instead of just one big alternation), as in the sample above. You could reduce it to XX00002[7-9], for instance.

I don't know if the regex itself will provide a performance boost with 1500 possibilities. The StreamReader is supposed to be the focus here. However, I did muddy the waters by using Add-Content for the output, which does not win any awards for speed either (a stream writer could be used in its place).

Reader and Writer
I still have to test this to be sure it works, but this just uses StreamReader and StreamWriter. If it does work better I am just going to replace the above code. You might need some error prevention in there for the streams, but it does appear to work in place.
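For reference, a minimal sketch of the reader/writer approach described above might look like the following. It assumes each ignore object exposes an IgnoreId property (as in the question's function). The -OutFile parameter, the path-resolution step and the try/finally cleanup are my own additions for illustration and are not part of the original function. Note also that StreamWriter writes UTF-8 by default, which may differ from what Set-Content produces on your system.

```powershell
function Remove-IdsFromFile {
    param(
        [Parameter(Mandatory=$true,Position=0)]
        [string]$BigFile,
        [Parameter(Mandatory=$true,Position=1)]
        [Object[]]$IgnorePatterns,
        # Hypothetical parameter (not in the original); defaults to the original output name
        [string]$OutFile = "CleansedBigFile.txt"
    )

    if (-not (Test-Path -LiteralPath $BigFile)) {
        Write-Error "Cannot locate: $BigFile"
        return
    }

    # Build one alternation of escaped literals, e.g. XX000027|XX000028|XX000029.
    # Assumption: every object carries its pattern in an IgnoreId property.
    $pattern = ($IgnorePatterns | ForEach-Object { [regex]::Escape([string]$_.IgnoreId) }) -join '|'

    # .NET resolves relative paths against the process directory, not the
    # PowerShell location, so convert both paths to full provider paths first.
    $inPath  = (Resolve-Path -LiteralPath $BigFile).ProviderPath
    $outPath = $ExecutionContext.SessionState.Path.GetUnresolvedProviderPathFromPSPath($OutFile)

    $reader = $null
    $writer = $null
    try {
        $reader = New-Object System.IO.StreamReader($inPath)
        $writer = New-Object System.IO.StreamWriter($outPath)

        # Single pass: each line is tested once against the combined pattern
        while ($null -ne ($line = $reader.ReadLine())) {
            if ($line -notmatch $pattern) { $writer.WriteLine($line) }
        }
    }
    finally {
        # Close the streams even if reading or writing throws
        if ($writer) { $writer.Dispose() }
        if ($reader) { $reader.Dispose() }
    }
}
```

It would be called the same way as the original, for example: Remove-IdsFromFile -BigFile .\big.txt -IgnorePatterns $ids (both names here are placeholders). The main difference from the question's version is that the file is scanned once rather than once per pattern, and lines are written as they are read instead of being held in memory.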