Is there a way to optimise my PowerShell function

Published 2019-07-01 17:55

Question:

I've got a large text file (~20K lines, ~80 characters per line). I've also got a largish array (~1500 items) of objects containing patterns I wish to remove from the large text file. Note, if the pattern from the array appears on a line in the input file, I wish to remove the entire line, not just the pattern.

The input file is CSVish with lines similar to:

A;AAA-BBB;XXX;XX000029;WORD;WORD-WORD-1;00001;STRING;2015-07-01;;010;   

The patterns in the array, which I search for on each line of the input file, resemble the

XX000029

part of the line above.

My somewhat naïve function to achieve this goal looks like this currently:

function Remove-IdsFromFile {
  param(
    [Parameter(Mandatory=$true,Position=0)]
    [string]$BigFile,
    [Parameter(Mandatory=$true,Position=1)]
    [Object[]]$IgnorePatterns
  )

  try{
    $FileContent = Get-Content $BigFile
  }catch{
    Write-Error $_
  }

  $IgnorePatterns | ForEach-Object {
    $IgnoreId = $_.IgnoreId
    $FileContent = $FileContent | Where-Object { $_ -notmatch $IgnoreId }
    Write-Host $FileContent.count
  }
  $FileContent | Set-Content "CleansedBigFile.txt"
}

This works, but is slow.

How can I make it quicker?

Answer 1:

function Remove-IdsFromFile {
    param(
        [Parameter(Mandatory=$true,Position=0)]
        [string]$BigFile,
        [Parameter(Mandatory=$true,Position=1)]
        [Object[]]$IgnorePatterns
    )

    # Create the pattern matches
    $regex = ($IgnorePatterns | ForEach-Object{[regex]::Escape($_)}) -join "|"

    If(Test-Path $BigFile){
        $reader = New-Object System.IO.StreamReader($BigFile)

        $line = $reader.ReadLine()
        while ($line -ne $null)
        {
            # Check if the line should be output to file
            If($line -notmatch $regex){$line | Add-Content "CleansedBigFile.txt"}

            # Attempt to read the next line.
            $line = $reader.ReadLine()
        }

        $reader.Close()

    } Else {
        Write-Error "Cannot locate: $BigFile"
    }
}

StreamReader is one of the preferred methods for reading large text files. We also build a single regex alternation to match each line against. When building that pattern string we run each pattern through [regex]::Escape() as a precaution, in case regex metacharacters are present; we have to guess here, since only one sample pattern string is shown.

If $IgnorePatterns can easily be cast as strings, this should work in place just fine. A small sample of what $regex looks like would be:

XX000029|XX000028|XX000027

If $IgnorePatterns is populated from a database you might have less control over it, but since we are matching with regex you might be able to shrink the pattern set by using real regex constructs instead of one big alternation, as in my example above. You could reduce that sample to XX00002[7-9], for instance.
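As a quick illustration (the pattern values here are made up), the escape-and-join step behaves like this, and any line containing one of the patterns fails `-notmatch` and would be removed:

```powershell
# Hypothetical pattern values; note how Escape protects the "." and "+"
$IgnorePatterns = 'XX000029', 'AB.123', 'C+D'
$regex = ($IgnorePatterns | ForEach-Object { [regex]::Escape($_) }) -join '|'
$regex                                   # XX000029|AB\.123|C\+D

# A line containing any of the patterns is rejected by -notmatch
'A;AB.123;XXX;line' -notmatch $regex     # False (line would be removed)
'A;ZZZ;XXX;line'    -notmatch $regex     # True  (line would be kept)
```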

I don't know if the regex itself will provide a performance boost with 1500 alternatives. The StreamReader is supposed to be the focus here. However, I did muddy the waters by using Add-Content for the output, which does not win any awards for speed either (a StreamWriter could be used in its place).

Reader and Writer

I still have to test this to be sure it works, but it just uses a StreamReader and a StreamWriter. If it does work better, I will replace the code above with it.

function Remove-IdsFromFile {
    param(
        [Parameter(Mandatory=$true,Position=0)]
        [string]$BigFile,
        [Parameter(Mandatory=$true,Position=1)]
        [Object[]]$IgnorePatterns
    )

    # Create the pattern matches
    $regex = ($IgnorePatterns | ForEach-Object{[regex]::Escape($_)}) -join "|"

    If(Test-Path $BigFile){
        # Prepare the StreamReader
        $reader = New-Object System.IO.StreamReader($BigFile)

        #Prepare the StreamWriter
        $writer = New-Object System.IO.StreamWriter("CleansedBigFile.txt")

        $line=$reader.ReadLine()
        while ($line -ne $null)
        {
            # Check if the line should be output to file
            If($line -notmatch $regex){$writer.WriteLine($line)}

            # Attempt to read the next line. 
            $line=$reader.ReadLine()
        }

        # Don't cross the streams!
        $reader.Close()
        $writer.Close()

    } Else {
        Write-Error "Cannot locate: $BigFile"
    }
}

You might need some error handling around the streams, but it does appear to work in place.
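One minimal way to add that protection is a try/finally block, so the file handles are released even if something throws mid-loop. This is just a sketch of the read/write core; the $BigFile and $regex values shown are hypothetical stand-ins for the function's parameters:

```powershell
# Hypothetical sample values; in the function these come from the parameters
$BigFile = Join-Path $PWD 'BigFile.txt'
$regex   = 'XX000029|XX000028'

$reader = New-Object System.IO.StreamReader($BigFile)
$writer = New-Object System.IO.StreamWriter((Join-Path $PWD 'CleansedBigFile.txt'))
try {
    $line = $reader.ReadLine()
    while ($line -ne $null) {
        if ($line -notmatch $regex) { $writer.WriteLine($line) }
        $line = $reader.ReadLine()
    }
}
finally {
    # Runs even if the loop throws, so the streams are always closed
    $reader.Close()
    $writer.Close()
}
```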