Sort very large text file in PowerShell

Posted 2019-04-08 18:32

Question:

I have standard Apache log files, between 500 MB and 2 GB in size. I need to sort the lines in them (each line starts with a date in yyyy-MM-dd hh:mm:ss format, so no pre-processing is necessary for sorting).

The simplest and most obvious thing that comes to mind is

 Get-Content unsorted.txt | sort | get-unique > sorted.txt

I am guessing (without having tried it) that doing this with Get-Content would take forever on my 1 GB files. I don't quite know my way around System.IO.StreamReader, but I'm curious whether an efficient solution could be put together using it.

Thanks to anyone who might have a more efficient idea.

[edit]

I tried this subsequently, and it took a very long time: about 10 minutes for 400 MB.

Answer 1:

Get-Content is terribly inefficient for reading large files, and Sort-Object is not very fast either.

Let's set up a baseline:

$sw = [System.Diagnostics.Stopwatch]::StartNew();
$c = Get-Content .\log3.txt -Encoding Ascii
$sw.Stop();
Write-Output ("Reading took {0}" -f $sw.Elapsed);

$sw = [System.Diagnostics.Stopwatch]::StartNew();
$s = $c | Sort-Object;
$sw.Stop();
Write-Output ("Sorting took {0}" -f $sw.Elapsed);

$sw = [System.Diagnostics.Stopwatch]::StartNew();
$u = $s | Get-Unique
$sw.Stop();
Write-Output ("uniq took {0}" -f $sw.Elapsed);

$sw = [System.Diagnostics.Stopwatch]::StartNew();
$u | Out-File 'result.txt' -Encoding ascii
$sw.Stop();
Write-Output ("saving took {0}" -f $sw.Elapsed);

With a 40 MB file of 1.6 million lines (100k unique lines, each repeated 16 times), this script produces the following output on my machine:

Reading took 00:02:16.5768663
Sorting took 00:02:04.0416976
uniq took 00:01:41.4630661
saving took 00:00:37.1630663

Totally unimpressive: more than 6 minutes to sort a tiny file. Every step can be improved a lot. Let's use a StreamReader to read the file line by line into a HashSet, which removes duplicates, then copy the data into a List and sort it there, and finally use a StreamWriter to write the results back out.

# Read the file line by line, adding each line to a HashSet to drop duplicates
$hs = new-object System.Collections.Generic.HashSet[string]
$sw = [System.Diagnostics.Stopwatch]::StartNew();
$reader = [System.IO.File]::OpenText("D:\log3.txt")
try {
    while (($line = $reader.ReadLine()) -ne $null)
    {
        $t = $hs.Add($line)   # Add returns $false for duplicates; assigning to $t just suppresses the output
    }
}
finally {
    $reader.Close()
}
$sw.Stop();
Write-Output ("read-uniq took {0}" -f $sw.Elapsed);

$sw = [System.Diagnostics.Stopwatch]::StartNew();
$ls = new-object system.collections.generic.List[string] $hs;
$ls.Sort();
$sw.Stop();
Write-Output ("sorting took {0}" -f $sw.Elapsed);

$sw = [System.Diagnostics.Stopwatch]::StartNew();
# Create the writer before the try block so the finally clause never
# tries to close a writer that failed to construct
$f = New-Object System.IO.StreamWriter "d:\result2.txt";
try
{
    foreach ($s in $ls)
    {
        $f.WriteLine($s);
    }
}
finally
{
    $f.Close();
}
$sw.Stop();
Write-Output ("saving took {0}" -f $sw.Elapsed);

This script produces:

read-uniq took 00:00:32.2225181
sorting took 00:00:00.2378838
saving took 00:00:01.0724802

On the same input file it runs more than 10 times faster. I am still surprised, though, that it takes 30 seconds to read the file from disk.
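
If the per-line ReadLine loop is a concern, a variation worth trying (a sketch only, not benchmarked here; it assumes the whole file fits comfortably in memory and .NET 4 or later, and result3.txt is just a placeholder name) is to let the framework read all lines at once and build the HashSet directly from the resulting array:

$lines = [System.IO.File]::ReadAllLines('D:\log3.txt', [System.Text.Encoding]::ASCII)
$hs = New-Object 'System.Collections.Generic.HashSet[string]' (,$lines)   # de-duplicate
$ls = New-Object 'System.Collections.Generic.List[string]' $hs            # copy into a sortable list
$ls.Sort()
[System.IO.File]::WriteAllLines('D:\result3.txt', $ls)                    # WriteAllLines accepts IEnumerable[string] on .NET 4+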



Answer 2:

If each line of the log is prefixed with a timestamp, and the log messages don't contain embedded newlines (which would require special handling), I think it would take less memory and execution time to convert the timestamp from [String] to [DateTime] before sorting. The following assumes each log entry is of the format yyyy-MM-dd HH:mm:ss: <Message> (note that the HH format specifier is used for a 24-hour clock):

Get-Content unsorted.txt `
    | ForEach-Object {
        # Ignore empty lines; can substitute with [String]::IsNullOrWhitespace($_) on PowerShell 3.0 and above
        if (-not [String]::IsNullOrEmpty($_))
        {
            # Split into at most two fields, even if the message itself contains ': '
            [String[]] $fields = $_ -split ': ', 2;

            return New-Object -TypeName 'PSObject' -Property @{
                Timestamp = [DateTime] $fields[0];
                Message   = $fields[1];
            };
        }
    } | Sort-Object -Property 'Timestamp', 'Message';

If you are processing the input file for interactive display purposes you can pipe the above into Out-GridView or Format-Table to view the results. If you need to save the sorted results you can pipe the above into the following:

    | ForEach-Object {
        # Reconstruct the log entry format of the input file
        return '{0:yyyy-MM-dd HH:mm:ss}: {1}' -f $_.Timestamp, $_.Message;
    } `
    | Out-File -Encoding 'UTF8' -FilePath 'sorted.txt';
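
For reference, the two fragments above can be combined into one end-to-end pipeline (a sketch only; the backtick line continuations match the style above and are required on Windows PowerShell, where a pipe cannot start a line):

# Parse each line into Timestamp/Message, sort, then rebuild and save the lines
Get-Content unsorted.txt `
    | ForEach-Object {
        if (-not [String]::IsNullOrEmpty($_))
        {
            [String[]] $fields = $_ -split ': ', 2;
            New-Object -TypeName 'PSObject' -Property @{
                Timestamp = [DateTime] $fields[0];
                Message   = $fields[1];
            };
        }
    } `
    | Sort-Object -Property 'Timestamp', 'Message' `
    | ForEach-Object { '{0:yyyy-MM-dd HH:mm:ss}: {1}' -f $_.Timestamp, $_.Message } `
    | Out-File -Encoding 'UTF8' -FilePath 'sorted.txt';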


Answer 3:

(Edited to be clearer based on n0rd's comments)

It might be a memory issue. Since you're loading the entire file into memory to sort it (and adding the overhead of piping into Sort-Object and then into Get-Unique), it's possible that you're hitting the memory limits of the machine and forcing it to page to disk, which will slow things down a lot. One thing you might consider is splitting the logs up before sorting them, and then splicing them back together.

This probably won't match your format exactly, but if I've got a large log file for, say, 8/16/2012 which spans several hours, I can split it up into a different file for each hour using something like this:

for($i=0; $i -le 23; $i++){ $h = '{0:D2}' -f $i; Get-Content .\u_ex120816.log | ? { $_ -match "^2012-08-16 $h`:" } | Set-Content -Path "$h.log" }

This builds a regular expression for each (zero-padded) hour of that day and dumps all the matching log entries into a smaller log file named after the hour (e.g. 16.log, 17.log).

Then I can run your process of sorting and getting unique entries on much smaller subsets, which should run a lot faster:

for($i=0; $i -le 23; $i++){ $h = '{0:D2}' -f $i; Get-Content "$h.log" | sort | get-unique > "${h}sorted.txt" }

And then you can splice them back together.
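
The splice step is straightforward, because each hourly file covers a disjoint time range: concatenating the sorted per-hour files in hour order gives a fully sorted output. A minimal sketch, assuming the file names produced by the loops above:

Remove-Item 'sorted.txt' -ErrorAction SilentlyContinue
for ($i = 0; $i -le 23; $i++) {
    $h = '{0:D2}' -f $i
    if (Test-Path "${h}sorted.txt") {
        # Append each hour's sorted, de-duplicated entries in order
        Get-Content "${h}sorted.txt" | Add-Content -Path 'sorted.txt'
    }
}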

Depending on the frequency of the logs, it might make more sense to split them by day, or minute; the main thing is to get them into more manageable chunks for sorting.

Again, this only makes sense if you're hitting the memory limits of the machine (or if Sort-Object is using a really inefficient algorithm).