I have standard Apache log files, between 500 MB and 2 GB in size. I need to sort the lines in them (each line starts with a date in the form yyyy-MM-dd hh:mm:ss, so no pre-processing is necessary for sorting).
The simplest and most obvious thing that comes to mind is
Get-Content unsorted.txt | sort | get-unique > sorted.txt
I am guessing (without having tried it) that doing this with Get-Content would take forever on my 1 GB files. I don't quite know my way around System.IO.StreamReader, but I'm curious whether an efficient solution could be put together using it?
Thanks to anyone who might have a more efficient idea.
[edit]
I tried this subsequently, and it took a very long time; some 10 minutes for 400 MB.
Get-Content is terribly inefficient for reading large files. Sort-Object is not very fast, either.
Let's set up a baseline:
$sw = [System.Diagnostics.Stopwatch]::StartNew();
$c = Get-Content .\log3.txt -Encoding Ascii
$sw.Stop();
Write-Output ("Reading took {0}" -f $sw.Elapsed);

$sw = [System.Diagnostics.Stopwatch]::StartNew();
$s = $c | Sort-Object;
$sw.Stop();
Write-Output ("Sorting took {0}" -f $sw.Elapsed);

$sw = [System.Diagnostics.Stopwatch]::StartNew();
$u = $s | Get-Unique
$sw.Stop();
Write-Output ("uniq took {0}" -f $sw.Elapsed);

$sw = [System.Diagnostics.Stopwatch]::StartNew();
$u | Out-File 'result.txt' -Encoding ascii
$sw.Stop();
Write-Output ("saving took {0}" -f $sw.Elapsed);
With a 40 MB file of 1.6 million lines (100k unique lines, each repeated 16 times), this script produces the following output on my machine:
Reading took 00:02:16.5768663
Sorting took 00:02:04.0416976
uniq took 00:01:41.4630661
saving took 00:00:37.1630663
Totally unimpressive: more than 6 minutes to sort a tiny file. Every step can be improved a lot. Let's use a StreamReader to read the file line by line into a HashSet, which removes duplicates as it goes, then copy the data into a List and sort it there, and finally use a StreamWriter to dump the results back out.
# Read the file line by line into a HashSet, which discards duplicates on insert
$hs = New-Object System.Collections.Generic.HashSet[string]
$sw = [System.Diagnostics.Stopwatch]::StartNew();
$reader = [System.IO.File]::OpenText("D:\log3.txt")
try {
    while (($line = $reader.ReadLine()) -ne $null)
    {
        [void]$hs.Add($line)   # Add returns $false for duplicates; discard the result
    }
}
finally {
    $reader.Close()
}
$sw.Stop();
Write-Output ("read-uniq took {0}" -f $sw.Elapsed);
$sw = [System.Diagnostics.Stopwatch]::StartNew();
# Copy the unique lines into a List and sort in place
$ls = New-Object System.Collections.Generic.List[string] $hs;
$ls.Sort();
$sw.Stop();
Write-Output ("sorting took {0}" -f $sw.Elapsed);
$sw = [System.Diagnostics.Stopwatch]::StartNew();
try
{
    # Write the sorted lines back out with a StreamWriter
    $f = New-Object System.IO.StreamWriter "d:\result2.txt";
    foreach ($s in $ls)
    {
        $f.WriteLine($s);
    }
}
finally
{
    # Guard against $f being null if the StreamWriter constructor threw
    if ($f -ne $null) { $f.Close(); }
}
$sw.Stop();
Write-Output ("saving took {0}" -f $sw.Elapsed);
This script produces:
read-uniq took 00:00:32.2225181
sorting took 00:00:00.2378838
saving took 00:00:01.0724802
On the same input file it runs more than 10 times faster. I am still surprised, though, that it takes 30 seconds to read the file from disk.
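As a further variation (a sketch, untested, assuming .NET 4 or later): since the goal is sorted unique lines, a SortedSet can replace the HashSet-plus-List combination and do the deduplication and ordering in a single pass, with [System.IO.File]::ReadLines streaming the input:
# Single-pass variant: a SortedSet deduplicates and keeps lines ordered on insert.
# Paths follow the example above; D:\result3.txt is a placeholder output name.
$set = New-Object 'System.Collections.Generic.SortedSet[string]'
foreach ($line in [System.IO.File]::ReadLines('D:\log3.txt')) {
    [void]$set.Add($line)
}
[System.IO.File]::WriteAllLines('D:\result3.txt', $set)
Sorted insertion costs O(log n) per line instead of one List.Sort at the end, so whether this is actually faster would need measuring; it mainly saves the intermediate copy.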
If each line of the log is prefixed with a timestamp, and the log messages don't contain embedded newlines (which would require special handling), I think it would take less memory and execution time to convert the timestamp from [String] to [DateTime] before sorting. The following assumes each log entry is of the format yyyy-MM-dd HH:mm:ss: <Message> (note that the HH format specifier is used for a 24-hour clock):
Get-Content unsorted.txt `
    | ForEach-Object {
        # Ignore empty lines; can substitute with [String]::IsNullOrWhiteSpace($_) on PowerShell 3.0 and above
        if (-not [String]::IsNullOrEmpty($_))
        {
            # Split into at most two fields, even if the message itself contains ': '
            [String[]] $fields = $_ -split ': ', 2;
            return New-Object -TypeName 'PSObject' -Property @{
                Timestamp = [DateTime] $fields[0];
                Message   = $fields[1];
            };
        }
    } `
    | Sort-Object -Property 'Timestamp', 'Message';
If you are processing the input file for interactive display purposes, you can pipe the above into Out-GridView or Format-Table to view the results. If you need to save the sorted results, you can pipe the above into the following:
    | ForEach-Object {
        # Reconstruct the log entry format of the input file
        return '{0:yyyy-MM-dd HH:mm:ss}: {1}' -f $_.Timestamp, $_.Message;
    } `
    | Out-File -Encoding 'UTF8' -FilePath 'sorted.txt';
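Assembled, the two fragments chain together like this (the same logic compacted into one pipeline; note that the trailing ; after Sort-Object has to be replaced with a backtick continuation):
Get-Content unsorted.txt `
    | ForEach-Object { if (-not [String]::IsNullOrEmpty($_)) { [String[]] $fields = $_ -split ': ', 2; New-Object PSObject -Property @{ Timestamp = [DateTime] $fields[0]; Message = $fields[1] } } } `
    | Sort-Object -Property 'Timestamp', 'Message' `
    | ForEach-Object { '{0:yyyy-MM-dd HH:mm:ss}: {1}' -f $_.Timestamp, $_.Message } `
    | Out-File -Encoding 'UTF8' -FilePath 'sorted.txt';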
(Edited to be clearer based on n0rd's comments)
It might be a memory issue. Since you're loading the entire file into memory to sort it (and adding the overhead of the pipe into Sort-Object and the pipe into Get-Unique), it's possible that you're hitting the memory limits of the machine and forcing it to page to disk, which will slow things down a lot. One thing you might consider is splitting the logs up before sorting them, and then splicing them back together.
This probably won't match your format exactly, but if I've got a large log file for, say, 8/16/2012 which spans several hours, I can split it up into a different file for each hour using something like this:
for ($i = 0; $i -le 23; $i++) {
    # Zero-pad the hour so it matches the log's HH timestamp format (e.g. '05', not '5')
    $hour = '{0:D2}' -f $i
    Get-Content .\u_ex120816.log | ? { $_ -match "^2012-08-16 ${hour}:" } | Set-Content -Path "$i.log"
}
This builds a match pattern for each hour of that day (zero-padded to match the log's timestamp format) and dumps all the matching log entries into a smaller log file named by the hour (e.g. 16.log, 17.log).
Then I can run your process of sorting and getting unique entries on much smaller subsets, which should run a lot faster:
# Note: ${i} needs braces so PowerShell doesn't parse "$isorted" as one variable name
for ($i = 0; $i -le 23; $i++) { Get-Content "$i.log" | sort | get-unique > "${i}sorted.txt" }
And then you can splice them back together.
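A minimal sketch of that splice step (assuming the hourly ${i}sorted.txt files produced above, and that concatenating in hour order preserves chronological order):
# Rebuild one sorted file by concatenating the hourly pieces in order.
# The output file name is a placeholder.
$merged = 'u_ex120816-sorted.txt'
Remove-Item $merged -ErrorAction SilentlyContinue
for ($i = 0; $i -le 23; $i++) {
    $part = "${i}sorted.txt"
    if (Test-Path $part) {
        Get-Content $part | Add-Content -Path $merged
    }
}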
Depending on the frequency of the logs, it might make more sense to split them by day or by minute; the main thing is to get them into more manageable chunks for sorting.
Again, this only makes sense if you're hitting the memory limits of the machine (or if Sort-Object is using a really inefficient algorithm).
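If you want to confirm that memory pressure is actually the problem, one rough check (a sketch; unsorted.txt stands in for your real log file) is to snapshot the PowerShell process's working set around the sort:
# Rough memory check: compare the working set before and after sorting.
$proc = Get-Process -Id $PID
'Working set before: {0:N0} MB' -f ($proc.WorkingSet64 / 1MB)
$sorted = Get-Content .\unsorted.txt | Sort-Object
$proc.Refresh()
'Working set after:  {0:N0} MB' -f ($proc.WorkingSet64 / 1MB)
If the after figure is near the machine's physical RAM, chunking as above should help.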