How can I split a text file using PowerShell?

2020-01-23 10:35发布

I need to split a large (500 MB) text file (a log4net exception file) into manageable chunks like 100 5 MB files would be fine.

I would think this should be a walk in the park for PowerShell. How can I do it?

标签: powershell
14条回答
▲ chillily
2楼-- · 2020-01-23 11:28

Many of these answers were too slow for my source files. My source files were SQL files between 10 MB and 800 MB that needed to split into files of roughly equal line counts.

I found some of the previous answers which use Add-Content to be quite slow. Waiting many hours for a split to finish wasn't uncommon.

I didn't try Typhlosaurus's answer, but it looks to only do splits by file size, not line count.

The following has suited my purposes.

$sw = new-object System.Diagnostics.Stopwatch
$sw.Start()
Write-Host "Reading source file..."
$lines = [System.IO.File]::ReadAllLines("C:\Temp\SplitTest\source.sql")
$totalLines = $lines.Length

Write-Host "Total Lines :" $totalLines

$skip = 0
$count = 100000; # Number of lines per file

# File counter, with sort friendly name
$fileNumber = 1
$fileNumberString = $filenumber.ToString("000")

while ($skip -le $totalLines) {
    $upper = $skip + $count - 1
    if ($upper -gt ($lines.Length - 1)) {
        $upper = $lines.Length - 1
    }

    # Write the lines
    [System.IO.File]::WriteAllLines("C:\Temp\SplitTest\result$fileNumberString.txt",$lines[($skip..$upper)])

    # Increment counters
    $skip += $count
    $fileNumber++
    $fileNumberString = $filenumber.ToString("000")
}

$sw.Stop()

Write-Host "Split complete in " $sw.Elapsed.TotalSeconds "seconds"

For a 54 MB file, I get the output...

Reading source file...
Total Lines : 910030
Split complete in  1.7056578 seconds

I hope others looking for a simple, line-based splitting script that matches my requirements will find this useful.

查看更多
老娘就宠你
3楼-- · 2020-01-23 11:29

My requirement was a bit different. I often work with Comma Delimited and Tab Delimited ASCII files where a single line is a single record of data. And they're really big, so I need to split them into manageable parts (whilst preserving the header row).

So, I reverted back to my classic VBScript method and bashed together a small .vbs script that can be run on any Windows computer (it gets automatically executed by the WScript.exe script host engine on Window).

The benefit of this method is that it uses Text Streams, so the underlying data isn't loaded into memory (or, at least, not all at once). The result is that it's exceptionally fast and it doesn't really need much memory to run. The test file I just split using this script on my i7 was about 1 GB in file size, had about 12 million lines of text and was split into 25 part files (each with about 500k lines each) – the processing took about 2 minutes and it didn’t go over 3 MB memory used at any point.

The caveat here is that it relies on the text file having "lines" (meaning each record is delimited with a CRLF) as the Text Stream object uses the "ReadLine" function to process a single line at a time. But hey, if you're working with TSV or CSV files, it's perfect.

Option Explicit

Private Const INPUT_TEXT_FILE = "c:\bigtextfile.txt"  
Private Const REPEAT_HEADER_ROW = True                
Private Const LINES_PER_PART = 500000                 

Dim oFileSystem, oInputFile, oOutputFile, iOutputFile, iLineCounter, sHeaderLine, sLine, sFileExt, sStart

sStart = Now()

sFileExt = Right(INPUT_TEXT_FILE,Len(INPUT_TEXT_FILE)-InstrRev(INPUT_TEXT_FILE,".")+1)
iLineCounter = 0
iOutputFile = 1

Set oFileSystem = CreateObject("Scripting.FileSystemObject")
Set oInputFile = oFileSystem.OpenTextFile(INPUT_TEXT_FILE, 1, False)
Set oOutputFile = oFileSystem.OpenTextFile(Replace(INPUT_TEXT_FILE, sFileExt, "_" & iOutputFile & sFileExt), 2, True)

If REPEAT_HEADER_ROW Then
    iLineCounter = 1
    sHeaderLine = oInputFile.ReadLine()
    Call oOutputFile.WriteLine(sHeaderLine)
End If

Do While Not oInputFile.AtEndOfStream
    sLine = oInputFile.ReadLine()
    Call oOutputFile.WriteLine(sLine)
    iLineCounter = iLineCounter + 1
    If iLineCounter Mod LINES_PER_PART = 0 Then
        iOutputFile = iOutputFile + 1
        Call oOutputFile.Close()
        Set oOutputFile = oFileSystem.OpenTextFile(Replace(INPUT_TEXT_FILE, sFileExt, "_" & iOutputFile & sFileExt), 2, True)
        If REPEAT_HEADER_ROW Then
            Call oOutputFile.WriteLine(sHeaderLine)
        End If
    End If
Loop

Call oInputFile.Close()
Call oOutputFile.Close()
Set oFileSystem = Nothing

Call MsgBox("Done" & vbCrLf & "Lines Processed:" & iLineCounter & vbCrLf & "Part Files: " & iOutputFile & vbCrLf & "Start Time: " & sStart & vbCrLf & "Finish Time: " & Now())
查看更多
登录 后发表回答