I am running the following PowerShell script to concatenate a series of output files into a single CSV file. The files are named whidataXX.htm (where XX is a two-digit sequential number), and the number of files created varies from run to run.
$metadataPath = "\\ServerPath\foo"
function concatenateMetadata {
    $cFile = $metadataPath + "whiconcat.csv"
    Clear-Content $cFile
    $metadataFiles = gci $metadataPath
    $iterations = $metadataFiles.Count
    for ($i=0;$i -le $iterations-1;$i++) {
        $iFile = "whidata"+$i+".htm"
        $FileExists = (Test-Path $metadataPath$iFile -PathType Leaf)
        if (!($FileExists))
        {
            break
        }
        elseif ($FileExists)
        {
            Write-Host "Adding " $metadataPath$iFile
            Get-Content $metadataPath$iFile | Out-File $cFile -append
            Write-Host "to" $cFile
        }
    }
}
The whidataXX.htm files are encoded as UTF-8, but my output file is encoded as UTF-16. When I view the file in Notepad, it appears correct, but when I view it in a hex editor, the hex value 00 appears between each character, and when I pull the file into a Java program for processing, the file prints to the console with extra spaces between c h a r a c t e r s.
First, is this normal for PowerShell, or is there something in the source files that would cause this?
Second, how would I fix this encoding problem in the code noted above?
First, the fact that you get two bytes per character indicates that a 16-bit encoding is being used: UTF-16, whose fixed-length form is more precisely called UCS-2. This article explains that file redirection in PowerShell causes the output to be written in that encoding, and it also provides a fix: http://www.kongsli.net/nblog/2012/04/20/powershell-gotchas-redirect-to-file-encodes-in-unicode/
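The Out-* cmdlets (like Out-File) format the data before writing it, and in Windows PowerShell their default output encoding is "Unicode", which is PowerShell's name for little-endian UTF-16.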
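To see the 00 bytes from the question directly, a minimal sketch along these lines dumps the raw bytes of a file written by Out-File. The scratch file name is hypothetical, and the encoding is passed explicitly as Unicode so the result matches the Windows PowerShell default even on newer PowerShell versions, whose default differs.

```powershell
# Write a short string with the encoding Windows PowerShell's Out-File uses
# by default ("Unicode" = UTF-16LE). It is spelled out here so the result is
# the same on PowerShell versions with a different default.
$sample = Join-Path ([System.IO.Path]::GetTempPath()) 'encoding-sample.txt'   # hypothetical scratch file
'abc' | Out-File $sample -Encoding Unicode

# Read the file back as raw bytes and print them as two-digit hex values.
$bytes = [System.IO.File]::ReadAllBytes($sample)
$hex = ($bytes | ForEach-Object { $_.ToString('X2') }) -join ' '
Write-Host $hex
```

The dump should begin with the byte-order mark FF FE, followed by 61 00 62 00 63 00: each ASCII character is followed by a 00 byte, which is what the hex editor shows.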
You can add an -Encoding parameter to Out-File (for example, -Encoding UTF8), or switch to Add-Content, which doesn't re-format the data.
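Both options can be sketched as follows. This is an illustration under assumptions rather than a drop-in replacement for the script in the question: the scratch folder, the two sample input files, and the output file names are all hypothetical, and the sample content is ASCII-only.

```powershell
# Scratch folder and sample inputs (hypothetical names, ASCII-only content).
$dir = Join-Path ([System.IO.Path]::GetTempPath()) 'whi-encoding-demo'
New-Item -ItemType Directory -Path $dir -Force | Out-Null
Set-Content -Path (Join-Path $dir 'whidata0.htm') -Value 'a,b,c' -Encoding UTF8
Set-Content -Path (Join-Path $dir 'whidata1.htm') -Value 'd,e,f' -Encoding UTF8

$viaOutFile    = Join-Path $dir 'concat-outfile.csv'
$viaAddContent = Join-Path $dir 'concat-addcontent.csv'
Remove-Item $viaOutFile, $viaAddContent -ErrorAction SilentlyContinue

foreach ($name in 'whidata0.htm', 'whidata1.htm') {
    $iFile = Join-Path $dir $name
    # Option 1: keep Out-File, but state the encoding explicitly.
    Get-Content $iFile | Out-File $viaOutFile -Encoding UTF8 -Append
    # Option 2: Add-Content appends the strings without Out-File's formatting step.
    Get-Content $iFile | Add-Content $viaAddContent
}

# Neither result should contain the 00 bytes seen in the UTF-16 output.
foreach ($f in $viaOutFile, $viaAddContent) {
    $nulCount = @([System.IO.File]::ReadAllBytes($f) | Where-Object { $_ -eq 0 }).Count
    Write-Host "$f : $nulCount NUL bytes"
}
```

One caveat: in Windows PowerShell, -Encoding UTF8 writes a byte-order mark at the start of a new file, so a consumer such as the Java program may need to skip it. Add-Content also has its own default encoding, so passing -Encoding to it explicitly is the safer choice when the input contains non-ASCII characters.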