Convert XML from Latin-1 to UTF-8 and the other way around

Posted 2019-01-29 13:23

Question:

I am trying to convert an XML file from Latin1 to UTF-8 and the other way around. I have run some tests, but I have not managed to get it working. I'm using:

Get-Content C:\inputfile.xml | Set-Content -Encoding utf8 C:\outputfile.xml

But this is not converting anything. So I tried to specify the encoding in Get-Content, but Latin1 is not recognized in PowerShell (or that's what the error message is telling me). What's the best way to do this?

Answer 1:

The fastest method, especially with large XML files, is to use the .NET System.IO.File class.

  • Use ReadAllText with explicitly provided Latin-1 encoding:

    [IO.File]::ReadAllText('r:\1.txt', [Text.Encoding]::GetEncoding('iso-8859-1')) | 
        Set-Content r:\2.txt -Encoding UTF8
    
  • If your XML file declares its encoding, as in <?xml version="1.0" encoding="iso-8859-1" ?>, the declaration needs to be changed too:

    [IO.File]::ReadAllText('r:\1.txt', [Text.Encoding]::GetEncoding('iso-8859-1')).
        Replace('<?xml version="1.0" encoding="iso-8859-1"',
                '<?xml version="1.0" encoding="UTF-8"') | 
        Set-Content r:\2.txt -Encoding UTF8
    
  • To write Latin-1, use WriteAllText with an explicitly provided Latin-1 encoding:

    [IO.File]::WriteAllText(
        'r:\2.txt',
        [IO.File]::ReadAllText('r:\1.txt', [Text.Encoding]::UTF8).
            Replace('<?xml version="1.0" encoding="UTF-8"',
                    '<?xml version="1.0" encoding="iso-8859-1"'),
        [Text.Encoding]::GetEncoding('iso-8859-1')
    )
    
  • Memory-efficient transcoding that can process files of any size (1TB? no problem!):

    function transcodeXML(
        [ValidateScript({Test-Path -Literal $_})]
        [string]$source,
        [ValidateSet('IBM037', 'IBM437', 'IBM500', 'ASMO-708', 'DOS-720', 'ibm737', 'ibm775', 'ibm850', 'ibm852', 'IBM855', 'ibm857', 'IBM00858', 'IBM860', 'ibm861', 'DOS-862', 'IBM863', 'IBM864', 'IBM865', 'cp866', 'ibm869', 'IBM870', 'windows-874', 'cp875', 'shift_jis', 'gb2312', 'ks_c_5601-1987', 'big5', 'IBM1026', 'IBM01047', 'IBM01140', 'IBM01141', 'IBM01142', 'IBM01143', 'IBM01144', 'IBM01145', 'IBM01146', 'IBM01147', 'IBM01148', 'IBM01149', 'utf-16', 'utf-16BE', 'windows-1250', 'windows-1251', 'Windows-1252', 'windows-1253', 'windows-1254', 'windows-1255', 'windows-1256', 'windows-1257', 'windows-1258', 'Johab', 'macintosh', 'x-mac-japanese', 'x-mac-chinesetrad', 'x-mac-korean', 'x-mac-arabic', 'x-mac-hebrew', 'x-mac-greek', 'x-mac-cyrillic', 'x-mac-chinesesimp', 'x-mac-romanian', 'x-mac-ukrainian', 'x-mac-thai', 'x-mac-ce', 'x-mac-icelandic', 'x-mac-turkish', 'x-mac-croatian', 'utf-32', 'utf-32BE', 'x-Chinese-CNS', 'x-cp20001', 'x-Chinese-Eten', 'x-cp20003', 'x-cp20004', 'x-cp20005', 'x-IA5', 'x-IA5-German', 'x-IA5-Swedish', 'x-IA5-Norwegian', 'us-ascii', 'x-cp20261', 'x-cp20269', 'IBM273', 'IBM277', 'IBM278', 'IBM280', 'IBM284', 'IBM285', 'IBM290', 'IBM297', 'IBM420', 'IBM423', 'IBM424', 'x-EBCDIC-KoreanExtended', 'IBM-Thai', 'koi8-r', 'IBM871', 'IBM880', 'IBM905', 'IBM00924', 'EUC-JP', 'x-cp20936', 'x-cp20949', 'cp1025', 'koi8-u', 'iso-8859-1', 'iso-8859-2', 'iso-8859-3', 'iso-8859-4', 'iso-8859-5', 'iso-8859-6', 'iso-8859-7', 'iso-8859-8', 'iso-8859-9', 'iso-8859-13', 'iso-8859-15', 'x-Europa', 'iso-8859-8-i', 'iso-2022-jp', 'csISO2022JP', 'iso-2022-jp', 'iso-2022-kr', 'x-cp50227', 'euc-jp', 'EUC-CN', 'euc-kr', 'hz-gb-2312', 'GB18030', 'x-iscii-de', 'x-iscii-be', 'x-iscii-ta', 'x-iscii-te', 'x-iscii-as', 'x-iscii-or', 'x-iscii-ka', 'x-iscii-ma', 'x-iscii-gu', 'x-iscii-pa', 'utf-7', 'utf-8')]
        [string]$sourceEncoding,
    
        [ValidateScript({Test-Path -Literal $_ -IsValid})]
        [string]$target,
        [ValidateSet('IBM037', 'IBM437', 'IBM500', 'ASMO-708', 'DOS-720', 'ibm737', 'ibm775', 'ibm850', 'ibm852', 'IBM855', 'ibm857', 'IBM00858', 'IBM860', 'ibm861', 'DOS-862', 'IBM863', 'IBM864', 'IBM865', 'cp866', 'ibm869', 'IBM870', 'windows-874', 'cp875', 'shift_jis', 'gb2312', 'ks_c_5601-1987', 'big5', 'IBM1026', 'IBM01047', 'IBM01140', 'IBM01141', 'IBM01142', 'IBM01143', 'IBM01144', 'IBM01145', 'IBM01146', 'IBM01147', 'IBM01148', 'IBM01149', 'utf-16', 'utf-16BE', 'windows-1250', 'windows-1251', 'Windows-1252', 'windows-1253', 'windows-1254', 'windows-1255', 'windows-1256', 'windows-1257', 'windows-1258', 'Johab', 'macintosh', 'x-mac-japanese', 'x-mac-chinesetrad', 'x-mac-korean', 'x-mac-arabic', 'x-mac-hebrew', 'x-mac-greek', 'x-mac-cyrillic', 'x-mac-chinesesimp', 'x-mac-romanian', 'x-mac-ukrainian', 'x-mac-thai', 'x-mac-ce', 'x-mac-icelandic', 'x-mac-turkish', 'x-mac-croatian', 'utf-32', 'utf-32BE', 'x-Chinese-CNS', 'x-cp20001', 'x-Chinese-Eten', 'x-cp20003', 'x-cp20004', 'x-cp20005', 'x-IA5', 'x-IA5-German', 'x-IA5-Swedish', 'x-IA5-Norwegian', 'us-ascii', 'x-cp20261', 'x-cp20269', 'IBM273', 'IBM277', 'IBM278', 'IBM280', 'IBM284', 'IBM285', 'IBM290', 'IBM297', 'IBM420', 'IBM423', 'IBM424', 'x-EBCDIC-KoreanExtended', 'IBM-Thai', 'koi8-r', 'IBM871', 'IBM880', 'IBM905', 'IBM00924', 'EUC-JP', 'x-cp20936', 'x-cp20949', 'cp1025', 'koi8-u', 'iso-8859-1', 'iso-8859-2', 'iso-8859-3', 'iso-8859-4', 'iso-8859-5', 'iso-8859-6', 'iso-8859-7', 'iso-8859-8', 'iso-8859-9', 'iso-8859-13', 'iso-8859-15', 'x-Europa', 'iso-8859-8-i', 'iso-2022-jp', 'csISO2022JP', 'iso-2022-jp', 'iso-2022-kr', 'x-cp50227', 'euc-jp', 'EUC-CN', 'euc-kr', 'hz-gb-2312', 'GB18030', 'x-iscii-de', 'x-iscii-be', 'x-iscii-ta', 'x-iscii-te', 'x-iscii-as', 'x-iscii-or', 'x-iscii-ka', 'x-iscii-ma', 'x-iscii-gu', 'x-iscii-pa', 'utf-7', 'utf-8')]
        [string]$targetEncoding
    ) {
        $reader = $writer = $null
        try {
            $reader = [IO.StreamReader]::new(
                $source,
                [Text.Encoding]::GetEncoding($sourceEncoding)
            )
            $writer = [IO.StreamWriter]::new(
                $target,
                $false, # don't append = overwrite
                [Text.Encoding]::GetEncoding($targetEncoding)
            )
            $buf = [char[]]::new(16MB)
    
            # the XML declaration, if any, is in the first chunk
            $nRead = $reader.Read($buf, 0, $buf.Length)
            $writer.Write(
                ([regex]"(<\?xml [^>]*?encoding="")(?i)$([regex]::Escape($sourceEncoding))(?="")").Replace(
                    [string]::new($buf, 0, $nRead),
                    '${1}' + $targetEncoding,
                    1 # speedup: one replacement only
                )
            )
            # copy the rest unchanged, one buffer at a time
            while (!$reader.EndOfStream) {
                $nRead = $reader.Read($buf, 0, $buf.Length)
                $writer.Write($buf, 0, $nRead)
            }
        } finally {
            # release the file handles even if an error occurred
            if ($writer) { $writer.Close() }
            if ($reader) { $reader.Close() }
        }
    }
    

    Usage:

    transcodeXML 'r:\1.xml' iso-8859-1 'r:\2.xml' utf-8
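
One way to check that a conversion like the ones above really changed the bytes is to compare the files at the byte level. The following is only a sketch under assumptions: the temp file names and the sample content are made up for illustration, and it writes UTF-8 through WriteAllText with a BOM-less UTF8Encoding instead of Set-Content, which is a swapped-in variation rather than part of the answer above.

```powershell
# Sketch: Latin-1 -> UTF-8 without a byte order mark, then a byte-level check.
# The temp file names and sample content are hypothetical.
$latin1    = [Text.Encoding]::GetEncoding('iso-8859-1')
$utf8NoBom = [Text.UTF8Encoding]::new($false)   # $false = do not emit a BOM

$src = Join-Path ([IO.Path]::GetTempPath()) 'latin1-sample.xml'
$dst = Join-Path ([IO.Path]::GetTempPath()) 'utf8-sample.xml'

# Build a small Latin-1 test file; [char]0xE9 is "é", a single byte in Latin-1.
$xml = '<?xml version="1.0" encoding="iso-8859-1"?><a>caf' + [char]0xE9 + '</a>'
[IO.File]::WriteAllText($src, $xml, $latin1)

# Read as Latin-1, fix the declaration, write as UTF-8.
$text = [IO.File]::ReadAllText($src, $latin1).Replace('encoding="iso-8859-1"', 'encoding="UTF-8"')
[IO.File]::WriteAllText($dst, $text, $utf8NoBom)

# "é" is one byte in Latin-1 (0xE9) and two bytes in UTF-8 (0xC3 0xA9).
$srcBytes = [IO.File]::ReadAllBytes($src)
$dstBytes = [IO.File]::ReadAllBytes($dst)
```

Comparing $srcBytes and $dstBytes (for example with Format-Hex) shows where the two encodings differ. In Windows PowerShell 5.1, Set-Content -Encoding UTF8 writes a byte order mark, so the WriteAllText variant may be preferable when the consumer of the file does not expect one.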
    


Answer 2:

I would suggest loading the XML into a System.Xml.Linq.XDocument with the Load method and then changing the Encoding property of its Declaration property (https://msdn.microsoft.com/en-us/library/system.xml.linq.xdocument.declaration(v=vs.110).aspx) as needed, or adding a declaration if Declaration is null. Finally, you can use the Save method to save the changed document.
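
A minimal sketch of that approach in PowerShell might look like the following. The file names and sample content are hypothetical, and the Add-Type line is an assumption about what Windows PowerShell needs before the System.Xml.Linq types resolve.

```powershell
# Sketch of the XDocument approach; file names and sample content are hypothetical.
Add-Type -AssemblyName System.Xml.Linq   # assumed to be needed in Windows PowerShell

$src = Join-Path ([IO.Path]::GetTempPath()) 'xdoc-latin1.xml'
$dst = Join-Path ([IO.Path]::GetTempPath()) 'xdoc-utf8.xml'

# Create a small Latin-1 input file; [char]0xE9 is "é".
$sample = '<?xml version="1.0" encoding="iso-8859-1"?><a>caf' + [char]0xE9 + '</a>'
[IO.File]::WriteAllText($src, $sample, [Text.Encoding]::GetEncoding('iso-8859-1'))

# Load decodes the file using the encoding named in its declaration.
$doc = [System.Xml.Linq.XDocument]::Load($src)

if ($null -eq $doc.Declaration) {
    # [NullString]::Value passes a real null for "standalone" instead of ''.
    $doc.Declaration = [System.Xml.Linq.XDeclaration]::new('1.0', 'utf-8', [NullString]::Value)
} else {
    $doc.Declaration.Encoding = 'utf-8'
}

# Save writes the file in the encoding named in the declaration.
$doc.Save($dst)
```

Because the document is parsed rather than treated as text, this variant does not depend on how the declaration is spelled or quoted, but it holds the whole document in memory, so it is probably less suitable for very large files than the streaming function in the first answer. Setting the Encoding to 'iso-8859-1' instead should give the conversion in the other direction.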