Perl or Powershell how to convert from UCS-2 littl

Posted 2019-08-26 18:54

Question:

I'm using ActivePerl on Windows and I can never get a UCS-2 little-endian file to convert properly to UTF-8. The best I could muster looks like a proper conversion, except that the first line, which is 4 characters long, is mangled into strange Chinese/Japanese characters; the rest of the file seems OK.

What I really want is to run the usual Perl one-liner search/replace regex:

perl -pi.bak -e 's/replacethis/withthat/g;' my_ucs2file.txt

That doesn't work, so I first tried to see whether Perl can do a proper conversion, and I'm stuck. I'm using:

perl -i.BAKS -MEncode -p -e "Encode::from_to($_, 'UCS-2', 'UTF-8')" My_UCS2file.txt

I tried using UCS2 or UCS-2LE but still can't get a proper conversion.

I recall somewhere someone had to delete a couple bits or something at the beginning of a UCS2 file to get conversion working but I can't remember...

When I tried PowerShell, it complained that it didn't know UCS2 / UCS-2.

I'd appreciate any ideas. I noticed that Notepad++ opens the file and recognizes the encoding fine, and I can edit and re-save it there, but that has no command-line ability...

Answer 1:

The one-liner way is to avoid Perl entirely and just use iconv -f UCS-2LE -t UTF-8 infile > outfile, but I'm not sure whether that's available on Windows.

So, with Perl as a one-liner:

$ perl -Mopen="IN,:encoding(UCS-2LE),:std" -C2 -0777 -pe 1 infile > outfile
  • -0777 combined with -p reads each file in one go instead of a line at a time, which is one place where you were going wrong: when your code units are 16 bits wide but you treat them as 8-bit ones, finding the line separators is going to be problematic. (In UCS-2LE a newline is the byte pair 0A 00, so splitting on the single byte 0A leaves a stray 00 at the start of the next line.)
  • -C2 says to use UTF-8 for standard output.
  • -Mopen="IN,:encoding(UCS-2LE),:std" says that the default encoding for input streams, including standard input (so it works with redirected input, not just files), is UCS-2LE. See the open pragma for details; in a script it would be use open IN => ':encoding(UCS-2LE)', ':std';. Speaking of encodings, another issue you're having is that in Perl's Encode module UCS-2 is a synonym for UCS-2BE. See Encode::Unicode for details.
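To make the UCS-2 versus UCS-2LE point concrete, here is a minimal sketch that decodes the same two bytes both ways. The helper name first_codepoint is made up for illustration; only Encode's decode is a real API.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper (not from the answer): decode a byte string with the
# named encoding and return the code point of its first character.
sub first_codepoint {
    my ($encoding, $bytes) = @_;
    return ord(decode($encoding, $bytes));
}

my $bytes = "A\x00";    # the letter "A" as it is stored in a UCS-2LE file

printf "UCS-2LE: U+%04X\n", first_codepoint('UCS-2LE', $bytes);    # U+0041
printf "UCS-2:   U+%04X\n", first_codepoint('UCS-2',   $bytes);    # U+4100
```

U+4100 falls in the CJK ideograph range, which would be consistent with the "Chinese/Japanese characters" seen when little-endian data is decoded as big-endian.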

So that one-liner just reads each file in one go, converting from UCS-2LE to Perl's internal encoding, and prints it back out as UTF-8.
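For anyone who prefers a script to a one-liner, the same read-as-UCS-2LE, write-as-UTF-8 idea might look roughly like this. It is a sketch, not the answer's own code: the function name ucs2le_to_utf8 is hypothetical, and stripping a leading byte order mark (U+FEFF) is an added assumption that the one-liner above does not make.

```perl
use strict;
use warnings;

# Hypothetical helper: convert one UCS-2LE file to UTF-8.
sub ucs2le_to_utf8 {
    my ($in_path, $out_path) = @_;

    # :raw keeps the platform's CRLF translation out of the way, so the
    # encoding layer sees the 16-bit code units untouched.
    open my $in, '<:raw:encoding(UCS-2LE)', $in_path
        or die "Cannot read $in_path: $!";
    open my $out, '>:raw:encoding(UTF-8)', $out_path
        or die "Cannot write $out_path: $!";

    local $/;                    # slurp mode, like -0777
    my $text = <$in>;
    $text =~ s/\A\x{FEFF}//;     # assumption: drop a leading BOM if present

    print {$out} $text;
    close $out or die "Cannot close $out_path: $!";
    close $in;
    return;
}

# Usage: perl convert.pl infile outfile
ucs2le_to_utf8(@ARGV) if @ARGV == 2;
```

Line endings pass through unchanged here, so a CRLF file stays CRLF in the UTF-8 output.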

If you didn't have to worry about Windows line ending conversion,

$ perl -MEncode -0777 -pe 'Encode::from_to($_, "UCS-2LE", "UTF-8")' infile > outfile

would also work.


If you want the output file to be in UCS-2LE too, and not just convert between encodings:

$ perl -Mopen="IO,:encoding(UCS-2LE),:std" -pe 's/what/ever/' infile > outfile
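As a script, the search-and-replace that keeps the file in UCS-2LE could be sketched as below. The function name replace_in_ucs2le and its parameters are hypothetical, chosen only for illustration; the replacement is treated as a literal string, so backreferences such as $1 are not expanded.

```perl
use strict;
use warnings;

# Hypothetical helper: copy a UCS-2LE file, applying a regex replacement to
# every line, and write the result as UCS-2LE again.
sub replace_in_ucs2le {
    my ($in_path, $out_path, $pattern, $replacement) = @_;

    open my $in, '<:raw:encoding(UCS-2LE)', $in_path
        or die "Cannot read $in_path: $!";
    open my $out, '>:raw:encoding(UCS-2LE)', $out_path
        or die "Cannot write $out_path: $!";

    # The encoding layer decodes before readline looks for "\n", so reading
    # line by line is safe here.
    while (my $line = <$in>) {
        $line =~ s/$pattern/$replacement/g;
        print {$out} $line;
    }

    close $out or die "Cannot close $out_path: $!";
    close $in;
    return;
}

# Usage: perl replace.pl infile outfile
replace_in_ucs2le(@ARGV, qr/what/, 'ever') if @ARGV == 2;
```

A byte order mark at the start of the file is carried through as the character U+FEFF on the first line, so a pattern anchored with ^ will not match at the very beginning of the file unless it accounts for that character.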